VxRail: Node triggers high inlet temperature alert
Summary: A VxRail node reports high inlet temperature alerts. This is usually caused by an environmental factor, such as an air-conditioning problem.
Symptoms
The VxRail node logs these alerts in the Lifecycle Controller log:
2024-06-03 02:18:00 2586 TMPS0103 Inlet temperature is above critical level for extended duration.
2024-05-07 08:41:37 355 TMP0121 The system inlet temperature is greater than the upper critical threshold.
The event log contains the matching entries:
2024-05-07 04:49:36 7 The system inlet temperature is within range.
2024-05-07 04:47:19 6 The system inlet temperature is greater than the upper warning threshold.
2024-05-06 19:41:37 5 The system inlet temperature is greater than the upper critical threshold.
2024-05-06 19:12:49 4 The system inlet temperature is greater than the upper warning threshold.
While a critical temperature event is active, the server automatically runs in a degraded mode. If the condition persists for an extended time, the server shuts down.
The iDRAC screenshot shows the temperature reading for the CPU or system board along with the warning and critical thresholds: 38 °C for warning and 42 °C for critical.
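As a rough illustration, the event log entries listed above can be parsed to see how long the inlet temperature stayed out of range. This is a minimal sketch that assumes each entry follows the "date time sequence message" layout shown in the sample; the function names are hypothetical and not part of any iDRAC tooling.

```python
from datetime import datetime

# Sample entries copied from the event log listing above.
LOG = """\
2024-05-07 04:49:36 7 The system inlet temperature is within range.
2024-05-07 04:47:19 6 The system inlet temperature is greater than the upper warning threshold.
2024-05-06 19:41:37 5 The system inlet temperature is greater than the upper critical threshold.
2024-05-06 19:12:49 4 The system inlet temperature is greater than the upper warning threshold.
"""


def parse_entry(line):
    """Split 'YYYY-MM-DD HH:MM:SS <seq> <message>' into its three fields."""
    date, time, seq, message = line.split(" ", 3)
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return stamp, int(seq), message.strip()


def out_of_range_duration(log_text):
    """Time from the first out-of-range entry to the next 'within range' entry."""
    # Sort oldest first; the log above is listed newest first.
    entries = sorted(parse_entry(l) for l in log_text.splitlines() if l.strip())
    start = None
    for stamp, _seq, message in entries:
        if "within range" in message:
            if start is not None:
                return stamp - start
        elif start is None:
            start = stamp  # first warning or critical entry
    return None  # the log never shows a complete excursion


print(out_of_range_duration(LOG))  # prints 9:36:47
```

For the sample entries, the inlet temperature was above the warning threshold from 19:12:49 on 2024-05-06 until 04:49:36 the next morning.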
Cause
This is caused by poor ventilation in the environment, which raises the temperature of the VxRail node. When the fan module cannot adjust its speed enough to cool the internal components, the thermal event causes the server to run in a degraded mode and eventually shut down to avoid hardware damage. This behavior depends on the temperature alert settings in the iDRAC.
High inlet temperature: If no temperature alert action is set and the inlet temperature stays at 42 degrees or above for an extended time, the server first runs in degraded mode and uses the fan module to try to cool down. If the temperature remains high after an extended time, the server shuts down.
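The threshold logic described above can be sketched as a small classifier. This is only an illustration: the values 38 and 42 come from the example in this article, the function name is hypothetical, and treating a reading equal to a threshold as exceeding it is an assumption rather than documented iDRAC behavior.

```python
WARNING_THRESHOLD = 38   # upper warning threshold from the example in this article
CRITICAL_THRESHOLD = 42  # upper critical threshold from the example in this article


def classify_inlet(temp, warning=WARNING_THRESHOLD, critical=CRITICAL_THRESHOLD):
    """Map an inlet temperature reading to the state named in the event log."""
    # Assumption: a reading equal to the threshold counts as reaching it.
    if temp >= critical:
        return "critical"
    if temp >= warning:
        return "warning"
    return "within range"


for reading in (25, 39, 43):
    print(reading, classify_inlet(reading))
```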
Resolution
1. VxRail nodes have an internal mechanism for handling poor environmental conditions, based on the fan module and the defined warning and critical thresholds. Once the critical threshold is reached, the behavior depends on the configured alert action:
A. In the iDRAC, go to Configuration > System Settings > Alert Configuration > Alerts > Alert Configuration and expand the temperature category. If the action on the first line (critical) is Power Off, the server shuts down immediately through a CPU thermal trip when it reaches the critical temperature.
The following racadm command displays the same setting:
racadm>>racadm eventfilters get -c idrac.alert.system.TMP.critical
B. If this parameter is No Action, the iDRAC first tries to cool the system by adjusting the fan module. If the temperature stays high after an extended period, a CPU thermal trip powers off the server to avoid damage to hardware components from sustained heat.
2. To avoid high inlet temperature alerts, customers must ensure that inlet temperatures stay within range for optimum performance.
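The two outcomes described in options A and B can be summarized in a short decision sketch. This models only the behavior described in this article; the function and parameter names are hypothetical, and the length of the "extended" period is not specified in the source, so it is passed in as a simple flag.

```python
def expected_response(alert_action, critical_reached, extended_duration):
    """Return the server behavior this article describes for a given situation.

    alert_action:      "Power Off" or "No Action" (the critical temperature alert action)
    critical_reached:  True once the inlet temperature reaches the critical threshold
    extended_duration: True once the critical condition has lasted an extended time
    """
    if not critical_reached:
        return "normal operation"
    if alert_action == "Power Off":
        # Option A: shut down as soon as the critical temperature is reached.
        return "immediate shutdown"
    if extended_duration:
        # Option B, late stage: thermal trip after fans fail to cool the system.
        return "shutdown after extended duration"
    # Option B, early stage: fans try to cool the system first.
    return "degraded mode, fans attempt to cool"


print(expected_response("No Action", critical_reached=True, extended_duration=False))
```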