PowerEdge: AMD erratum 1474 description
Summary: According to AMD erratum 1474, an AMD CPU core may stop responding after about 1044 days of uptime.
Symptoms
This issue affects the AMD EPYC™ 7002 Series (Rome). See the table of CPUs below for reference.
A core fails to exit core-C6 (CC6) sleep state around 1044 days after the last reboot.
The time of failure varies depending on the spread spectrum and REFCLK frequency.
The following symptoms are not exhaustive, but may help to identify the issue:
- On Windows, the system stops responding and displays a blue screen with Bug Check 0x101
- On Linux, there are no obvious symptoms
- Uptime is beyond 1044 days. This condition is the major indicator of the AMD erratum 1474 issue.
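Because uptime is the main indicator, a Linux host can be checked with a short script. The following is a minimal sketch, not an official Dell or AMD tool. It assumes the standard Linux `/proc/uptime` file is available. The 30-day warning margin and the function names are illustrative choices that do not come from the erratum.

```python
from pathlib import Path

# Approximate threshold stated in this article; the exact time of failure
# varies with spread spectrum and REFCLK frequency.
ERRATUM_UPTIME_DAYS = 1044
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into days of uptime."""
    # /proc/uptime holds two numbers: seconds since boot, then idle seconds.
    uptime_seconds = float(proc_uptime_text.split()[0])
    return uptime_seconds / SECONDS_PER_DAY


def read_uptime_days(path: str = "/proc/uptime") -> float:
    """Read the uptime of the local Linux host in days."""
    return uptime_days(Path(path).read_text())


def at_risk(days: float, margin_days: float = 30.0) -> bool:
    """True once uptime is within margin_days of the threshold, or past it.

    margin_days is a hypothetical safety margin, not an AMD-specified value.
    """
    return days >= ERRATUM_UPTIME_DAYS - margin_days
```

On a Linux host, `at_risk(read_uptime_days())` returns `True` when the system is approaching or has passed the 1044-day mark.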
| DPN | Model Name |
|-------|-------|
| YVKJ6 | 7742 |
| C59HD | 7642 |
| 8JWMD | 7542 |
| 5PG5C | 7702 |
| 835TD | 7702P |
| 3J0XY | 7552 |
| FG4GY | 7502 |
| 3NFJT | 7502P |
| 542T2 | 7402 |
| YK5KC | 7402P |
| Y96PT | 7452 |
| F9NJ5 | 7352 |
| V99P3 | 7302 |
| 3425F | 7302P |
| XPY7D | 7262 |
| J1X8V | 7282 |
| XJG06 | 7252 |
| DH26K | 7232P |
| V0K1X | 7272 |
| GX27F | 7662 |
| CPHXD | 7532 |
| P5HDY | 7F72 |
| HVVJX | 7F52 |
| PDC7R | 7F32 |
| MTHGK | 7H12 |
Cause
AMD has published this erratum in the document linked below (page 55). The purpose of this PSQN is to remind TS and customers that when a system encounters a hang-like issue after about 1044 days of uptime, the root cause may be the condition that this AMD erratum describes.
https://www.amd.com/system/files/TechDocs/56323-PUB_1.01.pdf
Resolution
There are two workarounds:
- Option 1: Disable C-states in the BIOS to prevent CPU cores from entering the CC6 state.
- Option 2: Reboot the system before it has an uptime of 1044 days. This could be a warm or cold reboot.
If the system has already stopped responding with an uptime above 1044 days, a single reboot works around the issue.
The reboot resets the counter, so a further reboot must occur within the subsequent 1044 days.
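For planning Option 2, the latest acceptable reboot date can be derived from the last boot time. The sketch below is illustrative only. The 30-day safety margin and the function names are assumptions, and 1044 days is the approximate figure from this article, not an exact limit.

```python
from datetime import datetime, timedelta, timezone

# Approximate threshold stated in this article.
ERRATUM_UPTIME_DAYS = 1044


def reboot_deadline(last_boot: datetime, margin_days: int = 30) -> datetime:
    """Latest time to reboot, leaving margin_days before the threshold.

    margin_days is a hypothetical buffer chosen for illustration.
    """
    return last_boot + timedelta(days=ERRATUM_UPTIME_DAYS - margin_days)


def days_remaining(last_boot: datetime, now: datetime,
                   margin_days: int = 30) -> float:
    """Days left until the reboot deadline (negative if already past it)."""
    return (reboot_deadline(last_boot, margin_days) - now) / timedelta(days=1)


# Example with a made-up boot time; substitute the real last-boot timestamp.
example_boot = datetime(2020, 1, 1, tzinfo=timezone.utc)
example_deadline = reboot_deadline(example_boot)
```

Each reboot restarts the count, so the deadline has to be recomputed from the new boot time afterwards.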