Data Domain: Reboot Loop After Controller Upgrade - Out of Memory and No Killable Processes
Summary: Following a controller upgrade, the system reboots within 5 minutes of the file system (FS) being enabled. An invalid registry setting causes an Out of Memory (OOM) condition, which ends in a kernel panic. To resolve the issue, remove the 'system.MEM_HUGETLB=FALSE' registry key and reboot the system.
Symptoms
- The DD keeps rebooting after a controller upgrade; once the file system (FS) is enabled, the DD reboots within 5 minutes.
- Disabling the FS stops the reboot loop (kernel panic).
- Kernel Panic messages are present in the logs.
- Out of Memory errors are present in the logs.
- In kern.info: 'Kernel panic - not syncing: Out of memory and no killable processes'
kern.info shows 'Out of memory' errors:
Aug 26 14:45:05 xxxx kernel: [ 1332.027261] (E4)Out of memory: Kill process 4769 (java) score 1 or sacrifice child
Aug 26 14:45:05 xxxx kernel: [ 1332.044831] (E4)Out of memory: Kill process 22332 (sms) score 0 or sacrifice child
Aug 26 14:45:08 xxxx kernel: [ 1335.305280] (E4)Out of memory: Kill process 6131 (sms) score 0 or sacrifice child
Aug 26 14:45:08 xxxx kernel: [ 1335.321218] (E4)Out of memory: Kill process 5647 (lwsmd) score 0 or sacrifice child
Aug 26 14:45:08 xxxx kernel: [ 1335.324153] (E4)Out of memory: Kill process 22442 (dd_usm) score 0 or sacrifice child
Aug 26 14:45:08 xxxx kernel: [ 1335.325088] (E4)Out of memory: Kill process 25402 (dd_ha_vol-ha_li) score 0 or sacrifice child
Aug 26 14:45:08 xxxx kernel: [ 1335.326060] (E4)Out of memory: Kill process 22459 (csmd) score 0 or sacrifice child
Aug 26 14:45:12 xxxx kernel: [ 1338.519181] (E4)Out of memory: Kill process 6415 (lwsmd) score 0 or sacrifice child
Aug 26 14:45:12 xxxx kernel: [ 1338.522521] (E4)Out of memory: Kill process 6412 (sms) score 0 or sacrifice child
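When kern.info contains many such entries, a short script can tally which processes the OOM killer targeted and whether the panic message appears. The following is a minimal Python sketch, assuming the log lines follow the format shown in the excerpt above. The function name summarize_oom and the sample list are illustrative and are not part of any Data Domain tooling.

```python
import re
from collections import Counter

# Pattern for OOM-killer lines in the format shown in the kern.info excerpt, e.g.
# "... kernel: [ 1332.027261] (E4)Out of memory: Kill process 4769 (java) score 1 or sacrifice child"
OOM_KILL = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s*(?:\(E\d+\))?\s*Out of memory: "
    r"Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)

# Panic message quoted in the Symptoms section.
PANIC = "Kernel panic - not syncing: Out of memory and no killable processes"

# Three lines copied from the excerpt, used only to demonstrate the function.
SAMPLE = [
    "Aug 26 14:45:05 xxxx kernel: [ 1332.027261] (E4)Out of memory: Kill process 4769 (java) score 1 or sacrifice child",
    "Aug 26 14:45:05 xxxx kernel: [ 1332.044831] (E4)Out of memory: Kill process 22332 (sms) score 0 or sacrifice child",
    "Aug 26 14:45:08 xxxx kernel: [ 1335.305280] (E4)Out of memory: Kill process 6131 (sms) score 0 or sacrifice child",
]


def summarize_oom(lines):
    """Return (kills per process name, whether the OOM panic message was seen)."""
    kills = Counter()
    panicked = False
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            kills[match.group("name")] += 1
        elif PANIC in line:
            panicked = True
    return kills, panicked


if __name__ == "__main__":
    kills, panicked = summarize_oom(SAMPLE)
    for name, count in kills.most_common():
        print(f"{name}: {count} OOM kill(s)")
    print("OOM kernel panic seen:", panicked)
```

To run it against a real log, pass the lines of a kern.info file extracted from a support bundle instead of SAMPLE. Many different processes being killed with a score of 0 or 1, as in the excerpt, is consistent with system-wide memory exhaustion rather than one runaway process.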
Cause
The system keeps rebooting after a controller upgrade (for example, from DD9300 to DD9900). Disabling the file system stops the DD from rebooting in a loop.
The kernel logs show multiple Out of Memory (OOM) errors, which trigger the kernel panic and subsequent reboots.
The root cause is that the system does not have enough available memory to function properly. Possible reasons include:
- A memory leak in the system software
- Insufficient memory allocated to specific processes or services
- An incorrect system configuration that leads to excessive memory usage
- Hardware issues, such as faulty memory modules or other components
- An invalid registry key, system.MEM_HUGETLB=FALSE, which Support must remove
Further investigation is required to identify the exact cause of the memory exhaustion and address it.
Review the system logs and error messages to identify any processes or services that consume excessive memory and trigger the OOM errors.
Checking the system's memory usage and configuration can also reveal misconfigurations or hardware issues that contribute to the problem.
For example, missing or misplaced DIMMs can result in an unsupported configuration, which prevents the FS from starting.
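As one way to review memory configuration, the sketch below parses text in the standard Linux /proc/meminfo format and reports the total memory and the share reserved for huge pages. It assumes such output is available, for instance from a support bundle or a session provided by Support. This article does not state how system.MEM_HUGETLB affects these counters, so treat the result as a data point for investigation only. The function names and the sample values are illustrative.

```python
def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo-style text (unit suffix dropped)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def hugepage_share(fields):
    """Fraction of MemTotal reserved for huge pages (0.0 if MemTotal is missing)."""
    total_kb = fields.get("MemTotal", 0)
    huge_kb = fields.get("HugePages_Total", 0) * fields.get("Hugepagesize", 0)
    return huge_kb / total_kb if total_kb else 0.0


# Illustrative values only; not taken from a real Data Domain system.
SAMPLE_MEMINFO = """\
MemTotal:        1048576 kB
MemFree:          102400 kB
HugePages_Total:     256
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE_MEMINFO)
    print("MemTotal (kB):", info.get("MemTotal"))
    print("Huge page share:", hugepage_share(info))
```

A MemTotal value well below the memory expected for the new controller may point to missing or misplaced DIMMs.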
Resolution
- Check the system logs for error messages or warnings related to memory usage or system misconfiguration, and address them.
- If the registry key system.MEM_HUGETLB=FALSE is set, Support must remove it; reboot the system afterward.
- If the issue persists, contact Dell Support for further assistance. Provide the relevant system logs and diagnostic information to help troubleshoot the issue.
- Upload the support bundle and any relevant core or kernel dump files.