IDPA: IDPA deployment for DP8800 fails at 63% with error "Failed: Configuring Protection Storage. Error: Failed to enable file system of data domain after reboot"
Summary: This issue is seen on IDPA DP8800 appliances running code 2.3, which ships with Data Domain version 6.2.0.5. A bug in this Data Domain version raises false-positive alerts about a failed storage processor, which causes the Data Domain file system to go down. ...
Symptoms
The ACM UI shows the following Error for failed deployment:
The Diagnostic report shows the following error:
The Data Domain deployment reaches 98%, and then the Data Domain reboots as part of the workflow.
After this reboot, the Data Domain file system does not come up.
There may be hardware errors seen on the Data Domain regarding Storage Processors:
Id Post Time Severity Class Object Message
----- ------------------------ -------- --------------- ----------- -------------------------------------------------------
p0-67 Mon Jan 6 12:15:07 2020 CRITICAL HardwareFailure Enclosure=1 EVT-ENVIRONMENT-00032: The storage processor has failed
----- ------------------------ -------- --------------- ----------- -------------------------------------------------------
There is 1 active alert.
The ACM server.log shows the following error message:
2020-01-06 18:43:16,694 INFO [pool-67-thread-3]-util.SSHUtil: STDERR : []
2020-01-06 18:43:16,694 INFO [pool-67-thread-3]-util.SSHUtil: Successfully executed remote command using SSH.
2020-01-06 18:43:16,694 INFO [pool-67-thread-3]-ddadapter.ConfigDataDomainTask: Successfully executed: filesys status
2020-01-06 18:43:16,694 ERROR [pool-67-thread-3]-ddadapter.ConfigDataDomainTask: File system is not enabled or not running after rebooting
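When the ACM server.log is large, it can help to search it for this specific error instead of reading it by eye. The following is a minimal sketch in Python. The line layout in the regular expression is an assumption inferred from the excerpt above, and the function name find_fs_enable_failures is hypothetical, not part of any IDPA tooling.

```python
import re

# Assumed layout, inferred from the excerpt above:
# "<date> <time>,<ms> <LEVEL> [<thread>]-<component>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]-(?P<component>[^:]+):\s*(?P<msg>.*)$"
)

FS_ERROR = "File system is not enabled or not running after rebooting"


def find_fs_enable_failures(lines):
    """Return (timestamp, thread) for each ERROR line that carries FS_ERROR."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "ERROR" and FS_ERROR in m.group("msg"):
            hits.append((m.group("ts"), m.group("thread")))
    return hits


# Two lines copied from the excerpt above.
sample = [
    "2020-01-06 18:43:16,694 INFO [pool-67-thread-3]-ddadapter.ConfigDataDomainTask: "
    "Successfully executed: filesys status",
    "2020-01-06 18:43:16,694 ERROR [pool-67-thread-3]-ddadapter.ConfigDataDomainTask: "
    "File system is not enabled or not running after rebooting",
]
print(find_fs_enable_failures(sample))
```

To run it against a real log, pass an open file object instead of the sample list. Lines that do not match the assumed layout are skipped.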
Cause
Affected IDPA Component: Data Domain
The Data Domain on the IDPA DP8800 appliance experiences memory subsystem and SP (storage processor) failure alerts, which lead to a system reboot. In most instances, the DD9800 that is part of the IDPA DP8800 appliance produces a combination of hardware alert messages that can falsely indict multiple hardware components. This article is intended to help troubleshoot the issue when it occurs and to apply a workaround that disables the background memory scrubbing function to prevent random reboots. It is typical to see a combination of the following hardware alerts, which tend to come up after system reboots:
Id Post Time Severity Class Object Message
------ ------------------------ -------- --------------- ---------------------------- -----------------------------------------------------------------------------------------
p0-279 Thu Mar 15 14:27:33 2018 CRITICAL HardwareFailure Enclosure=1:Slot=0 EVT-ENVIRONMENT-00029: I/O module has failed
p0-280 Thu Mar 15 14:27:52 2018 CRITICAL HardwareFailure Enclosure=1:Slot=1 EVT-ENVIRONMENT-00029: I/O module has failed
p0-281 Thu Mar 15 14:27:54 2018 CRITICAL HardwareFailure Enclosure=1:Slot=2 EVT-ENVIRONMENT-00029: I/O module has failed
p0-282 Thu Mar 15 14:27:55 2018 CRITICAL HardwareFailure Enclosure=1:Slot=3 EVT-ENVIRONMENT-00029: I/O module has failed
p0-283 Thu Mar 15 14:27:56 2018 CRITICAL HardwareFailure Enclosure=1:Slot=4 EVT-ENVIRONMENT-00029: I/O module has failed
p0-284 Thu Mar 15 14:27:57 2018 CRITICAL HardwareFailure Enclosure=1:Slot=5 EVT-ENVIRONMENT-00029: I/O module has failed
p0-285 Thu Mar 15 14:27:59 2018 CRITICAL HardwareFailure Enclosure=1:Slot=6 EVT-ENVIRONMENT-00029: I/O module has failed
p0-286 Thu Mar 15 14:28:00 2018 CRITICAL HardwareFailure Enclosure=1:Slot=7 EVT-ENVIRONMENT-00029: I/O module has failed
p0-287 Thu Mar 15 14:28:01 2018 CRITICAL HardwareFailure Enclosure=1:Slot=8 EVT-ENVIRONMENT-00029: I/O module has failed
p0-288 Thu Mar 15 14:28:02 2018 CRITICAL HardwareFailure Enclosure=1:Slot=10 EVT-ENVIRONMENT-00029: I/O module has failed
p0-289 Thu Mar 15 14:28:04 2018 CRITICAL HardwareFailure Enclosure=1:DIMM=MR4 DIMM A1 EVT-DIMM-00003: A memory card has failed
p0-290 Thu Mar 15 14:28:05 2018 CRITICAL HardwareFailure Enclosure=1:Riser=4 EVT-ENVIRONMENT-00044: Memory riser fault has been detected
m0-30 Thu Mar 15 06:48:33 2018 WARNING Filesystem EVT-GC-00002: Unable to start scheduled file system cleaning on Thu Mar 15 06:01:00 2018.
p0-154 Thu Mar 15 15:49:08 2017 INFO Filesystem EVT-FILESYS-00012: System rebooted
p0-318 Fri Mar 16 09:47:16 2018 CRITICAL HardwareFailure Enclosure=1:Riser=4 EVT-ENVIRONMENT-00044: Memory riser fault has been detected
p0-277 Thu Mar 15 12:04:47 2018 CRITICAL HardwareFailure Enclosure=1 EVT-ENVIRONMENT-00032: The storage processor has failed "voltage is faulty"
------ ------------------------ -------- --------------- ---------------------------- -----------------------------------------------------------------------------------------
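To see at a glance which alert types make up such a combination, the event IDs in the alert listing can be tallied. The sketch below is illustrative only: it assumes that each alert line contains one ID of the form EVT-<CLASS>-<number>, as in the listing above, and the function name count_events is hypothetical.

```python
import re
from collections import Counter

# Event IDs as they appear in the listing above, e.g. EVT-ENVIRONMENT-00029.
EVENT_ID = re.compile(r"\bEVT-[A-Z]+-\d+\b")


def count_events(alert_lines):
    """Count how often each event ID occurs in the given alert lines."""
    counts = Counter()
    for line in alert_lines:
        m = EVENT_ID.search(line)
        if m:
            counts[m.group(0)] += 1
    return counts


# Three lines copied from the listing above.
sample = [
    "p0-279 Thu Mar 15 14:27:33 2018 CRITICAL HardwareFailure Enclosure=1:Slot=0 "
    "EVT-ENVIRONMENT-00029: I/O module has failed",
    "p0-280 Thu Mar 15 14:27:52 2018 CRITICAL HardwareFailure Enclosure=1:Slot=1 "
    "EVT-ENVIRONMENT-00029: I/O module has failed",
    "p0-290 Thu Mar 15 14:28:05 2018 CRITICAL HardwareFailure Enclosure=1:Riser=4 "
    "EVT-ENVIRONMENT-00044: Memory riser fault has been detected",
]
for event_id, n in count_events(sample).most_common():
    print(event_id, n)
```

Header and separator lines contain no event ID and are ignored, so the saved alert output can be passed in unchanged.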
The bios.txt log on Data Domain shows the following errors:
2 | 03/15/2018 | 05:33:41 | CPU Status Events CPU2_Status | CPU IERR | Asserted | CPU External IERR
3 | 03/15/2018 | 05:33:41 | Entering IERR Interrupt Events Enter_SMI | IERR Interrupt | Asserted | Used AUX Log (LSB 0x24) Used AUX Log (MSB 0x0)
4 | 03/15/2018 | 05:33:42 | BMC Chassis Ctrl Events BMC_Chassis_Ctrl | Reset through BMC | Asserted
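The pattern in these bios.txt entries is a CPU IERR event followed by a reset through the BMC. A short Python sketch can check a saved log for that sequence. It assumes the pipe-delimited field order shown above (record number, date, time, sensor, event, state), and the names parse_sel_line and ierr_followed_by_reset are hypothetical helpers.

```python
def parse_sel_line(line):
    """Split one pipe-delimited event line into named fields, or return None."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 6 or not fields[0].isdigit():
        return None  # not an event record in the assumed layout
    return {
        "id": int(fields[0]),
        "date": fields[1],
        "time": fields[2],
        "sensor": fields[3],
        "event": fields[4],
        "state": fields[5],
    }


def ierr_followed_by_reset(lines):
    """Return True if an asserted CPU IERR is later followed by a BMC reset."""
    seen_ierr = False
    for line in lines:
        entry = parse_sel_line(line)
        if entry is None or entry["state"] != "Asserted":
            continue
        if entry["event"] == "CPU IERR":
            seen_ierr = True
        elif entry["event"] == "Reset through BMC" and seen_ierr:
            return True
    return False


# The three entries shown above.
sample = [
    "2 | 03/15/2018 | 05:33:41 | CPU Status Events CPU2_Status | CPU IERR | Asserted | CPU External IERR",
    "3 | 03/15/2018 | 05:33:41 | Entering IERR Interrupt Events Enter_SMI | IERR Interrupt | Asserted | "
    "Used AUX Log (LSB 0x24) Used AUX Log (MSB 0x0)",
    "4 | 03/15/2018 | 05:33:42 | BMC Chassis Ctrl Events BMC_Chassis_Ctrl | Reset through BMC | Asserted",
]
print(ierr_followed_by_reset(sample))
```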
To troubleshoot this scenario, focus on the CPU Patrol Scrub function on the IDPA Protection Storage (Data Domain). This function can incorrectly report that memory DIMMs are faulty and can also indict the wrong DIMM.
SP (storage processor) replacements, Mech replacements, and mass memory replacements have all proven unnecessary in resolving this problem.
Resolution
Method 1:
1. Disable Patrol Scrub on the Data Domain system that is part of the IDPA (see the Workaround under Additional Information for steps).
2. Clear the indict list on the Data Domain:
se indict remove <id>
3. Clear the Data Domain active alerts related to hardware failures:
alerts clear <alert-id>
4. Perform a system reboot to confirm that the Data Domain file system comes up.
5. If the Data Domain file system comes up clean in step 4, click 'Retry' on the ACM UI to retry the deployment.
Method 2:
1. Upgrade to the fixed DDOS version 6.2.0.30.
2. Click 'Retry' on the ACM UI to retry the IDPA deployment.
Additional Information
Workaround:
The DDOS directory /ddr/firmware/JUPITER contains a utility to flash the DD9800 BIOS configuration settings.
Use the following command to dump the current BIOS settings into an ASCII file:
./SCELNX_64 /o /s ./ps_enabled.txt
This generates a new text file named ps_enabled.txt.
Open ps_enabled.txt in the vi text editor and search for the word "Patrol".
Note: The asterisk (*) in Options =*[01]Enable means that the Patrol Scrub function is enabled.
Setup Question = Patrol Scrub
Options =*[01]Enable // Move "*" to the desired Option
[00]Disable
*******************************
Using the vi text editor, change the Patrol Scrub setting to Disable as shown below: delete the asterisk next to Enable and enter an asterisk next to Disable.
Setup Question = Patrol Scrub
Options =[01]Enable // Move "*" to the desired Option
*[00]Disable
Write and save (:wq) the changes to a new file named ps_disabled.txt
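The manual edit moves the asterisk from one option line to another, which is easy to get wrong by hand. As an illustration only, the same change can be sketched in Python. The sketch assumes the exported file uses the "Setup Question = ..." and "Options =" layout shown above, and the function name select_option is hypothetical. Review the result against the original file before importing anything into BIOS.

```python
import re

# An option line: optional "Options =" prefix, optional "*" marker, then "[NN]Label".
OPTION = re.compile(
    r"^(?P<prefix>\s*(?:Options\s*=)?\s*)(?P<star>\*?)(?P<body>\[[0-9A-Fa-f]+\].*)$"
)


def select_option(text, question, label):
    """Return text with the "*" marker moved to `label` under `question`."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines)
         if line.strip().startswith("Setup Question")
         and line.split("=", 1)[-1].strip() == question),
        None,
    )
    if start is None:
        raise ValueError(f"setup question not found: {question}")

    options = []  # (line index, regex match, option label)
    for i in range(start + 1, len(lines)):
        if lines[i].strip().startswith("Setup Question"):
            break  # the next setting begins here
        m = OPTION.match(lines[i])
        if m and (options or "Options" in m.group("prefix")):
            name = m.group("body").split("]", 1)[1].split("//", 1)[0].strip()
            options.append((i, m, name))
        elif options:
            break  # the option list has ended

    if label not in [name for _, _, name in options]:
        raise ValueError(f"option not found: {label}")

    # Rewrite every option line so that only the requested one carries "*".
    for i, m, name in options:
        star = "*" if name == label else ""
        lines[i] = m.group("prefix") + star + m.group("body")
    return "\n".join(lines)  # note: a trailing newline is not preserved


sample = (
    "Setup Question = Patrol Scrub\n"
    'Options =*[01]Enable // Move "*" to the desired Option\n'
    "[00]Disable"
)
print(select_option(sample, "Patrol Scrub", "Disable"))
```

The function refuses to change anything if the setting or the requested option is not found, so a mistyped name cannot leave the setting without a selected option.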
*******************************
The folder /ddr/firmware/JUPITER should now contain two BIOS configuration files:
ps_enabled.txt
ps_disabled.txt
*******************************
Load the edited config file ps_disabled.txt into BIOS:
./SCELNX_64 /i /s ./ps_disabled.txt
Note: The following errors may be seen and they can be ignored:
Example:
!!!!xxxxxxxxYOUR DATA IS IN DANGER !!!! # ./SCELNX_64 /i /s ./ps_disabled.txt
----------------------------------------------------------------------------
| Copyright (c)2014 American Megatrends, Inc. |
| AMISCE Utility. Ver 5.01.1073 |
----------------------------------------------------------------------------
Warning in line 23600
Missing Current Setting "*"
WARNING : Length of string for control (User Name) not updated as the value/defaults specified in the script file doesn't reach the minimum range (1).
WARNING : Length of string for control (User Name) not updated as the value/defaults specified in the script file doesn't reach the minimum range (1).
WARNING : Length of string for control (User Name) not updated as the value/defaults specified in the script file doesn't reach the minimum range (1).
WARNING : Error in writing variable PNP0501_0_NV to NVRAM
WARNING : Error in writing variable SecureBootSetup to NVRAM
Import completed with some errors, see warnings given.
*******************************
Reboot the system with the DDOS command:
#system reboot
This will force the new BIOS configuration settings into BIOS during the reboot.
*******************************
Once the system has rebooted, verify from the BASH shell that the Patrol Scrub setting has changed to disabled:
./SCELNX_64 /o /s ./bios_config.txt
Using the vi text editor, search the file 'bios_config.txt' and verify that the Patrol Scrub function kept the asterisk next to Disable after the reboot.
Setup Question = Patrol Scrub
Options =[01]Enable // Move "*" to the desired Option
*[00]Disable
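The same check can be done without opening an editor. The following Python sketch reports which option carries the asterisk for a given setting. It assumes the "Setup Question = ..." and option-line layout shown above, and the function name current_setting is hypothetical.

```python
import re

# A selected option: "*" directly before "[NN]", then the label up to any "//" comment.
STARRED = re.compile(r"\*\[[0-9A-Fa-f]+\]\s*([^/\r\n]*)")


def current_setting(text, question):
    """Return the label of the option marked "*" under `question`, or None."""
    in_question = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Setup Question"):
            if in_question:
                break  # reached the next setting without finding a marker
            in_question = stripped.split("=", 1)[-1].strip() == question
        elif in_question:
            m = STARRED.search(stripped)
            if m:
                return m.group(1).strip()
    return None


# The block shown above, as it should appear after the reboot.
sample = (
    "Setup Question = Patrol Scrub\n"
    'Options =[01]Enable // Move "*" to the desired Option\n'
    "*[00]Disable\n"
)
print(current_setting(sample, "Patrol Scrub"))
```

For a real check, read the contents of bios_config.txt into a string and pass it as the first argument. A return value of None means that the setting or its marker was not found in the assumed layout.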
With this workaround applied, the Patrol Scrub function is permanently disabled. Future OS and BIOS versions will incorporate this change automatically.
Upgrading to a newer IDPA version will be seamless and does not require removal of this workaround.