Dell VxRail: health-check 'ism_fix' or 'rac_fix' correcting iSM and iDRAC issues
Summary: VxVerify on VxRail Manager can attempt to correct iDRAC and iSM faults by restarting iDRAC and the related VxRail node services.
Symptoms
Before running tests directly on each node with the VxVerify minion, VxVerify on VxRail Manager first queries the Dell iSM service (dcism or dellism). If iSM is not running, VxVerify can attempt an automatic correction (Autofix).
Alternatively, if iDRAC issues are found while running health-checks, the Autofix is attempted before the health-checks are retried.
If the Autofix option is enabled (either by the test profile or with the --fix argument), the correction attempt takes around 10 minutes.
The result of this auto-correction is listed as one of the following:
| Test Result | Result code | Result Interpretation |
|---|---|---|
| Pass | 0 | Correcting iSM status was either unnecessary or not enabled under the test profile. |
| Warning | 1 | Dell iSM status was running correctly after the restart. |
| Failure | 2 | Dell iSM and iDRAC were restarted, but iSM was still not running correctly afterwards. |
| Critical | 3 | This test has no critical result. |
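For scripts that post-process VxVerify results, the result codes above can be held in a simple lookup. This is a minimal sketch only: the names `RESULT_CODES` and `describe_result` are illustrative assumptions and are not part of VxVerify.

```python
# Illustrative lookup of the ism_fix / rac_fix result codes listed above.
# RESULT_CODES and describe_result are hypothetical names, not VxVerify APIs.
RESULT_CODES = {
    0: ("Pass", "Correcting iSM status was either unnecessary or not enabled under the test profile."),
    1: ("Warning", "Dell iSM status was running correctly after the restart."),
    2: ("Failure", "Dell iSM and iDRAC were restarted, but iSM was still not running correctly afterwards."),
    3: ("Critical", "This test has no critical result."),
}


def describe_result(code: int) -> str:
    """Return 'Result: interpretation' for a known code, or raise ValueError."""
    if code not in RESULT_CODES:
        raise ValueError(f"unknown result code: {code}")
    name, meaning = RESULT_CODES[code]
    return f"{name}: {meaning}"
```

Note that a Warning (code 1) here means the Autofix ran and iSM recovered, so it is the expected outcome of a successful correction.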
Tests that pass are omitted from the summary report, for ease of reading.
An example of the health-check output is shown below:
#========================#======#=========#====================================================================#==============#
| Hostname / Category |Status Dell_KB | Warnings or Failures, unless tests Passed ; Product S.N. |
#========================#======#=========#====================================================================#==============#
| _cluster | Warning 205179 | ism_fix: iSM and iDRAC fixed for node1.lab.local, node4.lab.local .|
| `` | Warning 205179 | rac_fix: iSM and iDRAC fixed for node2.lab.local |

The 'ism_fix' operation runs prior to the minions and the fix commands are run remotely from VxRM using SSH. For example:
Running VxVerify 3.21.108, pre-upgrade healthcheck on VxRail 7.0.372.
In case of program errors consult article https://www.dell.com/support/kbdoc/000066460.
Step 1: Fixing iSM issue, prior to running health-checks, on node: lab-08-esxi-01.lab.local
Step 1: Fixing iSM issue, prior to running health-checks, on node: lab-08-esxi-02.lab.local
Step 1: Stopping ISM and platform service on lab-08-esxi-01.lab.local
Step 1: Stopping ISM and platform service on lab-08-esxi-02.lab.local
Step 1: Pausing for 266 seconds more after iDRAC restarted on ['lab-08-esxi-01.lab.local', 'lab-08-esxi-02.lab.local'] ...
Step 1: Starting iSM on lab-08-esxi-01.lab.local
Step 1: Starting iSM on lab-08-esxi-02.lab.local
Step 1: Pausing for 84 seconds more after Dell iSM started on ['lab-08-esxi-01.lab.local', 'lab-08-esxi-02.lab.local'] ...
Step 1: Starting Platform service on lab-08-esxi-01.lab.local
Step 1: Starting Platform service on lab-08-esxi-02.lab.local

The Autofix can also be seen in the vxv.log prior to the minion_run events:
2022-11-11 09:51:26-INFO [ism_fix] Fixing phase 1 Dell ISM on node on lab-08-esxi-01.lab.local
2022-11-11 09:51:31-INFO [ism_fix] lab-08-esxi-01.lab.local Auto-fix continuing with vSAN objecthealth: green
2022-11-11 09:51:32-INFO [ism_fix] iDRAC restarting on lab-08-esxi-01.lab.local: _
...
2022-11-11 09:58:58-INFO [ism_fix] Checking hosts for auto-fix success: ['lab-08-esxi-01.lab.local', 'lab-08-esxi-02.lab.local']
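When reviewing a long vxv.log, it can help to filter out only the Autofix entries. The following is a minimal sketch that assumes every entry follows the `YYYY-MM-DD HH:MM:SS-LEVEL [module] message` layout seen in the excerpt above; `LogEntry`, `LINE_RE`, `ism_fix_entries`, and `SAMPLE` are illustrative names, not part of VxVerify.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    level: str
    module: str
    message: str


# Assumed layout, inferred from the excerpt: "<date> <time>-<LEVEL> [<module>] <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})-(\w+) \[(\w+)\] (.*)$")


def ism_fix_entries(lines: Iterable[str], module: str = "ism_fix") -> Iterator[LogEntry]:
    """Yield parsed entries for one module, skipping lines that do not match the layout."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3) == module:
            stamp, level, mod, message = match.groups()
            yield LogEntry(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), level, mod, message)


# Sample lines copied from the vxv.log excerpt above.
SAMPLE = """\
2022-11-11 09:51:26-INFO [ism_fix] Fixing phase 1 Dell ISM on node on lab-08-esxi-01.lab.local
2022-11-11 09:51:31-INFO [ism_fix] lab-08-esxi-01.lab.local Auto-fix continuing with vSAN objecthealth: green
2022-11-11 09:51:32-INFO [ism_fix] iDRAC restarting on lab-08-esxi-01.lab.local: _
...
2022-11-11 09:58:58-INFO [ism_fix] Checking hosts for auto-fix success: ['lab-08-esxi-01.lab.local', 'lab-08-esxi-02.lab.local']
"""

entries = list(ism_fix_entries(SAMPLE.splitlines()))
elapsed = entries[-1].timestamp - entries[0].timestamp
print(f"{len(entries)} ism_fix entries spanning {int(elapsed.total_seconds())} seconds")
```

For the sample, the first and last entries are 452 seconds apart, which is consistent with the Autofix taking several minutes per run.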
Cause
The Autofix performs the following steps on each affected node:
- Stop services: sfcbd, dcism, PTAgent (if present) and Platform-service
- Restart iDRAC, then wait 5 minutes for iDRAC to come back online
- Start services (listed above)
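The sequence above can be sketched as an ordered plan. This is an illustration only and does not run any commands: `autofix_plan` is a hypothetical helper, the 300-second wait is the 5 minutes stated above, and starting the services in the same order in which they are listed is an assumption (the article only says they are started again).

```python
from typing import List


def autofix_plan(has_ptagent: bool, idrac_wait_seconds: int = 300) -> List[str]:
    """Return the Autofix steps as human-readable strings, in order.

    Hypothetical helper: it describes the sequence but does not execute it.
    """
    # Service names come from the list above; PTAgent is included only if present.
    services = ["sfcbd", "dcism"]
    if has_ptagent:
        services.append("PTAgent")
    services.append("Platform-service")

    plan = [f"stop {name}" for name in services]
    plan.append("restart iDRAC")
    plan.append(f"wait {idrac_wait_seconds}s for iDRAC to come back online")
    # Assumption: services are started in the same order in which they are listed.
    plan.extend(f"start {name}" for name in services)
    return plan


for step in autofix_plan(has_ptagent=False):
    print(step)
```

The key property is that every service is stopped before iDRAC restarts and none is started until the wait has finished.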
Resolution
After the Autofix, the iSM status is retested using the ‘dcism’ health-check directly on that node. Because this test polls iSM a few minutes after the Autofix, it can report a different result. If the results differ, treat the ‘dcism’ test as the more accurate indication of iSM status.
The output of the commands that start the services is recorded in the vxv.log (see article 66460: VxVerify Troubleshooting Guide).
2022-11-25 09:16:26-DEBUG [ism_fix] node-04.lab.local iSM start: _
2022-11-25 09:18:26-DEBUG [ism_fix] node-04.lab.local Platform service start: Starting Platform Service Daemon. Check hostd status. hostd is ready. Platform Service started.
2022-11-25 09:18:26-INFO [ism_fix] Checking hosts for auto-fix success: ['node-04.lab.local']
2022-11-25 09:18:26-INFO [ism_check] Querying DC or Dell ISM status on host
2022-11-25 09:18:26-INFO [ism_check] iSM status on node-04.lab.local : iSM is active (running)
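To confirm the outcome for several nodes at once, the final `[ism_check]` status lines can be collected per host. This is a rough sketch that assumes the status line always has the wording shown in the excerpt above; `STATUS_RE` and `ism_status_by_host` are illustrative names, not part of VxVerify.

```python
import re
from typing import Dict, Iterable

# Assumed wording, taken from the excerpt: "[ism_check] iSM status on <host> : <status>"
STATUS_RE = re.compile(r"\[ism_check\] iSM status on (\S+) : (.+)$")


def ism_status_by_host(lines: Iterable[str]) -> Dict[str, str]:
    """Map each host to the last iSM status reported for it in the log lines."""
    statuses: Dict[str, str] = {}
    for line in lines:
        match = STATUS_RE.search(line.rstrip())
        if match:
            statuses[match.group(1)] = match.group(2).strip()
    return statuses


sample = [
    "2022-11-25 09:18:26-INFO [ism_check] Querying DC or Dell ISM status on host",
    "2022-11-25 09:18:26-INFO [ism_check] iSM status on node-04.lab.local : iSM is active (running)",
]
print(ism_status_by_host(sample))
```

Later lines overwrite earlier ones for the same host, so the result reflects the most recent check in the log.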
If iSM cannot be fixed by the steps above, which the health-check can run automatically, then see article: Dell VxRail: Node health-check fails for test 'dcism'
Additional Information
Force use of ism_fix (iDRAC restart)
The Autofix runs if 'dcism' or 'dellism' is not running when queried from VxRM. However, this only applies if the test profile or the --fix argument enables the Autofix.
Alternatively, an iDRAC restart may be recommended to address other issues, in which case the Autofix can be forced with a VxVerify argument.
This is a safer way to recover iDRAC communication than a restart directly from the iDRAC UI, because VxVerify shuts down iSM and the related services before restarting iDRAC, and then brings the services back up in the correct order afterwards.
The override argument can request a staggered iDRAC restart either for all nodes or for a list of specific nodes.
To apply the fix to nodes even if iSM is running normally (this restarts iDRAC and the related services):
- Either force the iSM and iDRAC restart procedure ('ism_fix') on all nodes:
./vxverify.sh -a ism_fix=all
- Or apply 'ism_fix' to a comma-separated list of specific nodes, with no spaces (either short or fully qualified names work):
python vxverify3.pyc <any_other_arguments> -a ism_fix=lab-08-esxi-01,lab-08-esxi-02
The examples above show the Shell and Python methods of running VxVerify, but the arguments work with either syntax.
The -a argument (--additional-params) accepts an unlimited number of argument pairs, so it must come after all other standard arguments, such as --verbose.
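When the command is assembled by a script, both rules (no spaces in the node list, and -a placed last) are easy to enforce. The following is a minimal sketch using the Shell form of the command; `build_vxverify_args` is a hypothetical helper and not part of VxVerify.

```python
from typing import List, Sequence, Union


def build_vxverify_args(
    targets: Union[str, Sequence[str]],
    standard_args: Sequence[str] = (),
) -> List[str]:
    """Build an argument list with '-a ism_fix=...' placed after all other arguments.

    targets is either the string "all" or a sequence of node names
    (short or fully qualified). Hypothetical helper for illustration.
    """
    if isinstance(targets, str):
        targets = [targets]
    nodes = [name.strip() for name in targets]
    # The node list must be comma-separated with no spaces.
    if not nodes or any(not name or " " in name or "," in name for name in nodes):
        raise ValueError("node names must be non-empty and contain no spaces or commas")
    return ["./vxverify.sh", *standard_args, "-a", "ism_fix=" + ",".join(nodes)]


print(" ".join(build_vxverify_args(["lab-08-esxi-01", "lab-08-esxi-02"], ["--verbose"])))
```

Returning a list rather than a single string means the result can be passed to a process launcher such as Python's `subprocess.run` without shell quoting concerns.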
When this argument is used, the override can be seen in the vxv.log as follows:
INFO [ism_fix] Running fix for Dell ISM on node: lab-08-esxi-01, due to override argument: lab-08-esxi-01.lab.local,lab-08-esxi-02.lab.local
or
INFO [ism_fix] Running fix for Dell ISM on node: lab-08-esxi-02, due to override argument: all