Data Domain General Health Check
Summary: This document provides actions that Tech Support would complete when performing a general health check on a Data Domain (DD) System. It includes general commands and outputs to help identify alerts or misconfigurations. ...
Instructions
Applies to:
- All Data Domain Operating System (DDOS) versions
- All current models
12-Step Health Check:
Step 1 - Connect to the DD system over SSH (for example, with PuTTY) as an administrative user.
Step 2 - Ensure that the Filesystem is enabled.
# system show serialno
# date
# filesys status
The filesystem is enabled and running.
Step 3 - Ensure that the DDOS version is supported for the DD model.
# system show model
# system show version
Article 81247: DDOS Software Versions
Step 4 - Any alerts that impact the health of the system must be addressed.
# alerts show current
Article 14723: Data Domain - How to Check Alerts on a Data Domain System.
Step 5 - Ensure that /data is below 90%.
To maintain expected performance levels, Data Domain recommends always keeping the 'Use%' of /data: post-comp below 90%.
# df
Example Output:
Active Tier:
Resource           Size GiB    Used GiB   Avail GiB   Use%   Cleanable GiB*
----------------   --------   ---------   ---------   ----   --------------
/data: pre-comp           -   7259347.5           -      -                -
/data: post-comp   304690.8    251252.4     53438.5    82%          51616.1
/ddvar                 29.5        12.5        15.6    44%                -
----------------   --------   ---------   ---------   ----   --------------
Article 54303: Data Domain: How to resolve capacity issues.
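The capacity check in Step 5 can be sketched as a small parser. This is a minimal illustration, not a Data Domain tool: the function names `use_percent` and `below_limit` are hypothetical, the sample text is copied from the example output in this step, and the real `df` layout may differ between DDOS versions.

```python
import re

# Sample copied from the example output in this step; real spacing may differ.
SAMPLE_DF = """\
Active Tier:
Resource           Size GiB    Used GiB   Avail GiB   Use%   Cleanable GiB*
----------------   --------   ---------   ---------   ----   --------------
/data: pre-comp           -   7259347.5           -      -                -
/data: post-comp   304690.8    251252.4     53438.5    82%          51616.1
/ddvar                 29.5        12.5        15.6    44%                -
----------------   --------   ---------   ---------   ----   --------------
"""


def use_percent(df_output, resource="/data: post-comp"):
    """Return the Use% value of the named resource row, or None if not found."""
    for line in df_output.splitlines():
        if line.strip().startswith(resource):
            match = re.search(r"(\d+)%", line)
            return int(match.group(1)) if match else None
    return None


def below_limit(df_output, limit=90):
    """True only when the post-comp Use% was found and is below the limit."""
    used = use_percent(df_output)
    return used is not None and used < limit


if __name__ == "__main__":
    print(use_percent(SAMPLE_DF), below_limit(SAMPLE_DF))  # prints: 82 True
```

Returning `None` when the row is missing keeps an unparseable output from being mistaken for a healthy one.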
Step 6a - Ensure that there are no Failed (F), Reconstructing (R), or Absent (A) disks.
# disk show state
Example Output:
sysadmin## disk show state
Enclosure   Disk
            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
---------   ------------------------------------------------
1           .  .  .  .  s  .  .  .  .  .  .  .
2           .  .  .  .  .  .  .  .  .  A  .  .  .  .  S  R
3           E  .  .  .  .  .  .  .  .  C  .  .  .  .  .  .
---------   ------------------------------------------------

Legend   State                          Count
------   ----------------------------   -----
.        In Use Disks                   25
s        Spare Disks                    1
R        Spare (reconstructing) Disks   1
C        Copy Recovery Disks            1
A        Absent Disks                   1
E        Exceeded Error Threshold
------   ----------------------------   -----
Article 21916: Data Domain - Disk State Description
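As a rough sketch of the Step 6a check, the state grid can be scanned for the F, R, and A codes. The helper name `find_problem_disks` is hypothetical, the sample is taken from the example output in this step, and the code assumes that each column position in a row corresponds to the disk number within that enclosure.

```python
# Sample copied from the example output in this step; real spacing may differ.
SAMPLE_DISK_STATE = """\
Enclosure   Disk
            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
---------   ------------------------------------------------
1           .  .  .  .  s  .  .  .  .  .  .  .
2           .  .  .  .  .  .  .  .  .  A  .  .  .  .  S  R
3           E  .  .  .  .  .  .  .  .  C  .  .  .  .  .  .
---------   ------------------------------------------------
"""

# State codes that Step 6a asks to be absent from the output.
PROBLEM_STATES = {"F": "Failed", "R": "Reconstructing", "A": "Absent"}


def find_problem_disks(state_output):
    """Return (enclosure.disk, state name) pairs for F, R, and A entries."""
    problems = []
    for line in state_output.splitlines():
        fields = line.split()
        # Data rows start with an enclosure number followed by state codes.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        states = fields[1:]
        # The disk-number header row is all digits; skip it.
        if any(state.isdigit() for state in states):
            continue
        for position, state in enumerate(states, start=1):
            if state in PROBLEM_STATES:
                problems.append((f"{fields[0]}.{position}", PROBLEM_STATES[state]))
    return problems


if __name__ == "__main__":
    for disk, state in find_problem_disks(SAMPLE_DISK_STATE):
        print(disk, state)
```

For the sample above this reports disk 2.10 as Absent and disk 2.16 as Reconstructing. Other codes, such as E or C, are deliberately left out because Step 6a names only F, R, and A.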
Step 6b - Check the disk reliability output to see whether proactive disk replacement is needed.
Ensure that no disk has a "Reallocated Sectors" count that is above 1000 or increasing daily.
# disk show reliability-data
Example Output:
Disk Show Reliability-Data
--------------------------
Disk         ATA Bus   Reallocated   Temperature
(enc/disk)   CRC Err   Sectors
----------   -------   -----------   -----------
1.1          0         0             29 C  84 F
1.2          0         0             29 C  84 F
1.3          0         0             29 C  84 F
1.4          0         0             27 C  81 F
2.1          0         0             26 C  79 F
2.2          0         0             25 C  77 F
2.3          0         0             24 C  75 F
2.4          0         0             24 C  75 F
2.5          89        0             25 C  77 F
2.6          0         0             25 C  77 F
2.7          0         3156          24 C  75 F
2.8          0         0             23 C  73 F
2.9          0         0             24 C  75 F
2.10         0         0             24 C  75 F
2.11         0         0             23 C  73 F
2.12         0         0             23 C  73 F
2.13         0         0             25 C  77 F
2.14         0         0             24 C  75 F
2.15         0         0             22 C  72 F
2.16         0         0             22 C  72 F
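The two conditions in Step 6b (a count above 1000, or a count that grows from day to day) could be checked along these lines. This is a sketch under assumptions: the function names are hypothetical, the sample rows are a subset of the example output in this step, and the daily comparison assumes that a previous day's counts were saved somewhere by the operator.

```python
import re

# A few rows copied from the example output in this step.
SAMPLE_RELIABILITY = """\
Disk         ATA Bus   Reallocated   Temperature
(enc/disk)   CRC Err   Sectors
----------   -------   -----------   -----------
2.5          89        0             25 C  77 F
2.6          0         0             25 C  77 F
2.7          0         3156          24 C  75 F
"""

# enc.disk, ATA bus CRC errors, reallocated sectors.
ROW = re.compile(r"^\s*(\d+\.\d+)\s+(\d+)\s+(\d+)\s")


def reallocated_by_disk(output):
    """Map each enc.disk identifier to its Reallocated Sectors count."""
    counts = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = int(match.group(3))
    return counts


def over_threshold(counts, threshold=1000):
    """Disks whose count is above the threshold named in Step 6b."""
    return [disk for disk, count in counts.items() if count > threshold]


def increased(previous, current):
    """Disks whose count rose compared with an earlier snapshot."""
    return [disk for disk, count in current.items()
            if disk in previous and count > previous[disk]]


if __name__ == "__main__":
    today = reallocated_by_disk(SAMPLE_RELIABILITY)
    print(over_threshold(today))  # prints: ['2.7']
```

The `increased` helper only compares two snapshots; how and where the earlier snapshot is stored is left open here.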
Step 7 - Test communications for 5 minutes on each port that has a cable connected. If the test reports an error, reseat the cables or LCCs.
# enclosure show topology
# enclosure test topology port 5 minutes
Article 35680: Data Domain: SAS Cable Configuration, Topology checks and Testing
Step 8 - System misconfiguration: If the output indicates one or more component errors, they must be addressed.
# enclosure show misconfiguration
Example Output:
Enclosure Show Misconfiguration
-------------------------------
Memory Risers: No misconfiguration found.
Memory DIMMs: No misconfiguration found.
IO Cards: No misconfiguration found.
CPUs: No misconfiguration found.
Disks: No misconfiguration found.
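A simple way to automate the Step 8 check is to confirm that every component reports "No misconfiguration found." The sketch below assumes the component name is followed by a colon and that the message may sit on the same line or on the following line; the function name `misconfigured_components` is hypothetical.

```python
# Sample copied from the example output in this step.
SAMPLE_MISCONFIG = """\
Enclosure Show Misconfiguration
-------------------------------
Memory Risers: No misconfiguration found.
Memory DIMMs: No misconfiguration found.
IO Cards: No misconfiguration found.
CPUs: No misconfiguration found.
Disks: No misconfiguration found.
"""

OK_MESSAGE = "No misconfiguration found."


def misconfigured_components(output):
    """Return the components whose message is anything other than OK_MESSAGE."""
    findings = {}
    current = None
    for line in output.splitlines():
        text = line.strip()
        # Skip blank lines and separator lines made only of dashes.
        if not text or set(text) == {"-"}:
            continue
        if ":" in text:
            name, _, rest = text.partition(":")
            current = name.strip()
            findings[current] = rest.strip()
        elif current is not None:
            # Message continued on the line after the component name.
            findings[current] = (findings[current] + " " + text).strip()
    return [name for name, message in findings.items() if message != OK_MESSAGE]


if __name__ == "__main__":
    print(misconfigured_components(SAMPLE_MISCONFIG))  # prints: []
```

Any component listed in the result needs a closer look at the full command output, since the sketch does not interpret the error text itself.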
Step 9 - If replication is configured, check for any errors. If there is an error, it must be addressed.
# replication status
Article 43349: Data Domain - Replication Status
Step 10 - If VTL is in use, check its status.
# vtl status
Article 12128: Troubleshooting Data Domain VTL Target Visibility
Step 11 - If the system is a High Availability (HA) system, check the HA status.
# ha status
Example Output:
SE@apollo-440-n1-p0(active:0)## ha status
HA System name:   apollo-440-n1.chaos.local
HA System status: highly available
Node Name                         Node id   Role      HA State
-------------------------------   -------   -------   --------
apollo-440-n1-p0.chaos.local      0         active    online
apollo-440-n1-p1.chaos.local      1         standby   online
-------------------------------   -------   -------   --------
# ha status detailed
Example Output:
SE@apollo-440-n1-p0(active:0)## ha status detailed
HA System name: apollo-440-n1.chaos.local
HA System Status: highly available
Interconnect Status: ok
Primary Heartbeat Status: ok
External LAN Heartbeat Status: not ok
Hardware compatibility check: ok
Software Version Check: ok
Node apollo-440-n1-p0.chaos.local:
    Role: active
    HA State: online
    Node Health: ok
Node apollo-440-n1-p1.chaos.local:
    Role: standby
    HA State: online
    Node Health: ok
Mirroring Status:
Component Name   Status
--------------   ------
nvram            ok
registry         ok
sms              ok
ddboost          ok
cifs             ok
--------------   ------
Article 17861: Healthcheck for Data Domain HA (DDHA) appliances
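The detailed HA output in Step 11 contains many ok / not ok fields, and one of them is easy to miss by eye (the example shows "External LAN Heartbeat Status: not ok"). The following sketch lists every field reported as "not ok". The function name `ha_findings` is hypothetical, the sample is taken from the example output in this step, and the line layout is an assumption.

```python
import re

# Sample copied from the example output in this step.
SAMPLE_HA_DETAILED = """\
HA System name: apollo-440-n1.chaos.local
HA System Status: highly available
Interconnect Status: ok
Primary Heartbeat Status: ok
External LAN Heartbeat Status: not ok
Hardware compatibility check: ok
Software Version Check: ok
Node apollo-440-n1-p0.chaos.local:
    Role: active
    HA State: online
    Node Health: ok
Node apollo-440-n1-p1.chaos.local:
    Role: standby
    HA State: online
    Node Health: ok
Mirroring Status:
Component Name   Status
--------------   ------
nvram            ok
registry         ok
sms              ok
ddboost          ok
cifs             ok
"""

# Matches both "Name: ok" lines and "name   ok" table rows.
STATUS_LINE = re.compile(r"^(?P<name>.+?)\s*:?\s+(?P<status>not ok|ok)$", re.IGNORECASE)


def ha_findings(output):
    """Return the names of all fields whose status is 'not ok'."""
    findings = []
    section = None
    for line in output.splitlines():
        text = line.strip()
        # A line ending in ':' (a node or "Mirroring Status") opens a section.
        if text.endswith(":"):
            section = text[:-1]
            continue
        match = STATUS_LINE.match(text)
        if match and match.group("status").lower() == "not ok":
            name = match.group("name")
            findings.append(f"{section}: {name}" if section else name)
    return findings


if __name__ == "__main__":
    print(ha_findings(SAMPLE_HA_DETAILED))  # prints: ['External LAN Heartbeat Status']
```

Fields such as Role or HA State carry values other than ok / not ok and are not evaluated by this sketch.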
Step 12 - Run the hardware health check (DDOS 8.3.x and later).
# support hardware healthcheck
Example Output:
HARDWARE Health Check Summary:
+-------------------+--------+
| Component | Status |
+-------------------+--------+
| Storage Disk | PASS |
| Power-Supply Unit | PASS |
| FAN | PASS |
| SAS Controller | PASS |
| QAT | PASS |
| NvRAM | PASS |
| DIMMs | PASS |
| IO Cards | PASS |
| CPU | PASS |
| NIC H/W Errors | PASS |
+-------------------+--------+
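The summary table from Step 12 can be reduced to a list of components that are not marked PASS. This is a minimal sketch: the function name `failed_components` is hypothetical, the sample is the example output in this step, and only the PASS value is known from this document, so any other status is simply reported as found.

```python
# Sample copied from the example output in this step.
SAMPLE_HEALTHCHECK = """\
HARDWARE Health Check Summary:
+-------------------+--------+
| Component | Status |
+-------------------+--------+
| Storage Disk | PASS |
| Power-Supply Unit | PASS |
| FAN | PASS |
| SAS Controller | PASS |
| QAT | PASS |
| NvRAM | PASS |
| DIMMs | PASS |
| IO Cards | PASS |
| CPU | PASS |
| NIC H/W Errors | PASS |
+-------------------+--------+
"""


def failed_components(output):
    """Return (component, status) pairs for every row not marked PASS."""
    failed = []
    for line in output.splitlines():
        text = line.strip()
        # Only rows delimited by '|' carry component data.
        if not text.startswith("|"):
            continue
        cells = [cell.strip() for cell in text.strip("|").split("|")]
        if len(cells) != 2 or cells == ["Component", "Status"]:
            continue
        name, status = cells
        if status != "PASS":
            failed.append((name, status))
    return failed


if __name__ == "__main__":
    print(failed_components(SAMPLE_HEALTHCHECK))  # prints: []
```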
TSR logs:
Special consideration for Dell PowerEdge-based Data Domain systems (for example: DD6900, DD9400, DD9900, DD3300, and newer):
Connect to iDRAC and check the system status and health. Gather a TSR log if necessary.
Article 21925: Data Domain: How to Collect a TSR Log.
Final Step for Recertification request - Reboot the system and, once it is back online, check for current alerts. Any alerts that impact the health of the system must be addressed.
If any further assistance is required, please open a Service Request with your Support Provider.