Health check for Data Domain HA (DDHA) appliances
Summary: This article provides guidance for performing a basic HA system health check after a service event. Data Domain Highly Available (DDHA) configurations vary depending on the Data Domain models used. ...
Symptoms
Data Domain Highly Available (DDHA) systems are designed to fail over between nodes. Only the active node is in production; the standby node waits for a failure event to take the place of the active node (active-passive).
It is imperative to verify that both DDHA nodes are in working order and can complete a failover if a failure occurs.
The CLI commands detailed in this article help uncover possible issues that could prevent a successful failover.
This guide is broken up into the key areas that should be checked:
- Network
- Filesystem
- HA Hardware and Configuration
# net show settings
Network port settings differ depending on the node on which the net show settings command is run. Configured ports on DDHA systems are of type "floating" or type "fixed." Run net show settings on both nodes and compare the outputs.
- "Floating" interfaces: Verify that any configured Network Card (NIC) port, alias, or veth, which displays an enabled and running state on the active node has an identical enabled and running state on the standby node. It is expected that any configured NIC port, alias, or Veth set to type floating has an IP address displayed on the active node and a corresponding N/A on the standby node.
- "Fixed" interfaces: Verify that any configured NIC port, alias, or veth which is tagged as "fixed" displays an "enabled and running state". "Fixed" interfaces do not have identical configurations between nodes
- Verify the HA interconnect (veth99) is displayed and all required ports are enabled and running, Note: The number of required port connections and slot location for the HA interconnect (veth99) is DD model specific
Active node:
# net show settings
port    enabled  state    DHCP  IP address                                netmask        type          additional setting
------  -------  -------  ----  ----------------------------------------  -------------  ------------  -----------------------------------------------
ethMa   yes      running  no    10.25.18.50                               255.255.255.0  fixed
                                2620:0:170:1608:260:16ff:fe5c:92bc** /64
                                fe80::260:16ff:fe5c:92bc** /64
ethMb   no       down     ipv4  n/a                                       n/a            fixed
ethMc   no       down     ipv4  n/a                                       n/a            fixed
ethMd   no       down     ipv4  n/a                                       n/a            fixed
eth4a   yes      running  no    10.25.18.63                               255.255.255.0  floating
                                2620:0:170:1608:260:16ff:fe51:8c60** /64
                                fe80::260:16ff:fe51:8c60** /64
eth4b   no       down     no    n/a                                       n/a            fixed
eth4c   no       down     no    n/a                                       n/a            fixed
eth4d   no       down     no    n/a                                       n/a            fixed
eth5a   no       down     no    n/a                                       n/a            fixed
eth5b   yes      running  no    10.25.18.60                               255.255.255.0  floating
                                2620:0:170:1608:260:16ff:fe52:2951** /64
                                fe80::260:16ff:fe52:2951** /64
eth5c   no       down     no    n/a                                       n/a            fixed
eth5d   no       down     no    n/a                                       n/a            fixed
eth11a  yes      running  n/a   n/a                                       n/a            interconnect  bonded to veth99
eth11b  yes      running  n/a   n/a                                       n/a            interconnect  bonded to veth99
eth11c  yes      running  n/a   n/a                                       n/a            interconnect  bonded to veth99
eth11d  yes      running  n/a   n/a                                       n/a            interconnect  bonded to veth99
veth99  yes      running  no    d:d:d:d:d:0060:1652:0ecc /80                             interconnect  lacp hash xor-L3L4: eth11a,eth11b,eth11c,eth11d
                                fe80::260:16ff:fe52:ecc** /64
------  -------  -------  ----  ----------------------------------------  -------------  ------------  -----------------------------------------------

Standby node:
# net show settings
port    enabled  state    DHCP  IP address                                 netmask        type          additional setting
------  -------  -------  ----  -----------------------------------------  -------------  ------------  -----------------------------------------------
ethMa   yes      running  no    10.25.18.49                                255.255.255.0  fixed
                                2620:0:170:14567:260:16ff:fe5c:dr3** /64
                                fe80::260:16ff:fe5c3457c** /64
ethMb   no       down     ipv4  n/a                                        n/a            fixed
ethMc   no       down     ipv4  n/a                                        n/a            fixed
ethMd   no       down     ipv4  n/a                                        n/a            fixed
eth4a   yes      running  no    n/a                                        255.255.255.0  floating
                                2620:0:170:1608:260:1ght6:fe51:4570** /64
                                fe80::260:16ff:fe51:7890** /64
eth4b   no       down     no    n/a                                        n/a            fixed
eth4c   no       down     no    n/a                                        n/a            fixed
eth4d   no       down     no    n/a                                        n/a            fixed
eth5a   no       down     no    n/a                                        n/a            fixed
eth5b   yes      running  no    n/a                                        255.255.255.0  floating
                                2620:0:170:160:456:16ff:fe5234561** /64
                                fe80::260:16ff:fe52:3456** /64
eth5c   no       down     no    n/a                                        n/a            fixed
eth5d   no       down     no    n/a                                        n/a            fixed
eth11a  yes      running  n/a   n/a                                        n/a            interconnect  bonded to veth99
eth11b  yes      running  n/a   n/a                                        n/a            interconnect  bonded to veth99
eth11c  yes      running  n/a   n/a                                        n/a            interconnect  bonded to veth99
eth11d  yes      running  n/a   n/a                                        n/a            interconnect  bonded to veth99
veth99  yes      running  no    d:d:d:d:d:0e456:1652:dft4c /80                            interconnect  lacp hash xor-L3L4: eth11a,eth11b,eth11c,eth11d
                                fe80::264:16ff:fec2:ecb** /64
------  -------  -------  ----  -----------------------------------------  -------------  ------------  -----------------------------------------------
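In addition to comparing the settings tables, it may be worth confirming the physical link state and speed of the ports on each node. A minimal sketch, using the standard DD OS net show hardware command (run it on both nodes and compare the outputs):

# net show hardware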
Verify network connectivity.
Review the IPs listed on each node, and ensure that each configured IP address on the active node and standby nodes can ping its configured gateway.
Note: Some customers have the ping (ICMP) disabled in their environment. In this event, engage the customer to confirm connectivity.
(active:1)# net route show gateway detailed
IPv4 Default Gateways
gateway     source  tables  interface IP address  owner
----------  ------  ------  --------------------  -----
10.25.18.1  static          ethMa 10.25.18.50/24  none
10.25.18.1  static          eth4a 10.25.18.63/24  none
10.25.18.1  static          eth5b 10.25.18.60/24  none
----------  ------  ------  --------------------  -----
Ping the gateway IP address with each configured ethxx.
(active:1)# ping 10.25.18.1 interface ethMa
PING 10.25.18.1 (10.25.18.1) from 10.25.18.50 ethMa: 56(84) bytes of data.
64 bytes from 10.25.18.1: icmp_seq=0 ttl=255 time=0.697 ms

(active:1)# ping 10.25.18.1 interface eth4a
PING 10.25.18.1 (10.25.18.1) from 10.25.18.63 eth4a: 56(84) bytes of data.
64 bytes from 10.25.18.1: icmp_seq=0 ttl=255 time=1.31 ms

(active:1)# ping 10.25.18.1 interface eth5b
PING 10.25.18.1 (10.25.18.1) from 10.25.18.60 eth5b: 56(84) bytes of data.
64 bytes from 10.25.18.1: icmp_seq=0 ttl=255 time=1.31 ms
# net troubleshooting duplicate-ip
From both nodes, check for duplicate IP addresses:
No duplicate IP addresses detected
Fibre Channel Testing
Verify that Fibre Channel features are licensed, and then test them to confirm that they are fully functional (for example, run test backup operations to the VTL).
# license show   (or # elicense show)
##  License Key          Feature
--  -------------------  ----------------------------------------
1   WTXV-TSWX-HWDR-RHDX  VTL
2   EZXW-SZZF-BGCS-VRZX  Block services (Vdisk)
3   ....                 HA
--  -------------------  ----------------------------------------
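Before running test backups, it may help to confirm that the FC services themselves are up on the active node. A minimal sketch, assuming VTL is the licensed feature in use (vtl status and scsitarget endpoint show list are standard DD OS commands; output varies by release):

# vtl status
# scsitarget endpoint show list

vtl status should report the VTL process as enabled and running, and the endpoint listing should show the configured FC endpoints online.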
Autosupport and alert testing from both active and standby
(active:1)# autosupport test alert-summary
OK: Message sent.
(active:1)# autosupport test support-notify
OK: Message sent.
(standby:0)# autosupport test alert-summary
OK: Message sent.
(standby:0)# autosupport test support-notify
OK: Message sent.
If ConnectEMC (Secure Remote Services) is being used to forward autosupports (ASUPs) to Data Domain support, use the following command to verify connectivity on both nodes.
The timestamp indicates when the last connection was established.
sysadmin@hostname# support connectemc show history
File                                     Time                   Transport  Result
---------------------------------------  ---------------------  ---------  -------
RSC_CKM00XXX601153_120315_092804166.xml  "2015-12-03 09:28:07"  HTTP       Success
RSC_CKM00XXX601153_120315_101257767.xml  "2015-12-03 10:13:00"  HTTP       Success
RSC_CKM00XXX601153_120315_111649065.xml  "2015-12-03 11:16:53"  HTTP       Success
---------------------------------------  ---------------------  ---------  -------
Note: The Transport column shows HTTP, but the connection is actually HTTPS.
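If the history is stale, a test event can be sent on the spot; a minimal sketch (the support connectemc test command exists on current DD OS releases, but confirm its availability on the installed release):

# support connectemc test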
HA Filesystem Troubleshooting
# filesys status
Verify that the filesystem is enabled and running. Cleaning status may also be displayed.
The filesystem is enabled and running.
Cleaning started at 2016/08/20 14:12:16: phase 1 of 12 (pre-merge)
0.7% complete, 95911 GiB free; time: phase 0:00:09, total 0:00:09
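If cleaning is running and only its progress is of interest, it can be queried on its own (filesys clean status is a standard DD OS command):

# filesys clean status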
(active:1)# (standby:0)# system upgrade status
From both nodes, verify that all upgrades have been completed.
Current Upgrade Status:
DD OS upgrade Succeeded
End time: 2016.08.20:13:27
(active:1)# (standby:0)# date
Ensure that the time and date on both nodes match within 10 seconds.
-p1(active:1)# date
Sat Aug 20 14:34:29 EDT 2016
-p0(standby:0)# date
Sat Aug 20 14:34:17 EDT 2016
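If the clocks have drifted apart, NTP configuration is the usual suspect. A quick check on both nodes (ntp status is a standard DD OS command that reports whether NTP is enabled and synchronized):

(active:1)# ntp status
(standby:0)# ntp status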
From the active node, verify with the customer that DD replication (if configured) is operating as expected.
# replication status
CTX  Destination                               Enabled  Connection        Sync'ed-as-of-time
---  ----------------------------------------  -------  ----------------  ------------------
3    mtree://ddxxx.com/data/col1/eric.dest     no       idle              Fri Nov  6 15:16
4    mtree://ddxxx.com/data/col1/thy-repl      yes      idle              Fri Jul 22 15:38
5    dir://ddxxxx.com/backup/replicate-rtp     yes      disconnected      Fri Jul 22 14:55
6    mtree://ddxxxx.com/data/col1/theman_test  yes      idle              Sat Aug 20 22:11
7    dir://ddxxx.com/backup/lakeland/sym       yes      Sat Aug 20 13:15  Fri Aug 19 15:09
---  ----------------------------------------  -------  ----------------  ------------------
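If a context is disabled or shows disconnected (as with CTX 3 and CTX 5 above), reviewing the context configuration is a reasonable next step (replication show config is a standard DD OS command):

# replication show config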
Cause
HA Hardware and Configuration
# alerts show current
On both the active node and the standby node, check whether there is an active alert pointing to a potential issue. Alerts are not always shared between nodes, so check both nodes. If an unexpected issue is encountered, file a support case, and always generate a support bundle from both nodes.
Note: Most alerts are seen on only one of the nodes; not every alert is shared between nodes.
Alert Examples:
Severity  Class            Object              Message
--------  ---------------  ------------------  ----------------------------------------------
CRITICAL  HardwareFailure                      EVT-ENVIRONMENT-00049: The system detected an invalid hardware configuration.
CRITICAL  HardwareFailure                      EVT-ENVIRONMENT-00048: Filesystem can't be enabled due to an invalid hardware configuration.
WARNING   HardwareFailure  Enclosure=1:Slot=5  EVT-ENVIRONMENT-00047: PCI communication speed is degraded.
WARNING   HA                                   EVT-HA-00003: Standby node time is off by 15 second(s).
WARNING   HardwareFailure  Port Index=1        EVT-MPATH-00003: Missing disk connection from system port 6a.
--------  ---------------  ------------------  ----------------------------------------------
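Alerts that have already cleared can still point at a flapping component, so reviewing the alert history on both nodes may also be worthwhile (alerts show history is a standard DD OS command):

# alerts show history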
# ha status
The ha status command, run on the active node or the standby node, can be used to determine the current HA status. If the status is 'highly available', failover is enabled.
If the status is 'degraded', or one of the nodes is not in the 'online' state, then failover between nodes is disabled.
SE@hostname-p0(active:0)## ha status
HA System name: hostname-n1.chaos.local
HA System status: highly available
Node Name                Node id  Role     HA State
-----------------------  -------  -------  --------
hostname-p0.chaos.local  0        active   online
hostname-p1.chaos.local  1        standby  online
-----------------------  -------  -------  --------
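Where the customer approves it, a controlled manual failover is the most direct proof that failover works. DD OS HA systems provide the ha failover command for this; the sketch below assumes a maintenance window and a healthy 'highly available' status, and the exact procedure should be confirmed in the DD OS administration guide for the installed release:

# ha failover

After the takeover completes, ha status should show the node roles swapped, with both nodes online.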
# ha status detailed
The ha status detailed command, available on the active node only, provides more detailed information about the HA status.
Any component showing 'not ok' under the Mirroring Status section indicates a non-functioning component, and the HA System Status displays as 'degraded'.
Any degraded state prevents failover between nodes.
Note: This command is not available on the standby node.
SE@hostname-p0(active:0)## ha status detailed
HA System name: hostname.chaos.local
HA System Status: highly available
Interconnect Status: ok
Primary Heartbeat Status: ok
External LAN Heartbeat Status: not ok
Hardware compatibility check: ok
Software Version Check: ok

Node hostname-p0.chaos.local:
  Role: active
  HA State: online
  Node Health: ok

Node hostname-p1.chaos.local:
  Role: standby
  HA State: online
  Node Health: ok

Mirroring Status:
  Component Name  Status
  --------------  ------
  nvram           ok
  registry        ok
  sms             ok
  ddboost         ok
  cifs            ok
  --------------  ------
# enclosure show io-cards
Verify that both nodes have identical, supported configurations.
# enclosure show misconfiguration
Perform a misconfiguration test from both the active and standby nodes to check whether there is any problem with the hardware configuration.
Reference KB https://www.dell.com/support/kbdoc/en-us/463399
Examples:
Memory DIMMs:
Locator  Bank Locator  Size(GiB)  Status
-------  ------------  ---------  ----------
CHCD1    7             0          missing
CHDD1    7             0          missing
CHAD0    4             8          wrong size
CHBD0    4             8          wrong size
-------  ------------  ---------  ----------

IO Cards:
Slot  Device      Status
----  ----------  ---------
10    Hera NVRAM  extra
10    Hera NVRAM  misplaced
----  ----------  ---------

CPUs:
No misconfiguration found.

Disks:
Slot  Size(GiB)  Type  Media  Status
----  ---------  ----  -----  -------
2     186        SATA  SSD    missing
----  ---------  ----  -----  -------
# enclosure show topology
Check the topology from both nodes.
Look for any errors between connection points, and ensure that all shelf numbering is correct.
- Errors and faults are symbolized with '?', '!', or '!!'.
Note: The topology output of each node should be the reverse (mirror image) of the other.
(standby:0)## enclosure show topology
Port enc.ctrl.port enc.ctrl.port enc.ctrl.port enc.ctrl.port
---- - ------------- - ------------- - ------------- - -------------
2a
2b
2c
2d > 5.A.E: 5.A.H ? 4.A.E: 4.A.H > 3.A.E: 3.A.H > 2.A.E: 2.A.H
3a
3b
3c
3d
6a !! 2.B.E: 2.B.H > 3.B.E: 3.B.H > 5.B.E: 5.B.H > ?.B.E: ?.B.H
6b
6c
6d
---- - ------------- - ------------- - ------------- - -------------
(active:1)## enclosure show topology
Port enc.ctrl.port enc.ctrl.port enc.ctrl.port enc.ctrl.port
---- - ------------- - ------------- - ------------- - -------------
2a
2b
2c
2d > 2.A.H: 2.A.E > 3.A.H: 3.A.E > 4.A.H: 4.A.E > 5.A.H: 5.A.E
3a
3b
3c
3d
6a > 5.B.H: 5.B.E > 4.B.H: 4.B.E > 3.B.H: 3.B.E > 2.B.H: 2.B.E
6b
6c
6d
---- - ------------- - ------------- - ------------- - -------------
# enclosure test topology all duration 1
From both the active and standby nodes, perform a one-minute diagnostic test of every SAS HBA port with attached external storage.
Do not perform topology testing on both nodes simultaneously.
The expected result is no error detected for each port with storage attached.
If a problem is found, the test can stop with a failure message identifying the faulty SAS connection, or it may show an error marker (?, !) at a particular connection.
Note: During the topology test, each port has a separate output indicating its state. Look for error markers (?, !) to pinpoint the problem connection. No CLI output is shown until each port test is completed.
# enclosure test topology
Started: 1471719316
Ended: 1471719498
Duration: 182

Port  enc.ctrl.port     enc.ctrl.port     enc.ctrl.port     enc.ctrl.port
----  -  -------------  -  -------------  -  -------------  -  -------------
2d    >  5.A.H:5.A.E    >  4.A.H:4.A.E    >  3.A.H:3.A.E    >  2.A.H:2.A.E
----  -  -------------  -  -------------  -  -------------  -  -------------

Error message:
-----------------
No error detected
-----------------
# system show nvram
On both the active and standby nodes, ensure that the NVRAM batteries are charged or charging, and that all the NVRAM error counters show a value of zero.
# system show nvram
NVRAM Cards:
Card  Component                Value
----  -----------------------  ----------------------------------------------------------------------
1     Slot                     0
      Firmware version         0.0.80
      Memory size              7.93 GiB
      Errors                   0 memory (0 uncorrectable), 0 PCI, 0 controller
      Flash controller Errors  0 Cfg Err, 0 PANIC, 0 Bus Hang, 0 Bad Blk Warn, 0 Bkup Err, 0 Rstr Err
      Board temperature        37 C
      CPU temperature          47 C
      Number of batteries      1
----  -----------------------  ----------------------------------------------------------------------

NVRAM Batteries:
Card  Battery  Status  Charge  Charging  Time To      Temperature  Voltage
                               Status    Full Charge
----  -------  ------  ------  --------  -----------  -----------  -------
1     1        ok      94 %    enabled   0 mins       34 C         4.016 V
----  -------  ------  ------  --------  -----------  -----------  -------
Resolution
If further assistance is required, contact your contracted Service Provider.
Affected Products
Data Domain
Products
Data Domain, DD OS 6.0
Article Properties
Article Number: 000017861
Article Type: Solution
Last Modified: 05 Jul 2024
Version: 3