Isilon OneFS: Event notification: Node Offline - Event ID: 200010001, 300010003, 399990001, 900160001, 910100006, 400150007
Symptoms
Event
You receive a "Node Offline" event notification. Event ID: 200010001.
"Node Offline" events are generated when a node is reported offline by the other nodes in the cluster. This event can also be generated when the internal link is lost on any node.
NOTE: If the node is not turned on, follow the procedure in "How to power cycle and drain an Isilon node".
Cause
Details
One of the following conditions is true:
- One or more nodes rebooted.
- One or more nodes are powered off.
- A node lacks back-end network (InfiniBand, or IB) connectivity. Back-end connectivity refers to a node's ability to communicate with other nodes.
- A node cannot join the group.
Resolution
Response
Before you begin troubleshooting the issue, confirm that the event is not related to maintenance on the cluster. After confirming that no maintenance is in progress, proceed with the following troubleshooting.
If the node rebooted
- Open an SSH connection to the node and log on using the "root" account.
- Run the following command to confirm the node rejoined the cluster:
isi status
The isi status command returns output similar to the following. If the node successfully rejoined the cluster, the Health column will not display D (down):
                   Health  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |   In   Out Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|10.111.183.10  | OK  | 115K| 220K| 335K| 531M/  10T(< 1%)|    (No SSDs)
  2|10.111.183.11  | OK  |    0|    0|    0| 519M/  10T(< 1%)|    (No SSDs)
  3|10.111.183.12  | OK  |    0|  26K|  26K| 521M/  10T(< 1%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          | 115K| 246K| 361K| 1.5G/  31T(< 1%)|    (No SSDs)
Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
- Run the following command to confirm the uptime duration:
uptime
Output similar to the following appears:
8:41PM up 10 mins, 1 user, load averages: 0.08, 0.18, 0.14
If the node recently rebooted, the uptime duration will be relatively short, in minutes.
- Gather logs by running the following command and send them to Isilon Technical Support for analysis:
isi_gather_info
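The two checks in this section can be scripted. The following is a minimal sketch, not an Isilon-supplied tool. The function names down_nodes and recently_rebooted are hypothetical, and the parsing assumes the output layouts shown above: a "|"-separated node table in which a down node carries a D in the health column, and an uptime line of the form "up N mins". Verify the behavior against your OneFS release before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OneFS.

# Print the ID of every node whose health column contains "D" (down).
# Reads `isi status` output on stdin. Assumes the table layout shown
# above: fields separated by "|", node ID first, health flags third.
down_nodes() {
    awk -F'|' '
        {
            id = $1; gsub(/[ \t]/, "", id)
            # Only node rows start with a numeric ID; this skips the
            # header, separator, totals, and legend lines.
            if (id ~ /^[0-9]+$/ && $3 ~ /D/) print id
        }'
}

# Succeed (exit 0) if the `uptime` line on stdin reports an uptime in
# seconds or minutes, which suggests a recent reboot.
recently_rebooted() {
    grep -Eq 'up +[0-9]+ +(sec|min)s?,'
}

# Usage on a cluster node (commands taken from the steps above):
#   isi status | down_nodes
#   uptime | recently_rebooted && echo "node rebooted recently"
```

If down_nodes prints nothing, no node row in the table shows D. A non-matching uptime line only means the uptime is not reported in seconds or minutes.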
If you can ping the external IP address of the down node
- Confirm the status of the node:
- Open an SSH connection to the node and log on using the "root" account.
- Run the following command:
ifconfig |grep -A4 ib1
The ifconfig command should return the following status indicating that the internal interface is active:
ib1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 2004
lladdr 0.15.1b.0.10.bd.4c.77
inet 172.10.111.200 netmask 0xffffff00 broadcast 1.10.111.255 zone 1
media: Infiniband autoselect
status: active
- If the status is inactive, check the following:
  a. Are the activity lights for the ports on the IB card on or off? If the lights are off, go to step b.
  b. Are the IB cables firmly attached to the node and the IB switch? If not, reseat the cables on the node and the switch.
  c. Is the IB switch powered on? If not, power it on.
- Visually inspect the node to verify that the power light is on.
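The interface check in this section can be wrapped in a small helper. This is a sketch under stated assumptions: ib_link_active is a hypothetical name, and it only looks for the "status: active" line shown in the sample ifconfig output above.

```shell
#!/bin/sh
# Sketch only: `ib_link_active` is a hypothetical helper name.

# Succeed (exit 0) if the interface block on stdin contains a
# "status: active" line, as in the sample output above.
# A "status: inactive" line does not match.
ib_link_active() {
    grep -Eq '^[[:space:]]*status:[[:space:]]*active[[:space:]]*$'
}

# Usage on the node (command taken from the step above):
#   if ifconfig | grep -A4 ib1 | ib_link_active; then
#       echo "ib1 is active"
#   else
#       echo "ib1 is not active: check port lights, cables, and the IB switch"
#   fi
```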
If the node is turned off
- Attempt to turn on the node.
NOTE: If possible, establish serial access to the node so that you can monitor it as it boots and capture any information that might assist in troubleshooting. For more information, see "Isilon: How to connect to the management port of a node".
- If the node turns on, confirm whether it rejoined the cluster:
- Open a secure shell (SSH) connection to a different node in the cluster and log on using the root account.
- Run the following command to determine whether the node has rejoined the cluster:
isi status
The isi status command returns output similar to the following. If the node successfully rejoined the cluster, the Health column will not display D (down):
                   Health  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |   In   Out Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|10.111.183.10  | OK  | 115K| 220K| 335K| 531M/  10T(< 1%)|    (No SSDs)
  2|10.111.183.11  | OK  |    0|    0|    0| 519M/  10T(< 1%)|    (No SSDs)
  3|10.111.183.12  | OK  |    0|  26K|  26K| 521M/  10T(< 1%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          | 115K| 246K| 361K| 1.5G/  31T(< 1%)|    (No SSDs)
Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
- If the node rejoins the cluster, gather logs by running the following command and send them to Isilon Technical Support for analysis:
isi_gather_info
- If the node does not rejoin the cluster, proceed to the next section.
- If the node does not turn on, ensure that the circuit breakers are operational and that the power outlets are active.
- If the node is not receiving power, resolve the power supply issue.
- If the node is off and it is receiving power, contact Isilon Technical Support for help troubleshooting the issue.
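After turning on a node, it can take some time before it rejoins the cluster, so it may be convenient to poll isi status from a different node. The following is a minimal sketch, not an Isilon-supplied tool. The function wait_for_rejoin and its arguments are hypothetical, and the check assumes that a down node shows a D in the health column of the table shown above.

```shell
#!/bin/sh
# Sketch only: `wait_for_rejoin` and its arguments are hypothetical.

# Poll a status command until the given node ID no longer shows "D"
# (down) in the health column, or until the attempts run out.
#   $1 = node ID, $2 = max attempts, $3 = seconds between attempts
#   remaining args = command that prints `isi status`-style output
wait_for_rejoin() {
    node_id=$1; attempts=$2; delay=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # The node counts as rejoined when its row exists and has no "D".
        if "$@" | awk -F'|' -v want="$node_id" '
               { id = $1; gsub(/[ \t]/, "", id) }
               id == want { found = 1; if ($3 ~ /D/) down = 1 }
               END { exit !(found && !down) }'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage from a different node in the cluster:
#   wait_for_rejoin 3 30 10 isi status && echo "node 3 rejoined"
```

The status command is passed as an argument so that the loop can be exercised with canned output before it is pointed at a live cluster.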
If the node is powered on but did not rejoin the cluster
- Attempt to establish remote access via a secure shell (SSH) session. If the SSH session fails, attempt to establish remote access via the serial console.
- If neither the SSH session nor the serial console is responsive, press CTRL+T either within the SSH session or on the serial console.
- If pressing CTRL+T produces output, record the output, and then contact Isilon Technical Support for failure analysis.
- If the node is unresponsive, press the power button three times and then wait five minutes for the node to power off.
- If the node does not power down, press and hold down the power button until the node powers off.
- Press the power button again to power on the node.
- If the node powers up and returns a login prompt, log on using the "root" account.
- Gather logs by running the following command and send them to Isilon Technical Support for analysis:
isi_gather_info
- If the node does not rejoin the cluster, contact Isilon Technical Support for help troubleshooting the issue.
Additional Information
Event Id: 200010002 - NODE_STATUS_ONLINE
Event Id: 200010003 - XTND_OFFLINE
Event Id: 200010005 - DISKNODE_OFFLINE
Event Id: 299990001 - NODE_COALESCE
Event Id: 300020001 - RO_TRANS_FAILED
Event Id: 300010002 - NODE_SHUTDOWN
Event Id: 300020002 - NODE_REBOOT_JRNL_BKUP_FAIL
OneFS error: Could not recover journal
https://www.dell.com/support/kbdoc/32508
How to safely shut down an Isilon cluster prior to a scheduled power outage
https://www.dell.com/support/kbdoc/18989
Event Id: 300010003 - BOOT_TIMEOUT
Event Id: 399990001 - MAINT_REBOOT_COALESCE
Event Id: 300020003 - MAINT_REBOOT_SHUTDOWN_FAILED
Event Id: 300010001 - NODE_REBOOT
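When reviewing event logs, it can help to translate these IDs into their names. The following lookup is a sketch built only from the list above. The function name event_name is hypothetical, and any ID not listed in this section is reported as UNKNOWN.

```shell
#!/bin/sh
# Sketch only: `event_name` is a hypothetical helper. The ID-to-name
# pairs are copied from the "Additional Information" list above.
event_name() {
    case "$1" in
        200010002) echo "NODE_STATUS_ONLINE" ;;
        200010003) echo "XTND_OFFLINE" ;;
        200010005) echo "DISKNODE_OFFLINE" ;;
        299990001) echo "NODE_COALESCE" ;;
        300010001) echo "NODE_REBOOT" ;;
        300010002) echo "NODE_SHUTDOWN" ;;
        300010003) echo "BOOT_TIMEOUT" ;;
        300020001) echo "RO_TRANS_FAILED" ;;
        300020002) echo "NODE_REBOOT_JRNL_BKUP_FAIL" ;;
        300020003) echo "MAINT_REBOOT_SHUTDOWN_FAILED" ;;
        399990001) echo "MAINT_REBOOT_COALESCE" ;;
        # IDs outside the list above are not mapped here.
        *) echo "UNKNOWN"; return 1 ;;
    esac
}

# Usage:
#   event_name 300010003
```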