当节点报告为“Down”或“Offline”时应采取的措施

Summary: 如何确定节点是否关闭以及连接到处于关闭状态的节点的方法。

This article applies to This article does not apply to This article is not tied to any specific product. Not all product versions are identified in this article.

Instructions

每当某个节点在与群集中的其他节点通信时出现问题,就会报告为离线。从硬件到作系统,有许多原因会导致一个或多个节点报告为此状态。节点关闭的最常见指示器是事件消息。如果某个节点失去与群集中其余节点的连接,则会报告“节点离线”事件:

2.21767  02/27 05:14 C    3    173520         Node 3 is offline

 

如果您看到与此类似的事件,请确定节点是否已恢复或仍处于离线状态。要确定这一点,请使用 isi status 的输出。

如果 isi status 输出将所有节点报告为 OK:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [  OK ]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           16.7T (20.3T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             22.0G (< 1%)
Avail:            0 (n/a)             16.7T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C |    0| 524k| 524k|(No Storage HDDs)| 6.4G/ 5.6T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C |962.0|23.1M|23.1M|(No Storage HDDs)| 6.4G/ 5.6T(< 1%)
  3|xxx.xxx.xxx.150 | OK  | C |    0|    0|    0|(No Storage HDDs)| 9.2G/ 5.6T(< 1%)
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              |962.0|23.7M|23.7M|(No Storage HDDs)|22.0G/16.7T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------


Cluster Job Status:

No running jobs.

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded
 

在此示例中,所有节点报告为 OK。这表示所有节点都处于联机状态,并且是群集的一部分。确定是否有人重新启动了节点,或者是否正在执行维护。如果您不确定重新启动的原因,则需要收集日志并创建服务请求。

如果 isi status 报告节点处于 注意:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [ ATTN]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           15.0T (18.6T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             21.2G (< 1%)
Avail:            0 (n/a)             15.0T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C | 2.1k|16.9k|19.0k|(No Storage HDDs)| 6.4G/ 5.5T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C | 1.8M|10.0M|11.9M|(No Storage HDDs)| 6.4G/ 5.5T(< 1%)
  3|xxx.xxx.xxx.150 |-A-- | C | 4.0k|480.0| 4.5k|(No Storage HDDs)|10.7G/ 5.5T(< 1%)
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              | 1.8M|10.0M|11.9M|(No Storage HDDs)|21.2G/15.0T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------


Cluster Job Status:

Running jobs:
Job                        Impact Pri Policy     Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
FlexProtectLin[520]        Medium 1   MEDIUM     4/4   0:00:34
        Job Description: Working on nodes: None   and drives: node3:bay1

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded

节点上的 isi status 输出显示为注意 -A--,这是由群集上的严重事件触发的。处于注意状态的节点处于联机状态,并且是群集的一部分,但报告了问题。您可以使用 isi event list查看针对处于“注意”状态的节点报告的关键事件。在这种情况下,这是由于针对驱动器托架 1 运行的 FlexProtectLin 作业造成的。与 OK 状态一样,您希望确定节点重新启动的原因(如果可以)。否则,您需要收集日志并创建服务请求。

如果 isi 状态将节点报告为 Down:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [ ATTN]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           9.9T (13.5T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             12.7G (< 1%)
Avail:            0 (n/a)             9.9T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C |    0|73.9k|73.9k|(No Storage HDDs)| 6.4G/ 5.0T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C |    0|11.3k|11.3k|(No Storage HDDs)| 6.4G/ 5.0T(< 1%)
  3|xxx.xxx.xxx.150 |D--- | N |  n/a|  n/a|  n/a|  n/a/  n/a( n/a)|  n/a/  n/a( n/a)
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              |  n/a|  n/a|  n/a|(No Storage HDDs)|12.7G/ 9.9T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------
02/27 05:14:20  3    Node 3 offline


Cluster Job Status:

No running jobs.

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded
02/27 00:00:21  ShadowStoreDelete[516]     Succeeded

isi status输出将节点显示为 Down D---,这表示节点无法与群集通信。如果节点未因已知原因关闭(正在执行硬件维护、正在升级群集作系统等),请查看您是否可以与节点建立连接并立即创建服务请求。

远程建立与关闭节点的连接

如果节点关闭,则意味着它无法与群集通信。不过,您仍然可以连接到节点。您仍然可以远程登录,或是通过串行连接登录。

您可以从群集中的另一个节点尝试使用内部网络连接到关闭的节点。尝试 ping 群集名称节点编号?使用上面输出中的节点 3:

testcluster-1# ping testcluster-3
PING testcluster-3 (128.221.254.3): 56 data bytes
64 bytes from 128.221.254.3: icmp_seq=0 ttl=64 time=0.048 ms
64 bytes from 128.221.254.3: icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from 128.221.254.3: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- testcluster-3 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss

 在此示例中,我们能够对 clustername-node 编号执行 ping作,即使该节点报告为关闭也是如此。我们将尝试通过 ssh 连接到节点,看看是否可以连接。

如果节点在公用网络上具有静态分配的 IP 地址,则可以连接到该地址。要确定您是否具有来自群集的静态分配地址,请使用 isi network 命令:
 

testcluster-1# isi network interfaces list | grep Static
1    25gige-1     Up         -        groupnet0.subnet0.pool0 Static      192.168.1.148
2    25gige-1     Up         -        groupnet0.subnet0.pool0 Static      192.168.1.149
3    25gige-1     Unknown    -        groupnet0.subnet0.pool0 Static      192.168.1.150

 在此示例中,群集中的节点 3 的静态分配地址为 192.168.1.150。从群集中的另一个节点或有权访问该网络的工作站,我们将尝试对地址执行 ping作。如果我们可以成功 ping 地址,我们将尝试通过 ssh 连接到节点。

在本地建立与关闭节点的连接

如果有人在现场,并且计算机具有串行端口或 USB 转串行适配器以及零调制解调器线缆或带零调制解调器适配器的串行线缆。它们可以直接连接到节点以进行故障处理。有关如何连接到节点上的串行端口的信息,请参阅 PowerScale:无法远程连接时客户连接到串行端口的步骤

Affected Products

PowerScale, Isilon Gen6.5, Isilon Gen6, Isilon NL-Series, PowerScale OneFS, Isilon S-Series, Isilon Scale-out NAS, Isilon X-Series
Article Properties
Article Number: 000290053
Article Type: How To
Last Modified: 02 Jul 2025
Version:  1
Find answers to your questions from other Dell users
Support Services
Check if your device is covered by Support Services.