當節點報告為關閉或離線時該怎麼辦

Summary: 如何判斷節點是否已關閉,以及如何連線至處於關閉狀態的節點。

This article applies to This article does not apply to This article is not tied to any specific product. Not all product versions are identified in this article.

Instructions

每當節點與叢集中其他節點通訊時發生問題,就會回報其為離線。有許多原因會導致節點報告處於此狀態,從硬體到作業系統。節點關閉的最常見指標是在事件消息中。如果節點失去與叢集中其餘節點的連線能力,則會回報「節點離線」事件:

2.21767  02/27 05:14 C    3    173520         Node 3 is offline

 

如果看到與此類似的事件,請確定節點是否已恢復或是否仍處於離線狀態。若要判斷這一點,請使用 isi 狀態的輸出。

如果 isi 狀態輸出報告所有節點都正常:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [  OK ]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           16.7T (20.3T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             22.0G (< 1%)
Avail:            0 (n/a)             16.7T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C |    0| 524k| 524k|(No Storage HDDs)| 6.4G/ 5.6T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C |962.0|23.1M|23.1M|(No Storage HDDs)| 6.4G/ 5.6T(< 1%)
  3|xxx.xxx.xxx.150 | OK  | C |    0|    0|    0|(No Storage HDDs)| 9.2G/ 5.6T(< 1%)
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              |962.0|23.7M|23.7M|(No Storage HDDs)|22.0G/16.7T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------


Cluster Job Status:

No running jobs.

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded
 

在此範例中,所有節點報告為 OK。這表示所有節點都處於線上狀態,並且是叢集的一部分。判斷是否有人將節點重新開機,或是否正在執行維護。如果您不確定重新開機的原因,您可能希望收集記錄並開啟服務要求。

如果 isi 狀態在以下位置回報節點:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [ ATTN]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           15.0T (18.6T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             21.2G (< 1%)
Avail:            0 (n/a)             15.0T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C | 2.1k|16.9k|19.0k|(No Storage HDDs)| 6.4G/ 5.5T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C | 1.8M|10.0M|11.9M|(No Storage HDDs)| 6.4G/ 5.5T(< 1%)
  3|xxx.xxx.xxx.150 |-A-- | C | 4.0k|480.0| 4.5k|(No Storage HDDs)|10.7G/ 5.5T(< 1%)
---+----------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              | 1.8M|10.0M|11.9M|(No Storage HDDs)|21.2G/15.0T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------


Cluster Job Status:

Running jobs:
Job                        Impact Pri Policy     Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
FlexProtectLin[520]        Medium 1   MEDIUM     4/4   0:00:34
        Job Description: Working on nodes: None   and drives: node3:bay1

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded

節點上的 isi 狀態輸出顯示為注意 -A--,這是由叢集上嚴重事件觸發。處於注意狀態的節點已連線,並且屬於叢集的一部分,但報告問題。您可以使用 isi 事件清單,在注意時查看節點報告了哪些嚴重事件。在此情況下,這是因為針對磁碟機槽 1 執行 FlexProtectLin 工作。與OK狀態一樣,如果可以,您可能希望確定節點重新啟動的原因。如果沒有,您可能希望收集記錄並開啟服務要求。

如果 isi 狀態報告節點為「關閉」:

testcluster-1# isi status
Cluster Name: testcluster
Cluster Health:     [ ATTN]
Data Reduction:     1.33 : 1
Storage Efficiency: 0.72 : 1
Cluster Storage:  HDD                 SSD Storage
Size:             0 (0 Raw)           9.9T (13.5T Raw)
VHS Size:         3.6T
Used:             0 (n/a)             12.7G (< 1%)
Avail:            0 (n/a)             9.9T (> 99%)

                   Health Ext  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |C/N|  In   Out  Total| Used / Size     |Used / Size
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
  1|xxx.xxx.xxx.148 | OK  | C |    0|73.9k|73.9k|(No Storage HDDs)| 6.4G/ 5.0T(< 1%)
  2|xxx.xxx.xxx.149 | OK  | C |    0|11.3k|11.3k|(No Storage HDDs)| 6.4G/ 5.0T(< 1%)
  3|xxx.xxx.xxx.150 |D--- | N |  n/a|  n/a|  n/a|  n/a/  n/a( n/a)|  n/a/  n/a( n/a)
---+---------------+-----+---+-----+-----+-----+-----------------+-----------------
Cluster Totals:              |  n/a|  n/a|  n/a|(No Storage HDDs)|12.7G/ 9.9T(< 1%)

     Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
           External Network Fields: C = Connected, N = Not Connected

Critical Events:
Time            LNN  Event
--------------- ---- -------------------------------------------------------
02/27 05:14:20  3    Node 3 offline


Cluster Job Status:

No running jobs.

No paused or waiting jobs.

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
02/27 04:00:38  ShadowStoreProtect[518]    Succeeded
02/27 02:00:14  WormQueue[517]             Succeeded
02/27 00:00:21  ShadowStoreDelete[516]     Succeeded

isi 狀態輸出顯示節點為 Down-D---,這表示節點無法與叢集通訊。如果節點因已知原因未關閉 (正在執行硬體維護、叢集作業系統正在升級等),請查看您是否可以建立與節點的連線,並立即開立服務要求。

從遠端建立與關閉節點的連線

如果節點關閉,則表示無法與叢集通訊。不過,您仍然可以連接到節點。您仍然可以遠端登入或透過序列連線登入。

您可以從叢集中的另一個節點嘗試使用內部網路連線至關閉節點。嘗試 ping 叢集名稱-節點編號?使用上述輸出中的節點 3:

testcluster-1# ping testcluster-3
PING testcluster-3 (128.221.254.3): 56 data bytes
64 bytes from 128.221.254.3: icmp_seq=0 ttl=64 time=0.048 ms
64 bytes from 128.221.254.3: icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from 128.221.254.3: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- testcluster-3 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss

 在此範例中,我們能夠 ping 叢集名稱-節點編號,即使節點回報為故障。我們會嘗試 ssh 到節點,看看是否可以連線。

如果節點在您的公共網路上具有靜態分配的IP位址,則可以連接到該位址。若要判斷您是否有來自叢集的靜態指派位址,請使用 isi network 命令:
 

testcluster-1# isi network interfaces list | grep Static
1    25gige-1     Up         -        groupnet0.subnet0.pool0 Static      192.168.1.148
2    25gige-1     Up         -        groupnet0.subnet0.pool0 Static      192.168.1.149
3    25gige-1     Unknown    -        groupnet0.subnet0.pool0 Static      192.168.1.150

 在此示例中,群集中的節點 3 具有靜態分配的位址 192.168.1.150。我們會從叢集中的另一個節點或有權存取該網路的工作站,嘗試 ping 該位址。如果能成功 ping 位址,便會嘗試 ssh 進入節點。

建立與本機關閉節點的連線

如果有人在現場,並且他們有一部帶有串行埠或 USB 轉串行適配器的電腦,以及帶有零數據機適配器的零數據機電纜或串行電纜。它們可以直接連接到節點以進行故障排除。如需如何連接至節點序列埠的相關資訊,請參閱 PowerScale:無法進行遠端連線時,客戶連線至序列埠的步驟

Affected Products

PowerScale, Isilon Gen6.5, Isilon Gen6, Isilon NL-Series, PowerScale OneFS, Isilon S-Series, Isilon Scale-out NAS, Isilon X-Series
Article Properties
Article Number: 000290053
Article Type: How To
Last Modified: 02 Jul 2025
Version:  1
Find answers to your questions from other Dell users
Support Services
Check if your device is covered by Support Services.