Lu_shaoyong
1 Nickel

Isilon X200 failure

Hi,

   Isilon X200, OneFS version 7.0.1.5

The node is reporting the following errors:

Device var 1,Provider ad4p7 disconnected. Boot mirror is critical.

Device root 1,Provider ad4p5 disconnected. Boot mirror is critical.

Unhealthy boot disk(ad4),mirror is degraded.

Output of # gmirror status:

Name                    Status    Components
mirror/root0            COMPLETE  ad7p4
                                  ad4p4
mirror/var-crash        COMPLETE  ad7p10
mirror/mfg              COMPLETE  ad7p9
                                  ad4p10
mirror/journal-backup   COMPLETE  ad7p8
                                  ad4p8
mirror/var1             DEGRADED  ad7p7
mirror/var0             COMPLETE  ad4p6
                                  ad7p6
mirror/root1            DEGRADED  ad7p5

Output of # atacontrol list:

ATA channel 0:
     Master:      no device present
     Slave:       no device present
ATA channel 1:
     Master:      no device present
     Slave:       no device present
ATA channel 2:
     Master:      ad4 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
     Slave:       no device present
ATA channel 3:
     Master:      no device present
     Slave:       ad7 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
ATA channel 4:
     Master:      no device present
     Slave:       no device present
ATA channel 5:
     Master:      no device present
     Slave:       no device present

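Based on the information above, I concluded that the ad4 boot drive has failed. However, the replacement procedure I generated with SolVe Generator is for replacing the slave, while the failed drive on site is the master.

The first few steps in the document, such as "Install a drive support package", seem to exist only to generate log information to keep for support. After that, the procedure is simply to shut down the node and replace the drive.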


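I now have the following questions:

1. I am no longer sure whether this situation actually requires replacing the hardware.

2. How does Isilon boot? In my case the master has failed. After I replace it, how can I confirm that the system will boot from the original slave boot drive?

3. I have an Isilon test node running OneFS 7.1.1.2, while the node that needs the replacement runs OneFS 7.0.1.5. If I install a boot drive carrying the newer OneFS version into the node running the older version, will that cause other problems?

4. I have two X200 nodes running OneFS 7.1.1.2. I pulled the master boot drive from one of them and used it to replace the master in the other, and the node would not boot.                                                      QQ图片20170121000337.jpg

So now I doubt whether my diagnosis is correct.

Any help from the experts would be much appreciated.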

Roger_Wu
4 Ruthenium

Re: Isilon X200 failure

Based on the following KB, I suggest upgrading first: ETA 194692: EMC Isilon nodes: Boot flash drives become non-operational due to excessive writes https://support.emc.com/kb/301967

These issues are addressed in OneFS 7.2.0.0, 7.1.1.2, 7.1.0.6, 7.0.2.12, and 6.5.5.29.

If any S200, S210, X200, X400, X410, NL400, or 108NL nodes in your cluster have experienced boot drive failures, EMC strongly recommends that you upgrade to the appropriate version of OneFS recommended below as soon as possible:

If the version of OneFS running on your cluster is:    Upgrade to:
OneFS 6.5.5.0 - 6.5.5.22                               OneFS 6.5.5.29
OneFS 7.0.1                                            OneFS 7.0.2.12
OneFS 7.0.2.0 - 7.0.2.11                               OneFS 7.0.2.12
OneFS 7.1.0.0 - 7.1.0.5                                OneFS 7.1.0.6
OneFS 7.1.1.1                                          OneFS 7.1.1.2
OneFS 7.2.0.0                                          No need to upgrade
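
As a quick way to read the table above for a given cluster, here is a minimal Python sketch of the lookup. The rows are transcribed from the quoted ETA text. The function names are hypothetical, and reading the "OneFS 7.0.1" row as "any 7.0.1.x release" is an assumption, not something the ETA states explicitly.

```python
# Upgrade targets transcribed from the ETA 194692 table quoted in this thread.
# Each row: (lowest affected version, highest affected version, recommendation).
UPGRADE_TABLE = [
    ((6, 5, 5, 0), (6, 5, 5, 22), "6.5.5.29"),
    # Assumption: the "OneFS 7.0.1" row is read as covering every 7.0.1.x release.
    ((7, 0, 1, 0), (7, 0, 1, 9999), "7.0.2.12"),
    ((7, 0, 2, 0), (7, 0, 2, 11), "7.0.2.12"),
    ((7, 1, 0, 0), (7, 1, 0, 5), "7.1.0.6"),
    ((7, 1, 1, 1), (7, 1, 1, 1), "7.1.1.2"),
    ((7, 2, 0, 0), (7, 2, 0, 0), "no upgrade needed"),
]


def parse_version(text):
    """Turn a dotted version string such as '7.0.1.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def recommended_upgrade(version):
    """Return the table's recommendation, or None if the version is not listed."""
    v = parse_version(version)
    for low, high, target in UPGRADE_TABLE:
        if low <= v <= high:  # tuples compare field by field
            return target
    return None


if __name__ == "__main__":
    # The node in this thread runs OneFS 7.0.1.5.
    print(recommended_upgrade("7.0.1.5"))
```

A version that the table does not list (for example 7.1.1.2, which already contains the fix according to the ETA) returns None rather than a guess.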

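If the upgrade fails, you will most likely need to contact a support engineer, because an upgrade may not succeed with a degraded boot drive: OneFS: Cannot perform upgrade with degraded boot drive https://support.emc.com/kb/456690

The remaining questions will need an Isilon expert to answer.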

Jeffey1
4 Germanium

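Re: Isilon X200 failure

shaoyong

Based on the gmirror status output you provided, the failed boot drive is ad7.

Untitled.png

Dell EMC's official documentation states that the component numbered ad7 is the slave boot drive. Here is a screenshot from the document:

Untitled_2.png

So I think you should replace the slave boot drive, not the master. For more information, see the Boot Drive Replacement Guide.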

Lu_shaoyong
1 Nickel

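Re: Re: Isilon X200 failure

Hi,

I have confirmed that it is ad4, not ad7.

123.png

Also, on a test machine I pulled J3 while the node was online and then checked the mirror status. It looked the same as the output I posted.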

Jeffey1
4 Germanium

Re: Isilon X200 failure

shaoyong

If it is ad4, then the master has failed, but the command output you provided shows ad7.

Untitled.png

Lu_shaoyong
1 Nickel

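Re: Re: Isilon X200 failure

Hi,

   Every mirror (except mirror/var-crash) has two components, one on ad4 and one on ad7. In my output the degraded mirrors list only the ad7 component and the ad4 component is missing, so the failed drive should be ad4.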
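
The reasoning above (a degraded mirror lists only the surviving component, so the absent disk is the failed one) can be sketched in Python. This is a rough illustration, not an Isilon or FreeBSD tool: the function names are hypothetical, the sample text is the gmirror status output from the first post, and the expected disk pair ad4/ad7 is taken from this thread.

```python
import re

# gmirror status output as posted at the top of this thread.
SAMPLE = """\
Name                    Status    Components
mirror/root0            COMPLETE  ad7p4
                                  ad4p4
mirror/var-crash        COMPLETE  ad7p10
mirror/mfg              COMPLETE  ad7p9
                                  ad4p10
mirror/journal-backup   COMPLETE  ad7p8
                                  ad4p8
mirror/var1             DEGRADED  ad7p7
mirror/var0             COMPLETE  ad4p6
                                  ad7p6
mirror/root1            DEGRADED  ad7p5
"""


def parse_gmirror_status(text):
    """Return {mirror name: (status, [components])} from gmirror status text."""
    mirrors = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Name":
            continue  # skip blank lines and the header row
        # Ignore parenthesised annotations that some outputs append to components.
        if fields[0].startswith("mirror/"):
            current = fields[0]
            comps = [f for f in fields[2:] if not f.startswith("(")]
            mirrors[current] = (fields[1], comps)
        elif current is not None:
            # Continuation line: another component of the current mirror.
            mirrors[current][1].extend(f for f in fields if not f.startswith("("))
    return mirrors


def disk_of(component):
    """Strip the partition suffix: 'ad7p10' -> 'ad7'."""
    return re.sub(r"p\d+$", "", component)


def missing_disks(mirrors, expected=("ad4", "ad7")):
    """For each DEGRADED mirror, list the expected disks with no component present."""
    report = {}
    for name, (status, comps) in mirrors.items():
        if status != "DEGRADED":
            continue
        present = {disk_of(c) for c in comps}
        report[name] = sorted(set(expected) - present)
    return report


if __name__ == "__main__":
    for name, disks in missing_disks(parse_gmirror_status(SAMPLE)).items():
        print(name, "is missing a component on:", ", ".join(disks))
```

On the posted output, both degraded mirrors (mirror/var1 and mirror/root1) still have their ad7 component, so the sketch reports ad4 as the missing side for each.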

The screenshot below shows the result after I pulled J3 on the test node:

1123.png

Jeffey1
4 Germanium

Re: Isilon X200 failure

shaoyong

In that case, you can follow the steps from page 4 onward in the Boot Drive Replacement Guide and go ahead with replacing the failed part.

Lu_shaoyong
1 Nickel

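Re: Isilon X200 failure

Hi,

   I followed this document. Of the steps that come before shutting down the node, the only one I skipped was "Install a drive support package", but that step also appears to be only for collecting information to provide to support.

   The procedure in the document is to shut down the node, replace the drive, and then power the node back on. After I did that, the node failed to boot with the following error: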

Executing GEOM bootdisk startup...

This system has 2 formatted boot disks (ad7 and ad4),

but the boot disk IDs are not a pair.

UnbootableBootdiskException: 5: Exception caught in startup attempt 1

Traceback (most recent call last):

  File "/usr/local/lib/python2.6/site-packages/isi/sys/bootdisk.py", line 1831, in startup

  File "/usr/local/lib/python2.6/site-packages/isi/sys/bootdisk.py", line 1741, in _startup

  File "/usr/local/lib/python2.6/site-packages/isi/sys/bootdisk.py", line 1667, in handle_bootdisk_ids

  File "/usr/local/lib/python2.6/site-packages/isi/sys/bootdisk.py", line 1628, in two_bootdisks_two_ids

UnbootableBootdiskException: 5

The system is unbootable.

GEOM start failed

Please contact EMC Customer Support:

United States: 1 800 782 4362 (1 800 SVC 4EMC)

Canada: 1 800 543 4782 (1 800 543 4SVC)

Worldwide Country Code: 1 508 497 7901

Command Options:

1) Enter recovery shell

2) Continue booting

3) Reboot

option> No handlers could be found for logger "lcd.library"

Jeffey1
4 Germanium

Re: Isilon X200 failure

shaoyong


Given your current situation, you can refer to: 457965 : OneFS 7.0.2 and 7.1.0: Node fails to boot completely after replacing a boot drive or joining a cluster https://support.emc.com/kb/457965

Roger_Wu
4 Ruthenium

Re: Isilon X200 failure

Did the node come back up in the end? Please share how you resolved it.
