Avamar Gen4S - SLC SSD Drive in Internal Slot 12 Failed
Issue summary: We have an Avamar deployment with a mix of Gen4S (storage node) and Gen4 (storage node) nodes. One of the nodes is reporting an SSD disk failure error.
dmesg
-----------
[99791630.889675] sd 0:2:6:0: [sdg] Unhandled error code
[99791630.889686] sd 0:2:6:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[99791630.889691] sd 0:2:6:0: [sdg] CDB: Read(10): 28 00 09 3f ef 0f 00 00 08 00
[99791630.889703] end_request: I/O error, dev sdg, sector 155184911
[99791638.111977] sd 0:2:6:0: [sdg] Synchronizing SCSI cache
[99791638.128284] ext3_abort called.
[99791638.128289] EXT3-fs error (device sdg1): ext3_journal_start_sb: Detected aborted journal
[99791638.128293] Remounting filesystem read-only
[99824841.616047] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616065] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616077] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616088] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616099] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
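For anyone triaging the same symptoms, here is a minimal Python sketch that pulls the per-device evidence out of dmesg output like the above. It assumes the kernel wording quoted in this post (other kernel versions may phrase these messages differently), needs Python 3.8+ for the `:=` syntax, and `summarize_dmesg` is just a hypothetical helper name, not an Avamar or Dell tool.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted in this post. Other kernel
# versions may word these messages differently (assumption).
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
HOSTBYTE = re.compile(r"\[(\w+)\] Result: hostbyte=(\w+)")
FS_ERROR = re.compile(r"EXT3-fs error \(device (\w+)\)")


def summarize_dmesg(lines):
    """Collect per-device evidence of a failing disk from dmesg lines.

    Returns (io_errors, host_status, fs_errors):
      io_errors:   disk name -> list of sectors that failed with an I/O error
      host_status: disk name -> last SCSI hostbyte reported for that disk
      fs_errors:   partition name -> count of EXT3-fs error messages
    """
    io_errors = {}
    host_status = {}
    fs_errors = Counter()
    for line in lines:
        # "end_request: I/O error, dev sdg, sector N" -> record the bad sector
        if m := IO_ERROR.search(line):
            io_errors.setdefault(m.group(1), []).append(int(m.group(2)))
        # "[sdg] Result: hostbyte=DID_BAD_TARGET" -> controller lost the target
        if m := HOSTBYTE.search(line):
            host_status[m.group(1)] = m.group(2)
        # "EXT3-fs error (device sdg1)" -> filesystem-level fallout
        if m := FS_ERROR.search(line):
            fs_errors[m.group(1)] += 1
    return io_errors, host_status, fs_errors
```

Fed the log above, this should flag `sdg` with `DID_BAD_TARGET` and one failed sector, plus six EXT3 errors on `sdg1`. On the node itself, the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`. A `DID_BAD_TARGET` hostbyte points at the device disappearing from the controller rather than at filesystem damage alone.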
dpn@backup8:~/>: df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda5   7.9G  1.8G  5.8G  23% /
/dev/sda1   114M   65M   44M  60% /boot
/dev/sda3   1.8T  1.4T  424G  77% /data01
/dev/sda7   1.5G  172M  1.3G  13% /var
/dev/sdb1   1.9T  1.1T  582G  69% /data02
/dev/sdc1   1.9T  1.1T  576G  70% /data03
/dev/sdd1   1.9T  1.1T  579G  69% /data04
/dev/sde1   1.9T  1.1T  577G  70% /data05
/dev/sdf1   1.9T  1.1T  568G  70% /data06
/dev/sdg1    91G   53G   35G  61% /ssd01
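To tie the failing disk from dmesg (`sdg`) to the affected mount point, a small Python sketch that parses `df` output like the above could look like this. It assumes mount points contain no spaces and that `df` did not wrap long device names onto a second line; `parse_df` and `mounts_on_disk` are hypothetical helper names.

```python
import re


def parse_df(text):
    """Map each filesystem device in `df` output to its mount point."""
    mounts = {}
    for line in text.splitlines()[1:]:  # skip the "Filesystem Size ..." header
        fields = line.split()
        if len(fields) >= 6:
            # Filesystem is the first column, "Mounted on" the last
            # (assumes the mount point itself has no spaces).
            mounts[fields[0]] = fields[-1]
    return mounts


def mounts_on_disk(mounts, disk):
    """Return sorted mount points for partitions of `disk` ("sdg" -> /dev/sdgN)."""
    pattern = re.compile(rf"/dev/{re.escape(disk)}\d+")
    return sorted(mp for dev, mp in mounts.items() if pattern.fullmatch(dev))
```

With the listing above, `mounts_on_disk(parse_df(output), "sdg")` would return `["/ssd01"]`, which matches the directory that `ls` fails on. Matching the full device name (rather than a prefix) keeps `sda` from also matching a hypothetical `sdaa1`.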
dpn@backup8:~/>: cd /ssd01
dpn@backup8:/ssd01/>: ls
ls: reading directory .: Input/output error
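The `ls` failure above is the kernel returning `EIO` for the directory read. A minimal Python sketch that checks a mount point the same way and distinguishes an I/O error from other failures (`check_readable` is a hypothetical helper; it only lists the directory and does not repair anything):

```python
import errno
import os


def check_readable(path):
    """List `path`; return "ok", "io-error", or the errno name of other failures."""
    try:
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Same condition `ls` reported as "Input/output error".
            return "io-error"
        # e.g. "ENOENT" for a missing path, "EACCES" for a permission problem
        return errno.errorcode.get(exc.errno, "unknown")
    return "ok"
```

Running it over each `/dataNN` and `/ssd01` mount would be expected to report `io-error` only for `/ssd01` on this node, which helps confirm that the data drives are unaffected.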
So basically /ssd01 (/dev/sdg1) appears to be corrupted or bad. Does that mean we have to replace the entire node? Is there a mirror (pair) for this SSD? Could an unmount and fsck fix this?
Unfortunately, we do not have a current support contract for this platform. Has anyone faced a similar issue, and what would be the best approach?
ionthegeek
January 5th, 2021 07:00
If the SSD has gone offline like this, it's normally because it has failed at the hardware level. If that's the case, fsck will not fix it. The SSD in Gen4 and Gen4S nodes is not field replaceable, so this node will need to be replaced. The SSD is not mirrored, but the data that was stored on it can be recovered by rebuilding the node or rolling back to a recent checkpoint once the hardware issue is resolved.
I recommend contacting support. You may be able to pay for Time & Materials service to have Dell take care of it.
iamtheroot
January 5th, 2021 11:00
Thanks for the update. It's so silly that a single SSD failure needs a whole node replaced. It's expensive to replace a whole node, too.