Avamar Gen4s - SLC SSD Drive in internal slot 12 Failed

Issue summary: We have an Avamar deployment with a mix of Gen4S and Gen4 storage nodes. One of the nodes is reporting an SSD disk failure.

dmesg

-----------

[99791630.889675] sd 0:2:6:0: [sdg] Unhandled error code
[99791630.889686] sd 0:2:6:0: [sdg] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[99791630.889691] sd 0:2:6:0: [sdg] CDB: Read(10): 28 00 09 3f ef 0f 00 00 08 00
[99791630.889703] end_request: I/O error, dev sdg, sector 155184911

[99791638.111977] sd 0:2:6:0: [sdg] Synchronizing SCSI cache
[99791638.128284] ext3_abort called.
[99791638.128289] EXT3-fs error (device sdg1): ext3_journal_start_sb: Detected aborted journal
[99791638.128293] Remounting filesystem read-only
[99824841.616047] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616065] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616077] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616088] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0
[99824841.616099] EXT3-fs error (device sdg1): ext3_find_entry: reading directory #2 offset 0

dpn@backup8:~/>: df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 7.9G 1.8G 5.8G 23% /
/dev/sda1 114M 65M 44M 60% /boot
/dev/sda3 1.8T 1.4T 424G 77% /data01
/dev/sda7 1.5G 172M 1.3G 13% /var
/dev/sdb1 1.9T 1.1T 582G 69% /data02
/dev/sdc1 1.9T 1.1T 576G 70% /data03
/dev/sdd1 1.9T 1.1T 579G 69% /data04
/dev/sde1 1.9T 1.1T 577G 70% /data05
/dev/sdf1 1.9T 1.1T 568G 70% /data06
/dev/sdg1 91G 53G 35G 61% /ssd01

dpn@backup8:~/>: cd /ssd01

dpn@backup8:/ssd01/>: ls
ls: reading directory .: Input/output error

So basically /ssd01 (/dev/sdg1) is corrupted or has gone bad. Does that mean we have to replace the entire node? Is this SSD mirrored, i.e. is there a pair for it? Could an unmount and fsck fix this?

Unfortunately we do not have current support on this platform. Has anyone faced a similar issue, and what would be the best approach?

Responses(2)

ionthegeek

January 5th, 2021 07:00

If the SSD has gone offline like this, it's normally because it has failed at a hardware level. If that's the case, fsck will not fix it. The SSD in Gen4 and Gen4S nodes is not field replaceable, so this node will need to be replaced. The SSD is not mirrored but the data that was stored on the SSD can be recovered by rebuilding the node or rolling back to a recent checkpoint once the hardware issue is resolved.
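Before deciding that fsck is pointless, it can help to confirm that the kernel log really shows device-level errors rather than only filesystem damage. The following is a minimal sketch, not an Avamar tool: the function names are hypothetical, and it assumes the kernel messages look like the dmesg excerpt in the original post and that the log has been saved to a file first.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Avamar) for summarising the evidence
# in a saved kernel log and in a mounts table such as /proc/mounts.

# count_io_errors LOGFILE DEV
# Prints how many "I/O error, dev DEV," lines the log contains.
count_io_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c "I/O error, dev $2," "$1" || true
}

# journal_aborted LOGFILE DEV
# Succeeds if the log shows an aborted ext3 journal on a partition of DEV.
journal_aborted() {
    grep -q "EXT3-fs error (device $2[0-9]*): .*aborted journal" "$1"
}

# mount_is_ro MOUNTS_FILE MOUNTPOINT
# Succeeds if MOUNTPOINT is listed with the "ro" option in MOUNTS_FILE.
mount_is_ro() {
    awk -v mp="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

For example, after `dmesg > /tmp/dmesg.txt`, running `count_io_errors /tmp/dmesg.txt sdg` and `mount_is_ro /proc/mounts /ssd01` gives a quick picture. I/O errors on the device together with an aborted journal and a read-only remount are consistent with the hardware failure described above, though they do not prove it on their own.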

I recommend contacting support. You may be able to pay Time & Materials to have Dell take care of it.

iamtheroot

January 5th, 2021 11:00

Thanks for the update. It seems silly that a single SSD failure requires replacing the whole node, and expensive as well.
