July 1st, 2016 20:00

CRITICAL - R710 PERC 6/i VD offline due to firmware bug

I had a VD offline due to a firmware bug.

I'm using the latest firmware version.

This is what I can see in our logs:

07/02/16  3:31:57: EVT#30207-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 0f f1 b4 46 00 00 08 00, Sense: 6/29/cd
07/02/16  3:31:57: Raw Sense for PD 5: f0 00 06 00 23 5f 29 0a 00 00 00 00 29 cd ce 00 00 00
07/02/16  3:31:57: EVT#30208-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 0a 0a 45 0e 00 00 08 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:57: EVT#30209-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 0b a0 e5 86 00 00 18 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:57: EVT#30210-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 09 88 c5 fe 00 00 02 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:57: EVT#30211-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 0e 90 4e 00 00 00 56 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:57: EVT#30212-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 28 00 0f f1 b4 46 00 00 08 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:57: EVT#30213-07/02/16  3:31:57: 113=Unexpected sense: PD 05(e0x20/s5) Path 5000c500134f0029, CDB: 00 00 00 00 00 00, Sense: 2/04/01
07/02/16  3:31:57: Raw Sense for PD 5: 70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00
07/02/16  3:31:58: MPI_EVENT_SAS_DISCOVERY: PortBitmap ff - Discovery is in progress
07/02/16  3:31:58: MPI_EVENT_SAS_DISCOVERY: PortBitmap 0 - Discovery is complete
07/02/16  3:31:58: Disc-prog= 0....resetProg=0 aenCount=0 transit=0 
07/02/16  3:31:59: MPI_EVENT_SAS_DISCOVERY: PortBitmap ff - Discovery is in progress
07/02/16  3:31:59:  MPI_EVENT_SAS_PHY_LINK_STATUS - PhyNum 5 DevHandle 6 Link 09
07/02/16  3:31:59:  MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
07/02/16  3:31:59: MPT_EventDeviceStatusChange: Device Removed DevId 5 Tgt 5 Sas 5000c500:134f0029
07/02/16  3:31:59:  curQdepth 0 WaitQCount 4 path 0 flags:f0500005 
07/02/16  3:31:59:  DM_DevicePathRemoved devId 5 Tid 5 Path 0
07/02/16  3:31:59: PD 5 Removed :DeviceCount=6
07/02/16  3:31:59: MPI_EVENT_SAS_DISCOVERY: PortBitmap 0 - Discovery is complete
07/02/16  3:31:59: Disc-prog= 0....resetProg=0 aenCount=0 transit=0 
07/02/16  3:31:59:  DM_PdScsiTypeSet: Pd 5 type 1f isSata 0 
07/02/16  3:31:59: EVT#30214-07/02/16  3:31:59: 112=Removed: PD 05(e0x20/s5)
07/02/16  3:31:59: EVT#30215-07/02/16  3:31:59: 248=Removed: PD 05(e0x20/s5) Info: enclPd=20, scsiType=0, portMap=05, sasAddr=5000c500134f0029,0000000000000000
07/02/16  3:31:59: EVT#30216-07/02/16  3:31:59: 114=State change on PD 05(e0x20/s5) from ONLINE(18) to FAILED(11)
07/02/16  3:31:59: EVT#30217-07/02/16  3:31:59:  81=State change on VD 00/0 from OPTIMAL(3) to PARTIALLY DEGRADED(1)
07/02/16  3:31:59: modify_log_drv_state: oldState: 3  newState: 1  pinned_cache_present: 0  targetId: 0
07/02/16  3:31:59: EVT#30218-07/02/16  3:31:59: 250=VD 00/0 is now PARTIALLY DEGRADED
07/02/16  3:31:59: EVT#30219-07/02/16  3:31:59: 114=State change on PD 05(e0x20/s5) from FAILED(11) to UNCONFIGURED_BAD(1)
07/02/16  3:32:04: EVT#30220-07/02/16  3:32:04: 332=Enclosure PD 20(c None/p0) element (SES code 0x17) status changed
07/02/16  3:32:04: SES_BackplaneMapping: Un-Associated device on enclPd 20 StsCode = 6 elmtType 17 elmtIndex 5 slotPd =5 SasAddr =0
07/02/16  3:32:04: SES_MarkBadElement: enclPd 20 timeDiff e1a1b slot 5 badElmt 1 retryCnt 0 oldTime:0 currentTime:e1a1b 
07/02/16  3:32:07: DM_REC: TUR Failed, devId =5 
07/02/16  3:32:09: SES_BackplaneMapping: Un-Associated device on enclPd 20 StsCode = 6 elmtType 17 elmtIndex 5 slotPd =5 SasAddr =0
07/02/16  3:32:09: SES_MarkBadElement: enclPd 20 timeDiff 5 slot 5 badElmt 1 retryCnt 1 oldTime:e1a1b currentTime:e1a20 
07/02/16  3:32:11: MPI_EVENT_SAS_DISCOVERY: PortBitmap ff - Discovery is in progress
07/02/16  3:32:11:  MPI_EVENT_SAS_PHY_LINK_STATUS - PhyNum 5 DevHandle 6 Link 90
07/02/16  3:32:11:  MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
07/02/16  3:32:11: MPT_EventDeviceStatusChange: Device Inserted Tgt 5 Sas 5000c500:134f0029
07/02/16  3:32:11: MPI_EVENT_SAS_DISCOVERY: PortBitmap 0 - Discovery is complete
07/02/16  3:32:11: DM_DevMgrIsChipInit 0 State 400 
07/02/16  3:32:11: DM: DM_DevSSUCallback TID 5 FAILED Cnt 0 Retry 0 Status 2
07/02/16  3:32:11: DM : DM_DevSSUCallback  SENSE Len 12 Code 70 senseKey 6 asc 29 ascq cd
07/02/16  3:32:11:  DM_DevNotifyRAID: Notify Done. Check for Removal 
07/02/16  3:32:11: gDevInfo=a11ebd20, size=140
07/02/16  3:32:11: Total Device = 7  
07/02/16  3:32:11: PD   Flags    State Type Size     S N Vendor   Product          Rev  P C ID SAS Addr         Port Phy DevH
07/02/16  3:32:11: ---  -------- ----- ---- -------- - - -------- ---------------- ---- - - -- ---------------- ---- --- ----
07/02/16  3:32:11: 0    f0400005 00020 00   22ecb25b 0 0 SEAGATE  ST3300657SS      0008 0 0 00 5000c500094d8a29 00   00  0a
07/02/16  3:32:11: 1    f0400005 00020 00   11177327 0 0 SEAGATE  ST3146356SS      HS11 0 0 01 5000c5001351cf75 01   01  0b
07/02/16  3:32:11: 2    f0400005 00020 00   11177327 0 0 SEAGATE  ST3146356SS      HS11 0 0 02 5000c50013514b99 02   02  0e
07/02/16  3:32:11: 3    f0400005 00020 00   11177327 0 0 SEAGATE  ST3146356SS      HS11 0 0 03 5000c500135178b1 03   03  0f
07/02/16  3:32:11: 4    f0400005 00020 00   22eec12f 0 0 HITACHI  HUS156030VLS600  A5D0 0 0 04 5000cca02a556ded 04   04  0c
07/02/16  3:32:11: 5    f0400005 00020 00   11177327 0 0 SEAGATE  ST3146356SS      HS11 0 0 05 5000c500134f0029 05   05  0d
07/02/16  3:32:11: 20   00400005 00020 0d   00000000 0 0 DP       BACKPLANE        1.07 0 0 20 500240807e894400 09   08  09
07/02/16  3:32:11: 100  00400005 00020 03   00000000 0 0 LSI      SMP/SGPIO/SEP    1909 0 0 ff                0 00   ff  00
07/02/16  3:32:11: PhyId 0 Sas 5000c500094d8a29 Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 1 Sas 5000c5001351cf75 Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 2 Sas 5000c50013514b99 Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 3 Sas 5000c500135178b1 Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 4 Sas 5000cca02a556ded Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 5 Sas 5000c500134f0029 Type 1 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 0 Sas 0 Type 0 IsSata 0, Smp 0:0
07/02/16  3:32:11: PhyId 0 Sas 0 Type 0 IsSata 0, Smp 0:0
07/02/16  3:32:11: Load Balance Statistics Path0PDs 0 Path1PDs 0
07/02/16  3:32:11: EVT#30221-07/02/16  3:32:11:  91=Inserted: PD 05(e0x20/s5)
07/02/16  3:32:11: EVT#30222-07/02/16  3:32:11: 247=Inserted: PD 05(e0x20/s5) Info: enclPd=20, scsiType=0, portMap=05, sasAddr=5000c500134f0029,0000000000000000
07/02/16  3:32:11: ArDiskTypeMisMatch : NO_MIXING_VIOLATION  array=0  destPD=5
07/02/16  3:32:11: EVT#30223-07/02/16  3:32:11: 114=State change on PD 05(e0x20/s5) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
07/02/16  3:32:19: EVT#30224-07/02/16  3:32:19: 332=Enclosure PD 20(c None/p0) element (SES code 0x17) status changed
07/02/16  3:36:53: mfiIsr: idr=00000020
07/02/16  3:36:53: Driver detected possible FW hang, halting FW.
07/02/16  3:36:53: Pending Command Details:

07/02/16  3:36:53: cmdId= 68: cmd=11, cmdStat=0, num_sg_elements=1, status=1 [PCI_COMMAND]
07/02/16  3:36:53: mfa=cd856800, mf=a041a000, mfSge=a041a028, bytesTransferred=0, next ffff
07/02/16  3:36:53: startTime=0, lines=a17c0680, lineMap=0, activeRecoveryCount=0, lockPromotedByRec=0
07/02/16  3:36:53: ldbbmAlreadyTried=0, ldbbmIssueWriteAsWV=0
07/02/16  3:36:53: ld=0, ioFlags=0, start_block=3fc6d0c6, num_blocks=8, savedNumBlocks=8
07/02/16  3:36:53: group: 0068
07/02/16  3:36:53: start_row=1fe368, start_strip=7f8da1, num_strips=1
07/02/16  3:36:53: ref_in_start_stripe=46, ref_in_end_stripe=4d, num_lines=1, num_lines_to_be_processed=1
07/02/16  3:36:53: wait_q_next_ptr=0, num_lines_in_wait_q=0, immediate_already_tried=0, bbmHead=ffffffff, bbmTail=ffffffff
07/02/16  3:36:53: rmw: op=0, error=0, first_data_arm=0, last_data_arm=0
07/02/16  3:36:53: wt: countRowsNotUsingReadPeer=0


07/02/16  3:36:53: cmdId=199: cmd=11, cmdStat=0, num_sg_elements=3, status=1 [PCI_COMMAND]
07/02/16  3:36:53: mfa=cd821c00, mf=a0466400, mfSge=a0466428, bytesTransferred=0, next ffff
07/02/16  3:36:53: startTime=0, lines=a185a190, lineMap=0, activeRecoveryCount=0, lockPromotedByRec=0
07/02/16  3:36:53: ldbbmAlreadyTried=0, ldbbmIssueWriteAsWV=0
07/02/16  3:36:53: ld=0, ioFlags=0, start_block=2e839706, num_blocks=18, savedNumBlocks=18
07/02/16  3:36:53: group: 0199
07/02/16  3:36:53: start_row=1741cb, start_strip=5d072e, num_strips=1
07/02/16  3:36:53: ref_in_start_stripe=6, ref_in_end_stripe=1d, num_lines=1, num_lines_to_be_processed=1
07/02/16  3:36:53: wait_q_next_ptr=0, num_lines_in_wait_q=0, immediate_already_tried=0, bbmHead=ffffffff, bbmTail=ffffffff
07/02/16  3:36:53: rmw: op=0, error=0, first_data_arm=0, last_data_arm=0
07/02/16  3:36:53: wt: countRowsNotUsingReadPeer=0


07/02/16  3:36:53: cmdId=1ab: cmd=11, cmdStat=0, num_sg_elements=2, status=1 [PCI_COMMAND]
07/02/16  3:36:53: mfa=cd833800, mf=a046ac00, mfSge=a046ac28, bytesTransferred=0, next 2b7
07/02/16  3:36:53: startTime=0, lines=a18632b0, lineMap=0, activeRecoveryCount=0, lockPromotedByRec=0
07/02/16  3:36:53: ldbbmAlreadyTried=0, ldbbmIssueWriteAsWV=0
07/02/16  3:36:53: ld=0, ioFlags=0, start_block=2623167e, num_blocks=10, savedNumBlocks=10
07/02/16  3:36:53: group: 01ab
07/02/16  3:36:53: start_row=13118b, start_strip=4c462c, num_strips=2
07/02/16  3:36:53: ref_in_start_stripe=7e, ref_in_end_stripe=d, num_lines=1, num_lines_to_be_processed=2
07/02/16  3:36:53: wait_q_next_ptr=0, num_lines_in_wait_q=0, immediate_already_tried=0, bbmHead=ffffffff, bbmTail=ffffffff
07/02/16  3:36:53: rmw: op=0, error=0, first_data_arm=0, last_data_arm=0
07/02/16  3:36:53: wt: countRowsNotUsingReadPeer=0


07/02/16  3:36:53: cmdId=260: cmd=11, cmdStat=0, num_sg_elements=14, status=1 [PCI_COMMAND]
07/02/16  3:36:53: mfa=cd8e2c00, mf=a0498000, mfSge=a0498028, bytesTransferred=0, next ffff
07/02/16  3:36:53: startTime=0, lines=a18be600, lineMap=0, activeRecoveryCount=0, lockPromotedByRec=0
07/02/16  3:36:53: ldbbmAlreadyTried=0, ldbbmIssueWriteAsWV=0
07/02/16  3:36:53: ld=0, ioFlags=0, start_block=3a413936, num_blocks=a0, savedNumBlocks=a0
07/02/16  3:36:53: group: 0260
07/02/16  3:36:53: start_row=1d209c, start_strip=748272, num_strips=2
07/02/16  3:36:53: ref_in_start_stripe=36, ref_in_end_stripe=55, num_lines=1, num_lines_to_be_processed=2
07/02/16  3:36:53: wait_q_next_ptr=0, num_lines_in_wait_q=0, immediate_already_tried=0, bbmHead=ffffffff, bbmTail=ffffffff
07/02/16  3:36:53: rmw: op=0, error=0, first_data_arm=0, last_data_arm=0
07/02/16  3:36:53: wt: countRowsNotUsingReadPeer=0


07/02/16  3:36:53: Total Pending Commands = 4
[0]: fp=a00bee78, lr=a0c41918  -  _MonTask+1a8
[1]: fp=a00bf0a0, lr=a0cc17e4  -  mfiIdrIsr+124
[2]: fp=a00bf0b8, lr=e401e960  -  dispatchIsrs+c4
[3]: fp=a00bf0e8, lr=e401e9f0  -  external_IRQ+34
[4]: fp=a00bf100, lr=e401e074  -  wrapper__External_IRQ+74
[5]: fp=a00bf150, lr=a0c68358  -  TaskStartNext+12c
[6]: fp=a00bf190, lr=a0c519c0  -  set_state+54
[7]: fp=a00bf278, lr=a0c536d0  -  raid_task+864
[8]: fp=a00bffa0, lr=a0cbd384  -  _main+aa4
[9]: fp=a00bfff8, lr=fe001d58  -  __start+ce0
MonTask: line 3622 in file ../../raid/1078dma.c
UIC_ER=10000ac:5500063, UIC_MSR=0:40, MSR=21000, sp=a00bee78
MegaMon> 
T0: LSI Logic ROC firmware
T0: Copyright (C) LSI Logic, 2004
T0: Firmware version 1.22.52-1909 built on Sep 21 2012 at 15:29:16

T0: pciInit: O_PCI_SERVICE = 00000005
T0: Initializing memory pool size=00300B24 bytes
T0: Press '!' within 3 seconds to enter debugger before INIT
T3: LogInit: Flushing events from previous boot
T3: EVT#30225-07/02/16  3:36:53:  15=Fatal firmware error: Driver detected possible FW hang, halting FW.

T3: EVT#30226-07/02/16  3:36:53:  15=Fatal firmware error: Line 3622 in ../../raid/1078dma.c
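
If I'm reading the sense data right, the repeated Sense: 2/04/01 decodes to sense key 2 (NOT READY) with ASC/ASCQ 04/01 ("logical unit is in process of becoming ready"), and the first entry, 6/29/cd, is a UNIT ATTENTION with ASC 29h (power on or reset occurred) and a vendor-specific ASCQ. So the drive apparently dropped off and came back as if it had been power-cycled. For reference, this is how I pull those fields out of the "Raw Sense" lines (fixed-format sense data: the low nibble of byte 2 is the sense key, bytes 12/13 are ASC/ASCQ; the awk field number is the byte offset plus one):

    # extract sense key / ASC / ASCQ from a "Raw Sense" line
    echo "70 00 02 00 00 00 00 0a 00 00 00 00 04 01 00 00 00 00" \
      | awk '{printf "sense key %s, ASC/ASCQ %s/%s\n", $3, $13, $14}'
    # prints: sense key 02, ASC/ASCQ 04/01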

Moderator • 6.2K Posts

July 2nd, 2016 16:00

Hello

I had a VD offline due to a firmware bug.

How did you come to this conclusion?

Are you saying that because of these two lines?

T3: EVT#30225-07/02/16  3:36:53:  15=Fatal firmware error: Driver detected possible FW hang, halting FW.
T3: EVT#30226-07/02/16  3:36:53:  15=Fatal firmware error: Line 3622 in ../../raid/1078dma.c

I would suggest removing the physical disk that failed (PD 5). If the controller still experiences issues, then I would make sure you are using a validated driver. We either provide drivers for download, or a specific operating system version will contain a validated SAS driver. A device failure can cause communication issues with the controller, which could lead to the controller not working properly.
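
If you have MegaCLI or OpenManage on that host, you can confirm what the controller and OS are actually running. Something along these lines should work; the MegaCli path below is the usual install location, so adjust it for your system:

    # controller firmware package and driver info as MegaCLI reports them
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i -E 'firmware|driver'
    # version of the megaraid_sas driver the OS actually loaded
    modinfo megaraid_sas | grep -i version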

Thanks

93 Posts

July 2nd, 2016 17:00

PD 5 seems to work properly: after a simple reboot I ran a SMART extended self-test, a consistency check, and a patrol read. All came back OK.
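
(For the record, this is roughly what I ran. smartctl reaches the disk through the controller; the megaraid,5 device number matches PD 5 here, and /dev/sda and the MegaCli path are just how my box is set up:)

    # long SMART self-test on the disk behind the PERC
    smartctl -t long -d megaraid,5 /dev/sda
    # start a consistency check on all VDs, then a patrol read
    /opt/MegaRAID/MegaCli/MegaCli64 -LDCC -Start -Lall -aALL
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpPR -Start -aALL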

This is not the first time disks have been kicked out for no apparent reason. I opened a similar thread some weeks ago, where a whole RAID was destroyed after multiple disks were kicked out (4 disks in a 6-disk RAID 6).

I'm using XenServer and have updated to 6.5, which brought a newer megaraid_sas driver, but I don't think it's a driver issue. The problem appeared after I upgraded (as suggested by Dell) to the latest 6.3.3-0002 firmware for the PERC.

I would like to know whether this issue was caused by a disk failure, a firmware bug, or an older driver, so that it cannot happen again.

How can I check this? I have another very similar server and I don't want to lose data again.
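
One thing I can do in the meantime is dump the controller's persistent event log on both servers and compare them. I assume something like this works here too (again, the MegaCli path may differ):

    # dump the PERC event log to a file for comparison
    /opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -f perc-events.log -aALL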

Moderator • 6.2K Posts

July 2nd, 2016 18:00

I would like to know whether this issue was caused by a disk failure, a firmware bug, or an older driver, so that it cannot happen again.

I don't think any of those are the issue. I think the issue is that you have drives connected to the controller that are not validated. If the drives are not physically failing, then it is a communication issue. The suggestion to update firmware was the right one. If you cannot find firmware revisions for your drives and the controller that allow them to communicate properly, then I would suggest you stop using those drives.

Thanks

93 Posts

July 3rd, 2016 00:00

What do you mean by drives that are not validated?

These are Dell-certified disks with the latest firmware applied.
