javiervila
Bronze

Lost connection to MD3000 when performing a controller replacement

Hi All,

I have an MD3000 with two controllers. A while ago the health status in Modular Disk Storage Manager showed that one controller needed replacement. I replaced it today and, because the error persisted, rebooted the system.

After the reboot I completely lost connection to the device.

These are the observations I made:

1. When the array powers up, the management interface of one of the controllers answers ping.

2. After some time, this management interface stops responding to ping.

3. I also tried with the replacement controller removed, with the same result.

I can ping the iSCSI interfaces, but it looks like this software will not let me add the array manually using those IP addresses, and of course not through the management interface either.
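To pin down exactly when the management interface stops answering (observations 1 and 2 above), a small watcher can record each change in reachability. This is only a minimal sketch: the function names are made up for illustration, `192.0.2.10` is a placeholder address, and `ping -c 1` assumes a Linux or macOS host (Windows uses `-n 1`).

```python
import subprocess
import time


def ping_once(host):
    """Return True if a single ICMP echo gets a reply (assumes `ping -c 1`)."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(host, probe=ping_once, interval=5.0, attempts=60,
          sleep=time.sleep, clock=time.time):
    """Probe `host` repeatedly; return (timestamp, reachable) for each state change."""
    changes = []
    last = None
    for _ in range(attempts):
        up = probe(host)
        if up != last:
            # Record only transitions: first answer, first timeout, and so on.
            changes.append((clock(), up))
            last = up
        sleep(interval)
    return changes


if __name__ == "__main__":
    # Placeholder address: replace with the controller's management IP.
    for stamp, reachable in watch("192.0.2.10", attempts=3, interval=1.0):
        print(time.strftime("%H:%M:%S", time.localtime(stamp)),
              "up" if reachable else "down")
```

Starting it right after power-up gives a rough time between boot and the point where the interface drops, which can then be compared with the timestamps in a serial capture.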

So in practice I have a huge amount of data that is completely lost because the array is inaccessible. What a mess!

Is there any way I can access this data?

PLEASE HELP!

Thanks! javi

11 Replies
Moderator
Moderator

RE: Lost connection to MD3000 when performing a controller replacement

Hello Javi,

When you added the replacement controller did you let it sync with the controller that was in your MD3000? If you use the serial cable that comes with your MD3000, then we can see if the controller is able to boot fully, or if it is getting stopped at some point.

Start up a terminal emulation program such as PuTTY, Tera Term, minicom, or HyperTerminal with these terminal settings: 115200-8-n-1.

Pull the controller from the system, wait about a minute, then insert it back into your MD3000i. You should then see the controller's boot process.
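For anyone unsure how to read the 115200-8-n-1 shorthand, it means 115200 baud, 8 data bits, no parity, and 1 stop bit. Below is a minimal sketch that splits the string into named fields. The function name is hypothetical, and the dictionary keys were chosen to match the keyword arguments of pyserial's `serial.Serial` in case you script the capture instead of using a terminal program.

```python
def parse_serial_settings(spec):
    """Split a 'baud-databits-parity-stopbits' string such as '115200-8-n-1'."""
    baud, data_bits, parity, stop_bits = spec.split("-")
    parity = parity.upper()
    # N = none, E = even, O = odd, M = mark, S = space
    if parity not in {"N", "E", "O", "M", "S"}:
        raise ValueError(f"unknown parity code: {parity!r}")
    return {
        "baudrate": int(baud),
        "bytesize": int(data_bits),
        "parity": parity,
        "stopbits": float(stop_bits) if "." in stop_bits else int(stop_bits),
    }


settings = parse_serial_settings("115200-8-n-1")
print(settings)
# {'baudrate': 115200, 'bytesize': 8, 'parity': 'N', 'stopbits': 1}
```

The same four values go into the connection dialog of whichever terminal program you use.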

Please let us know if you have any other questions.

DELL-Sam L
Dell | Social Outreach Services - Enterprise
Download the Dell Quick Resource Locator app today to access PowerEdge support content on your mobile device! (iOS, Android, Windows)

javiervila
Bronze

RE: Lost connection to MD3000 when performing a controller replacement

Hi! Thanks for the help. Every time I boot up the controller, the same thing happens. As you can see, the log ends with this message: "Exception: Data Abort cpsr:  60000013   pc:  0x". After this, there is no connection to the management interface. Here is the full log:

-=<###>=-
Attaching interface lo0... done

Adding 9767 symbols for standalone.
Error
09/20/17-10:11:21 (GMT) (tRootTask): NOTE:  I2C transaction returned 0x0423fe00




Reset, Power-Up Diagnostics - Loop 1 of 1
3600 Processor DRAM
     01 Data lines                                                  Passed
     02 Address lines                                               Passed
3300 NVSRAM
     01 Data lines                                                  Passed
5900 Ethernet 91c111 #1
     01 Register read                                               Passed
     02 Register test                                               Passed
3A00 NAND Flash
     06 Bad Blocks Test                                             Passed
2310 Application Accelerator Unit
     01 AAU Register Test                                           Passed
6D00 LSI SAS 1068 IOC--Base Board
     01 IOC Register Read Test                                      Passed
     02 IOC Register Address Lines Test                             Passed
     03 IOC Register Data Lines Test                                Passed
6F01 QLOGIC EP4032 CHIP 0
     01 Register Read Test                                          Passed
     02 Register Address Lines Test                                 Passed
     03 Register Data Lines Test                                    Passed
3900 Real-Time Clock
     01 RT Clock Tick                                               Passed
Diagnostic Manager exited normally.


Current date: 09/20/17  time: 02:10:07

Send <BREAK> for Service Interface or baud rate change
09/20/17-10:11:40 (GMT) (tRAID): NOTE:  Set Powerup State
09/20/17-10:11:40 (GMT) (tRAID): NOTE:  SOD Sequence is Normal, 0
09/20/17-10:11:40 (GMT) (tRAID): NOTE:  SOD: removed SAS host from index 0
09/20/17-10:11:40 (GMT) (tRAID): NOTE:  In iscsiIOQLIscsiInitDq.  iscsiIoFstrBas
e = 0x0
09/20/17-10:11:40 (GMT) (tRAID): NOTE:  Turning on tray summary fault LED
09/20/17-10:11:42 (GMT) (tRAID): NOTE:  SYMBOL: SYMbolAPI registered.
09/20/17-10:11:42 (GMT) (tRAID): NOTE:  lost persistent dq data because buffer w
as modified or size changed.
esmc0: LinkUp event
09/20/17-10:11:43 (GMT) (tNetCfgInit): NOTE:  Network Ready
09/20/17-10:11:46 (GMT) (tRAID): NOTE:  Initiating Drive channel: ioc:0 bringup
09/20/17-10:11:48 (GMT) (tRAID): NOTE:  IOC Firmware Version: 00-24-63-00
09/20/17-10:11:56 (GMT) (tSasEvtWkr): NOTE:  sasIocPhyUp: chan:1 phy:0 prevNumAc
tivePhys:2 numActivePhys:2
09/20/17-10:11:56 (GMT) (tSasEvtWkr): NOTE:  sasIocPhyUp: chan:1 phy:1 prevNumAc
tivePhys:2 numActivePhys:2
09/20/17-10:12:06 (GMT) (tRAID): NOTE:  IonMgr: Drive Interface Enabled
09/20/17-10:12:06 (GMT) (tRAID): NOTE:  SOD: Instantiation Phase Complete
09/20/17-10:12:06 (GMT) (tRAID): WARN:  No attempt made to open Inter-Controller
 Communication Channels
09/20/17-10:12:06 (GMT) (tRAID): NOTE:  Failing The Alternate Controller
09/20/17-10:12:06 (GMT) (tRAID): WARN:  Alt Ctl Reboot:
                                Reboot CompID: 0x401
                                Reboot reason: 0x6
                                Reboot reason extra: 0x0
09/20/17-10:12:06 (GMT) (tRAID): NOTE:  holding alt ctl in reset
09/20/17-10:12:06 (GMT) (tRAID): NOTE:  LockMgr Role is Master
09/20/17-10:12:06 (GMT) (tRAID): WARN:  FBM:validateSubModel: Exception - Alt co
ntroller not ready
09/20/17-10:12:06 (GMT) (tSasDiscCom): NOTE:  SAS Discovery complete task spawne
d
09/20/17-10:12:07 (GMT) (tRAID): NOTE:  spmEarlyData: No data available
09/20/17-10:12:07 (GMT) (sasCheckExpanderSet): NOTE:  Expander Firmware Version:
 0116-e05c
09/20/17-10:12:07 (GMT) (sasCheckExpanderSet): NOTE:  Expander SAS address: Hi =
 x5a4badb4 Low = x4e0f0f10
09/20/17-10:12:12 (GMT) (tSasDiscCom): WARN:  SAS: Initial Discovery Complete Ti
me: 30 seconds
09/20/17-10:12:12 (GMT) (tRAID): NOTE:  WWN baseName 0004a4ba-db4e0c98 (valid==>
SigMatch)
09/20/17-10:12:12 (GMT) (tRAID): NOTE:  IonMgr: Host Interface Enabled
09/20/17-10:12:12 (GMT) (tRAID): NOTE:  SOD: Pre-Initialization Phase Complete
09/20/17-10:12:13 (GMT) (tRAID): WARN:  BID: initialize(): Power latched!
09/20/17-10:12:23 (GMT) (tRAID): NOTE:  ACS: Icon ping to alternate failed: -2,
resp: 0
09/20/17-10:12:23 (GMT) (tRAID): NOTE:  ACS: autoCodeSync(): Process start. Comm
 Mode: 0, Status: 0
09/20/17-10:12:23 (GMT) (tRAID): WARN:  ACS: autoCodeSync(): Skipped since alt n
ot communicating.
09/20/17-10:12:23 (GMT) (tRAID): NOTE:  SOD: Code Synchronization Initialization
 Phase Complete
09/20/17-10:12:23 (GMT) (tRAID): NOTE:  Caught IconSendInfeasibleException Error
 in iop::requestAltIopDelay
09/20/17-10:12:24 (GMT) (tRAID): NOTE:  CheckInMonitor: Check-in failed (IconSen
dInfeasibleException Error)
09/20/17-10:12:24 (GMT) (NvpsPersistentSyncM): NOTE:  NVSRAM Persistent Storage
updated successfully
09/20/17-10:12:24 (GMT) (tRAID): NOTE:  USM Mgr initialization complete with 0 r
ecords.
09/20/17-10:12:24 (GMT) (tRAID): WARN:  Received IconSendInfeasibleException Err
or adding small edr records from alt controller
09/20/17-10:12:25 (GMT) (tRAID): WARN:  spm: unable to exchange features, assumi
ng none
09/20/17-10:12:25 (GMT) (tRAID): NOTE:  SPM acquireObjects exception: IconSendIn
feasibleException Error
09/20/17-10:12:25 (GMT) (tRAID): NOTE:  DBRead               0.176 secs
09/20/17-10:12:25 (GMT) (tRAID): NOTE:  sas: Peering Disabled (Alt Unavailable)
09/20/17-10:12:26 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
 03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
09/20/17-10:12:53 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
 timeout
09/20/17-10:12:53 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
eout, cmd = 0x69
09/20/17-10:12:54 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
:/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
09/20/17-10:12:54 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
 returned -1

09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:12:54 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:12:54 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:12:55 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
 03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
09/20/17-10:13:23 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
 timeout
09/20/17-10:13:23 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
eout, cmd = 0x69
09/20/17-10:13:23 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
:/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
09/20/17-10:13:23 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
 returned -1

09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:13:23 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:13:23 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:13:25 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
 03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
09/20/17-10:13:52 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
 timeout
09/20/17-10:13:52 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
eout, cmd = 0x69
09/20/17-10:13:53 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
:/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
09/20/17-10:13:53 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
 returned -1

09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:13:53 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:13:53 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:13:54 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
 03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
09/20/17-10:14:21 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
 timeout
09/20/17-10:14:21 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
eout, cmd = 0x69
09/20/17-10:14:22 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
:/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
09/20/17-10:14:22 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
 returned -1

09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:14:22 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:14:22 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:14:23 (GMT) (tRAID): NOTE:  QLStartFw: Downloading Driver's FW image
 03.00.01.47 from 0058c2e0 4c0c8 bytes , result 0
09/20/17-10:14:50 (GMT) (tRAID): WARN:  QLMailboxCommand: Cmd = 0069, completion
 timeout
09/20/17-10:14:50 (GMT) (tRAID): WARN:  QLMailboxCommand: command completion tim
eout, cmd = 0x69
09/20/17-10:14:51 (GMT) (tRAID): NOTE:  Qlogic coredump file written to 'H2BFR4J
:/tmp/QLogic_Coredump_port_0_H2BFR4J',rc 204E50, expected 204E50
09/20/17-10:14:51 (GMT) (tRAID): WARN:  Qlogic coredump file write failed.fclose
 returned -1

09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc
eeds threshold.
09/20/17-10:14:52 (GMT) (tRAID): ERROR: QLInitializeDevice: QLStartAdapter faile
d
09/20/17-10:14:52 (GMT) (tRAID): ERROR: QLAddDevice: controller/device/chip init
ialization failed.
09/20/17-10:14:52 (GMT) (tRAID): ERROR: qlgEnableHostInterface: QLInitializeDevi
ce failed.
09/20/17-10:14:52 (GMT) (tRAID): NOTE:  ****************************************
****************************************
09/20/17-10:14:52 (GMT) (tRAID): NOTE:    QLogic Target Application, Version 2.0
1.08 6-13-2005 (W2K)
09/20/17-10:14:52 (GMT) (tRAID): NOTE:          iSCSI Target Application
09/20/17-10:14:52 (GMT) (tRAID): NOTE:   ***************************************
*****************************************

Exception: Data Abort
cpsr:  60000013   pc:  0x
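
A capture like this is easier to scan once the lines that the terminal wrapped at 80 columns are rejoined and the WARN/ERROR entries are counted per subsystem. The following is a rough sketch, not a Dell tool: the 80-column width, the timestamp pattern, and the function names are assumptions inferred from this capture. A line of 79 characters is also treated as wrapped, to tolerate a trailing space lost in copy and paste.

```python
import re
from collections import Counter

WIDTH = 80  # assumed terminal width at which the capture was wrapped
STAMP = re.compile(r"^\d\d/\d\d/\d\d-\d\d:\d\d:\d\d \(GMT\)")
ENTRY = re.compile(
    r"^\d\d/\d\d/\d\d-\d\d:\d\d:\d\d \(GMT\) "
    r"\((?P<task>[^)]+)\): (?P<level>[A-Z]+):\s*(?P<msg>.*)$"
)


def unwrap(lines):
    """Rejoin lines that the terminal split at WIDTH columns."""
    out = []
    pad = None  # spaces to restore before a continuation; None = previous line not full
    for raw in lines:
        line = raw.rstrip("\n")
        if pad is not None and out and line.strip() and not STAMP.match(line):
            out[-1] += " " * pad + line
        else:
            out.append(line)
        pad = WIDTH - len(line) if WIDTH - 1 <= len(line) <= WIDTH else None
    return out


def summarize(lines):
    """Count WARN and ERROR entries by (level, first token of the message)."""
    counts = Counter()
    for line in unwrap(lines):
        m = ENTRY.match(line)
        if m and m["level"] in ("WARN", "ERROR"):
            counts[(m["level"], m["msg"].split(":")[0])] += 1
    return counts


# A few lines copied from the capture above, still wrapped.
SAMPLE = [
    "09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail",
    "ed.  Stat f000",
    "09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed",
    "09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc",
    "eeds threshold.",
]

for (level, source), n in summarize(SAMPLE).items():
    print(level, source, n)
```

Run over the whole capture (for example `summarize(open("capture.txt"))`, where the file name is a placeholder), this makes the repeated QLStartFw / QLMailboxCommand / QLGetFwState cycle stand out before the final Data Abort.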

Moderator
Moderator

RE: Lost connection to MD3000 when performing a controller replacement

Hello Javi,

Thanks for the serial capture; it helps. After looking at it, I see the following error:

09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLProcessSystemError: Restart RISC
09/20/17-10:14:51 (GMT) (tRAID): ERROR: QLGetFwState: MBOX_CMD_GET_FW_STATE fail
ed.  Stat f000
09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: Status after Get FW State
 4543
09/20/17-10:14:51 (GMT) (tRAID): NOTE:  QLRebootTimer: QLGetFwState failed
09/20/17-10:14:52 (GMT) (tRAID): WARN:  QLStartAdapter: ControllerErrorCount exc
eeds threshold.

When I see that error, it normally means that the controller is dead. I know you stated that you already replaced the controller once. The controller can't get through its own POST, which is why you are getting this error. What you need to determine is whether this is a slot issue or a controller issue. If you put the controller in the other slot and it reports the same error, then it is the controller.

Please let us know if you have any other questions.

DELL-Sam L
Dell | Social Outreach Services - Enterprise

javiervila
Bronze

RE: Lost connection to MD3000 when performing a controller replacement

Hi Sam,

Thanks for the help. We got another replacement controller, inserted it with no other controller in the array, and were able to boot the storage up.

Can you tell me if there is a specific procedure for adding a new controller to an MD3000 that is currently running on only one controller? Recovery Guru just says to attach it and that's it.

Cheers,

Javi

Moderator
Moderator

RE: Lost connection to MD3000 when performing a controller replacement

Hello Javi,

If the system was running dual controllers before, then yes, you insert the controller and wait about 10 minutes. The 10 minutes allow the replacement controller to sync with the current controller and gather all the information. Once that is done, all you need to do is bring the controller online in MDSM.

If your MD3000 is running in simplex mode then you will need to do the conversion to duplex mode. Here is a guide that explains how that is done. http://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_powervault/powervault-md30...

Please let us know if you have any other questions.

DELL-Sam L
Dell | Social Outreach Services - Enterprise

javiervila
Bronze

RE: Lost connection to MD3000 when performing a controller replacement

Hi Sam! Thanks for the information. I have always had two controllers, but I do not know whether I am in simplex or duplex mode. How can I check? Does this information say I am running in duplex mode?

Ethernet port:              1
   Link status:             Up
   MAC address:             a4:ba:....
   Negotiation mode:        Manual setting
      Port speed:           100 Mbps
      Duplex mode:          Full duplex
   Network configuration:   Static

Moderator
Moderator

RE: Lost connection to MD3000 when performing a controller replacement

Hello javiervila,

If you have had 2 controllers then you are already in duplex mode. So I would just insert the controller and give it about 10 minutes to sync with the active controller. Once that is complete then you will want to online it in MDSM.

Please let us know if you have any other questions.

DELL-Sam L
Dell | Social Outreach Services - Enterprise

javiervila
Bronze

RE: Lost connection to MD3000 when performing a controller replacement

Hi Sam,

I was able to replace the controller, and it is now online. However, Recovery Guru shows two errors:

1)

Storage array:  STORAGE_PFN_2
Component reporting problem:     Thermal sensor  
  Status:     Not available
  Location:  Expansion enclosure 0
  Component requiring service:  Temperature sensor

2)

Storage array:  STORAGE_PFN_2
Component reporting problem:     Host Board Left
  Status:     Not available
  RAID Controller Module:  Slot 0
  Service action (removal) allowed:  No
    Service action LED on component:  No

Are these critical errors? How can I solve them?

Thanks,

Javier

Moderator
Moderator

RE: Lost connection to MD3000 when performing a controller replacement

Hello Javier,

When you replace a controller, it is not uncommon to see these errors come up. Once the controller has been replaced, I would give the MD3000 about 5 minutes, then run the check in Recovery Guru again to see if the errors are still present.

Please let us know if you have any other questions.

DELL-Sam L
Dell | Social Outreach Services - Enterprise
