March 9th, 2017 09:00

After power loss previously working MD3220i no longer comes up

We were hit by a huge storm Tuesday morning that knocked out power, and unfortunately our backup generator didn't come online, so our entire rack lost power. Everything seems fine except our PowerVault MD3220i, which no longer comes online or lights up the drives when powered back on. I have replaced both power supplies and still cannot get it to work. This unit has worked flawlessly since 2013.

I took both controllers out, pulled the batteries out and put them back in.

I can ping the management IP of both controllers, and they show activity, but I cannot connect to either of them in the storage manager application.

I connected the serial cable and am receiving the following during boot:

-=<###>=-
Instantiating /ram as rawFs,  device = 0x1
Formatting /ram for DOSFS
Instantiating /ram as rawFs, device = 0x1
Formatting...Retrieved old volume params with %38 confidence:
Volume Parameters: FAT type: FAT32, sectors per cluster 0
  0 FAT copies, 0 clusters, 0 sectors per FAT
  Sectors reserved 0, hidden 0, FAT sectors 0
  Root dir entries 0, sysId (null)  , serial number f10000
  Label:"           " ...
Disk with 1024 sectors of 512 bytes will be formatted with:
Volume Parameters: FAT type: FAT12, sectors per cluster 1
  2 FAT copies, 1010 clusters, 3 sectors per FAT
  Sectors reserved 1, hidden 0, FAT sectors 6
  Root dir entries 112, sysId VXDOS12 , serial number f10000
  Label:"           " ...

RTC Error:  Real-time clock device is not working
OK.

Adding 14606 symbols for standalone.




Reset, Power-Up Diagnostics - Loop 1 of 1
3600 Processor DRAM
     01 Data lines                                                  Passed
     02 Address lines                                               Passed
3300 NVSRAM
     01 Data lines                                                  Passed
4410 Ethernet 82574 1
     01 Register read                                               Passed
     02 Register address lines                                      Passed
6D40 Bobcat
     02 Flash Test                                                  Passed
3700 PLB SRAM
     01 Data lines                                                  Passed
     02 Address lines                                               Passed
7000 SE iSCSI BE2 1
     01 Register Read Test                                          Passed
     02 Register Address Lines Test                                 Passed
     03 Register Data Lines Test                                    Passed
3900 Real-Time Clock
     01 RT Clock Tick                                               Passed
Diagnostic Manager exited normally.

Controller has been locked down due to Hardware errors:

================= EXCEPTION LOG =================
Serial number:     29T005W
Entry count:       8
Wrap-arounds:      0
First entry time:
Current Controller date/time: MAR-09-2017 06:51:58 AM
Current Local (User) date/time: MAR-09-2017 04:18:23 PM

---- Log Entry #0 (Core 0) DEC-11-2012 02:22:20 PM ----

WARNING: Reset by alternate controller

---- Log Entry #1 (Core 0) DEC-11-2012 02:46:05 PM ----

WARNING: Reset by alternate controller

---- Log Entry #2 (Core 0) DEC-11-2012 03:53:04 PM ----

WARNING: Reset by alternate controller

---- Log Entry #3 (Core 0) AUG-06-2013 09:01:28 PM ----

WARNING: Reset by alternate controller

---- Log Entry #4 (Core 0) NOV-15-2013 02:25:04 AM ----
11/15/13-10:06:49 (tNtbErrPolling): PANIC: PLX NTB Port 4 reg 0x000044a4 changed, original val 0x00000000, current val 0x00000010

Stack Trace for tNtbErrPolling:
0x0025ffac vxTaskEntry  +0x5c : vkiTask (0x15000308)
0x0016844c vkiTask      +0xec : ntbErrPolling ()
0x00143c20 ntbErrPolling+0x2a0: ntbRegCompare (0x4, 0xac8e10)
0x00143000 ntbRegCompare+0x100: _vkiCmnErr ()
0x00163544 _vkiCmnErr   +0x104: 0x00163780 (0x585580, 0x4f8be0, 0xd838e0)
0x00163b04 vkiLogShow   +0x544: psvJobAdd (0x1648a0, 0xd83a40, 0, 0)
0x00148c04 psvJobAdd    +0x64 : msgQSend ()
0x00402714 msgQSend     +0x61c: taskUnlock ()

---- Log Entry #5 (Core 0) NOV-15-2013 02:25:38 AM ----

WARNING: Reset by alternate controller

---- Log Entry #6 (Core 0) AUG-04-2014 09:28:29 PM ----

WARNING: Reset by alternate controller

---- Log Entry #7 (Core 0) DEC-15-2014 06:33:01 PM ----
12/16/14-02:43:26 (tNtbErrPolling): PANIC: PLX NTB Port 0 reg 0x00000364 changed, original val 0x00000000, current val 0x00000020

Stack Trace for tNtbErrPolling:
0x0026070c vxTaskEntry  +0x5c : vkiTask (0x15000308)
0x00168b4c vkiTask      +0xec : ntbErrPolling ()
0x00144308 ntbErrPolling+0x288: ntbRegCompare (0, 0xebf6a0)
0x00143700 ntbRegCompare+0x100: _vkiCmnErr ()
0x00163c44 _vkiCmnErr   +0x104: 0x00163e80 (0x585f20, 0x4f9320, 0xd84290)
0x00164204 vkiLogShow   +0x544: psvJobAdd (0x164fa0, 0xd843f8, 0, 0)
0x00149304 psvJobAdd    +0x64 : msgQSend ()
0x00402e54 msgQSend     +0x61c: taskUnlock ()

---- Log Entry #8 (Core 0) DEC-15-2014 06:33:39 PM ----

WARNING: Reset by alternate controller

---- Log Entry #9 (Core 0) DEC-15-2015 06:40:42 AM ----

Root Complex TLP header[0] 30008000
Root Complex TLP header[1] 01200033
Root Complex TLP header[2] 00000000
Root Complex TLP header[3] 00000000


PCI SERR Exception
   PLX PCI-E Switch  (Unit 0)
        VID 0x10b5 DID 0x8632 B0:D0:F0
        PCI Status = 0x4010
        Bridge Secondary PCI Status = 0x4000
   PLX PCI-E Bridge to Host Card  (Unit 1)
        VID 0x10b5 DID 0x8632 B1:D4:F0
        PCI Status = 0x4010
        PCI-E Device Status = 0x0005
        PCI-E AER Uncorrectable Status = 0x00040000
          Header Log 0 = 0x00000044
          Header Log 1 = 0x00000044
          Header Log 2 = 0x00000044
          Header Log 3 = 0x20008080
        PCI-E AER Correctable Status = 0x00000040

---- Log Entry #10 (Core 0) DEC-15-2015 06:40:45 AM ----

WARNING: Restart by watchdog time out

---- Log Entry #11 (Core 0) DEC-15-2015 06:41:19 AM ----

WARNING: Reset by alternate controller

---- Log Entry #12 (Core 0) AUG-06-2016 03:55:22 PM ----

WARNING: Reset by alternate controller

---- Log Entry #13 (Core 0) MAR-06-2017 09:43:27 PM ----
ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

---- Log Entry #14 (Core 0) MAR-06-2017 09:43:27 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

---- Log Entry #15 (Core 0) MAR-06-2017 09:46:14 PM ----
ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

---- Log Entry #16 (Core 0) MAR-06-2017 09:46:14 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

---- Log Entry #17 (Core 0) MAR-06-2017 09:47:35 PM ----

Faults are detected on all installed power supplies

---- Log Entry #18 (Core 0) MAR-06-2017 09:50:18 PM ----
ERROR: Port 0 Bad TLP Count 526208 exceeds threshold 16
ERROR: Port 4 Bad TLP Count 526208 exceeds threshold 16
ERROR: Port 5 Bad TLP Count 526208 exceeds threshold 16
ERROR: Port 6 Bad TLP Count 526208 exceeds threshold 16
ERROR: Port 0/4 Rx Err Count 128 exceeds threshold 16

---- Log Entry #19 (Core 0) MAR-06-2017 09:50:18 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0x678 val 0x80

03/09/17-16:18:23 (tSystem): ERROR: FPGA FW is out of date
"Rhone03 rev17" currently in use
"Rhone03 rev20" available for update
Current date: 03/09/17  time: 16:18:23

Send for Service Interface or baud rate change
03/09/17-16:18:27 (tNetCfgInit): NOTE:  eth0: LinkUp event
03/09/17-16:18:28 (tNetCfgInit): NOTE:  Acquiring network parameters for interface gei0 using DHCP
03/09/17-16:18:37 (ipdhcpc): NOTE:  netCfgDhcpReplyCallback :: received OFFER on interface gei0, unit 0
03/09/17-16:18:38 (ipdhcpc): NOTE:   DHCP server: 10.0.0.1
03/09/17-16:18:38 (ipdhcpc): WARN:   **WARNING** The DHCP Server did not assign a permanent IP for gei0.
03/09/17-16:18:38 (ipdhcpc): WARN:               Network access to this controller may eventually fail.
03/09/17-16:18:38 (ipdhcpc): NOTE:   DNS domain name: XXXXXX.com
03/09/17-16:18:38 (ipdhcpc): NOTE:   DHCP client name: md3220i-mgmt
03/09/17-16:18:38 (ipdhcpc): NOTE:   Client DNS name servers: 10.0.0.1
03/09/17-16:18:38 (ipdhcpc): NOTE:   Client IP routers:   10.0.0.1
03/09/17-16:18:38 (ipdhcpc): NOTE:   Assigned IP address: 10.0.0.122
03/09/17-16:18:38 (ipdhcpc): NOTE:   Assigned subnet mask: 255.255.255.0
03/09/17-16:18:38 (tNetReset): NOTE:  Network Ready


I replaced both power supplies, so I'm not sure why the error is still there or how to clear it. I got into the VxWorks shell, but I'm not sure how to clear the error so the controller starts again.

Anyone have an idea how to clear this?
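
For anyone trying to read a long exception log like the one above, here is a minimal Python sketch that lists each entry with its timestamp and a rough severity. It assumes the serial output was saved as text and that entry headers look exactly like the ones pasted here. The names `summarize` and `classify` and the three-entry `SAMPLE` excerpt are only illustrative; this is not a Dell tool.

```python
import re
from collections import Counter

# Header format copied from the serial output, e.g.
# "---- Log Entry #13 (Core 0) MAR-06-2017 09:43:27 PM ----"
ENTRY_RE = re.compile(
    r"^-+ Log Entry #(\d+) \(Core \d+\) (\S+ \d\d:\d\d:\d\d [AP]M) -+$"
)

# Short excerpt of the log above, used only to exercise the parser.
SAMPLE = """\
---- Log Entry #12 (Core 0) AUG-06-2016 03:55:22 PM ----

WARNING: Reset by alternate controller

---- Log Entry #13 (Core 0) MAR-06-2017 09:43:27 PM ----
ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16

---- Log Entry #17 (Core 0) MAR-06-2017 09:47:35 PM ----

Faults are detected on all installed power supplies
"""


def classify(line):
    """Rough severity taken from the keyword in the message line."""
    for level in ("PANIC", "ERROR", "WARNING"):
        if level + ":" in line:
            return level
    return "OTHER"


def summarize(log_text):
    """Return (entry number, timestamp, severity, first message line) per entry."""
    entries = []
    header = None  # set when a header was seen and its message is still pending
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ENTRY_RE.match(line)
        if match:
            header = (int(match.group(1)), match.group(2))
        elif header and line:
            entries.append((header[0], header[1], classify(line), line))
            header = None
    return entries


entries = summarize(SAMPLE)
for number, when, level, message in entries:
    print(f"#{number:<3} {when}  {level:<8} {message}")
print(Counter(level for _, _, level, _ in entries))
```

Only the first message line of each entry is kept, which is enough to see when the log switches from occasional "Reset by alternate controller" warnings to the bus errors recorded on MAR-06-2017.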

Moderator

March 10th, 2017 09:00

Hello kgroup,

Is the error you posted the complete output, and does it start over again when you are connected to the serial interface of controller 0? Also, are any of the virtual disks online, or are they offline as well?

Please let us know if you have any other questions.

March 10th, 2017 13:00

That error is the output of the serial PuTTY session. If I power down and bring up controller 0 or 1 individually, it continually outputs the same message about the bus errors and doesn't proceed any further.

I am getting that with both controllers. I was able to get to the shell, but none of the standard commands (e.g. lemClearLockdown) are recognized, almost as if nothing loaded at all on the controllers.

I forgot to answer your last question: none of the drive lights even come on. The controller enters a failure state almost immediately when powered on.

Moderator

March 15th, 2017 07:00

Hello kgroup,

Since you have tried booting with just a single controller and the system still won’t boot, I would say that your controllers need to be replaced. The controller that the output came from is showing signs that it has failed. If the output is the same for the alternate controller, then both need to be replaced.

Please let us know if you have any other questions.
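
As a rough aid for reading the PCI SERR entry in the posted log (Log Entry #9), the sketch below decodes the three status values shown there into bit names. The bit tables follow the standard PCI Express register layout as I understand it, and it is an assumption that the controller firmware prints those registers unmodified. The `decode` helper is hypothetical and not part of any Dell or VxWorks tooling.

```python
# Bit names for standard PCI Express registers (assumed to match what the
# controller firmware prints; verify against the PCIe specification).
AER_UNCORRECTABLE = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}
AER_CORRECTABLE = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}
DEVICE_STATUS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}


def decode(value, names):
    """List the name of every set bit; unknown bits are reported by number."""
    return [
        names.get(bit, f"bit {bit} (not in table)")
        for bit in range(32)
        if (value >> bit) & 1
    ]


# Values copied from Log Entry #9 in the original post.
print("Device Status      :", decode(0x0005, DEVICE_STATUS))
print("AER Uncorrectable  :", decode(0x00040000, AER_UNCORRECTABLE))
print("AER Correctable    :", decode(0x00000040, AER_CORRECTABLE))
```

Under those assumptions, the entry reports a fatal PCIe error on the switch port toward the host card, which points to a fault on the controller itself rather than in the power supplies.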
