May 30th, 2016 07:00

PS4000E with possible failed controller?

I have a PS4000E whose firmware I recently upgraded to 8.1.3. During the upgrade and subsequent reboot, the member would not come online and join the group. However, when we rebooted it again, it did.

Currently, it is running on controller 0, and all is well. As a test, I initiated a restart, failing it over to controller 1, and again, it would not rejoin the group. I can ping the member's management IP, and I can even SSH into it in order to restart it again, back to controller 0. But it will not join the group from controller 1.

I'd really like to resolve this. This has been a very reliable system, but I'm now in a position to have to schedule downtime for the next firmware upgrade because the failover isn't working as it should. My warranty is expired on this array, so I cannot just submit a ticket either.

Suggestions?

5 Practitioner

274.2K Posts

May 31st, 2016 09:00

Hello, 

The first thing is to make sure that both controllers are cabled up. In the GUI, when you select the member, do you see both controllers under the "Controllers" tab? Do they show running the same firmware?

You might need to get the serial cable and connect directly to that CM to see what the failure actually is. It could be a defective CompactFlash card.

Regards,

Don

15 Posts

May 31st, 2016 10:00

1. Yes, everything is cabled up correctly. Prior to the upgrade to the latest firmware, failover happened without issue.

2. Yes, both controllers are visible in the GUI.

3. When I SSH to the management IP of the member (not the group), I am connected to the member. I issue the restart command while pinging the management IP. The restart takes place, the failover happens. There is a brief pause in ping while the failover happens. After the restart, the member does not rejoin the group, but I am able to SSH into the member using the management IP. I then issue another restart command to get it to fail back to the working controller.

I'm not sure where to go to get any error logs for why the member will not rejoin the group. This has never happened prior to the firmware upgrade, and so my only suspects are a problem with the firmware on that controller, failed or failing NICs on the controller, or something else that I am missing.

15 Posts

May 31st, 2016 11:00

No, the support contract has expired. And I agree with verifying the cabling to ensure that the passive controller is actually located on active ports. That will be my next step. Thanks.

5 Practitioner

274.2K Posts

May 31st, 2016 11:00

Hello, 

 In the GUI, when you look at the controllers, do they both show running the same version of FW? 

From the description, since you can log into it after the failover, it's not likely a failed controller. I still suspect something with the network. I would check the switch ports the "bad" CM is connected to, to make sure those ports are up and in the correct VLAN. You might try swapping the cables one at a time between the active and passive controllers, or replace them altogether. Maybe try different switch ports. The members talk to each other over the iSCSI ports.

 Is that array under support contract? 

 Regards,

Don 

5 Practitioner

274.2K Posts

May 31st, 2016 13:00

Hopefully that resolves it.  

If not, and you can take some downtime, fail over and connect to the member again. At the Group CLI, run ping "-I {eth0 ip address} {destination ip}", where the destination is something like a gateway or server. Then do another with the IP address of eth1 on the PS4000.


GrpName> mem sel MEMBERNAME show eths
Name ifType          ifSpeed    Mtu  Ipaddress                     Status Errors DCB
---- --------------- ---------- ---- ----------------------------- ------ ------ ------
eth0 ethernet-csmacd 10 Gbps    9000 10.126.205.111                up     0      off
eth1 ethernet-csmacd 10 Gbps    9000 10.126.205.112                up     0      off
eth2 ethernet-csmacd 10 Mbps    1500                               down   0      off
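
If there are several interfaces to test, the per-interface ping commands can be generated from the "show eths" output instead of typed by hand. The following is only a sketch in Python, run on a workstation against pasted output, not on the array. The function name ping_commands and the way it picks out the IP column are my own assumptions for illustration; the only array syntax it relies on is the ping "-I {source} {destination}" form shown in this post.

```python
import re

# Matches a dotted-quad token such as 10.126.205.111 (no range checking).
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def ping_commands(show_eths_output, destination):
    """Build one Group CLI ping command per interface that is up and has an IP.

    Hypothetical helper: it only formats text, it does not talk to the array.
    """
    commands = []
    for line in show_eths_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and any wrapped "off" lines.
        if not fields or not fields[0].startswith("eth"):
            continue
        ips = [f for f in fields if IPV4.match(f)]
        if ips and "up" in fields:
            commands.append(f'ping "-I {ips[0]} {destination}"')
    return commands


# Sample copied from the "show eths" output in this post.
SAMPLE = """\
Name ifType ifSpeed Mtu Ipaddress Status Errors DCB
---- --------------- ---------- ---- ----------------------------- ------ ------ ------
eth0 ethernet-csmacd 10 Gbps 9000 10.126.205.111 up 0
off
eth1 ethernet-csmacd 10 Gbps 9000 10.126.205.112 up 0
off
eth2 ethernet-csmacd 10 Mbps 1500 down 0
off
"""

for cmd in ping_commands(SAMPLE, "10.126.192.1"):
    print(cmd)
```

Interfaces that are down or have no address (eth2 here) are skipped, so the list contains only the commands worth pasting into the Group CLI.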

GrpName> ping "-I 10.126.205.111 10.126.192.1"
PING 10.126.192.1 (10.126.192.1): 56 data bytes
64 bytes from 10.126.192.1: icmp_seq=0 ttl=255 time=0.000 ms
64 bytes from 10.126.192.1: icmp_seq=1 ttl=255 time=0.000 ms
^C
----10.126.192.1 PING Statistics----
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.000/0.000/0.000/0.000 ms


GrpName> ping "-I 10.126.205.112 10.126.192.1"
PING 10.126.192.1 (10.126.192.1): 56 data bytes
64 bytes from 10.126.192.1: icmp_seq=0 ttl=255 time=10.000 ms
64 bytes from 10.126.192.1: icmp_seq=1 ttl=255 time=10.000 ms
^C
----10.126.192.1 PING Statistics----
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 10.000/10.000/10.000/0.000 ms
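
When comparing results from eth0 and eth1, the line that matters is the packet-loss summary. As a rough sketch, assuming the summary line always has the same wording as in the output above, it can be pulled out of a saved session log with a few lines of Python. The function name parse_ping_summary is hypothetical.

```python
import re

# Pattern for the summary line as it appears in the ping output in this post.
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) packets received, ([\d.]+)% packet loss"
)


def parse_ping_summary(ping_output):
    """Return (transmitted, received, loss_percent), or None if no summary line."""
    match = SUMMARY.search(ping_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


# Sample copied from the eth1 ping test in this post.
SAMPLE = """\
PING 10.126.192.1 (10.126.192.1): 56 data bytes
64 bytes from 10.126.192.1: icmp_seq=0 ttl=255 time=10.000 ms
64 bytes from 10.126.192.1: icmp_seq=1 ttl=255 time=10.000 ms
----10.126.192.1 PING Statistics----
2 packets transmitted, 2 packets received, 0.0% packet loss
"""

sent, received, loss = parse_ping_summary(SAMPLE)
print(f"sent={sent} received={received} loss={loss}%")
```

Any loss above 0% from one interface but not the other points at that interface's cable or switch port rather than at the controller itself.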

Good luck,

Don

5 Practitioner

274.2K Posts

June 1st, 2016 15:00

You are very welcome!  I'm glad that it was a quick fix! 

Don 

15 Posts

June 1st, 2016 15:00

It has been over 15 months since I was physically present on-site. Somewhere along the line, the 2 NICs on the inactive controller were in fact disconnected. Problem found. Thanks!
