January 16th, 2016 22:00

High incidence of node failure

Hello everyone,

We have 2 Centera clusters at remote sites that over the last 6 months have seen high occurrences of node failure.

These can be access or storage nodes, and they generally just go offline at random, requiring disk transplants into new node chassis. At the same time they will often trip the left power rails for the rack (not a big deal with dual feeds).

For example, on one 16-node cluster we have had 7 node failures in 6 months, with similar stats on another cluster.

I have raised it with our TAM, who advised that they had identified a hardware fault on a particular node version and were just replacing them "as they die", rather than doing a co-ordinated replacement. This is in line with what the onsite tech told me, who advised this has kept the local FSSs busy in that time.

The replacement nodes have a Model # 100-580-573 A15 and the nodes that are failing are 100-580-711-01 A02.

Is anyone seeing similar things in the field?

January 18th, 2016 03:00

Hi,

Please let me know the serial number of the cluster(s).

I assume you have already contacted our customer service about how to solve this problem (preventive node replacement, tech refresh, etc.), as multiple failures happening at the same time might cause data loss.

January 18th, 2016 21:00

Hi, thanks for the response. I have enquired about pro-active replacement but this hasn't been an option yet.

I'm happy to email you the serial numbers of the clusters if you can please let me know your email address.

Rgds,  Brett

January 19th, 2016 00:00

Thank you, I have emailed the serial numbers to you 

January 19th, 2016 00:00

Sorry, I forgot to add a signature:

Helmut.gotsche@emc.com

January 20th, 2016 16:00

...and another node has died! An access node this time.

8 nodes in 6.5 months. Won't be too long and the whole cluster will have been swapped out one by one.

January 29th, 2016 06:00

Hi Brett@S


We have a Managed Service Offering for Centera. In the different Service Levels we monitor more than 100 of these systems.

There have been many node failures over the past couple of years. These are due to a capacitor in the power supply that blows. The nodes are then replaced with a disk transplant.

I am not aware of a Field Change Order to replace these nodes in all systems. Generally, if you have enough available capacity to handle complete node failures, this approach has worked for our 60+ customers.

But if you happen to have more than one node hard offline at a given time, you'll have to escalate quickly through EMC's 0800 numbers.

We've also seen these power supply failures trip breakers in the rack PDUs, or even breakers further out in the infrastructure. On one occasion a whole system lost power because of a single node failure.

Now that the problem has been identified, I've been told by the support guys that the new nodes come with a power supply without that issue.


Best regards, Holger
