Simulation of possible failure scenarios

Sometimes we need to demonstrate that the EMC Isilon properties which EMC advertises, and which we of course 'sell', actually hold.

For example: if we sell an Isilon cluster with the property that a protection scheme of +2n withstands the failure of either two hard disk drives or two complete nodes without data loss, then we have to prove that statement. Most customers take it on trust and don't ask us to show them. But there are always customers who want to see such features demonstrated as part of compliance tests.

Most failure scenarios can easily be shown or reproduced, for example:

  • link aggregation >> disconnecting network cables
  • dual power supply >> disconnecting power cords
  • InfiniBand redundancy >> disconnecting InfiniBand cables or powering off one IB switch
  • SmartConnect features >> mounting several times from one client against the same SmartConnect zone and showing how the connections are distributed across the nodes
  • etc.

However, some scenarios are hard to simulate.

How could we simulate an HDD failure? EMC advised against simply pulling HDDs out of the chassis and putting them back later. I think at the very least we would have to wait until the FlexProtect job has finished and format the disk before putting it back into the cluster.

The same goes for a node failure. Is just disconnecting the IB cables enough? Or would it be better to shut the node down and show the customer that all data is still there and accessible with one node off?
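
To back up the "all data is still there" part with something more than a directory listing, a checksum comparison from a client could help. This is only a minimal sketch, not an EMC-provided procedure: it uses nothing Isilon-specific, the function names are made up for illustration, and it assumes the cluster export is mounted on the client and contains a set of test files that nobody changes during the test.

```python
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Return {relative path: SHA-256 hex digest} for every file under root."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large test files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[str(path.relative_to(root))] = h.hexdigest()
    return digests


def compare_snapshots(before, after):
    """Return (missing, changed): files gone or altered since 'before'."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed
```

The idea would be to call checksum_tree() on the mounted test directory (for example a hypothetical /mnt/isilon/testdata) before the node is taken offline, call it again while the node is down, and pass both results to compare_snapshots(). Two empty lists mean every test file was still readable and unchanged. It only proves something about the files it actually read, of course.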

Any ideas or experiences with this kind of compliance test?

We don't want real data loss caused by a strange compliance test. If a test carries that risk, the customer will have to trust what EMC claims.



Re: Simulation of possible failure scenarios

Hi Phil,

Yes, that's hard to simulate. I have no idea for the disks (if pulling a disk out is not an option), but you could simulate the node failure by disconnecting the power cable. That would come closest to a real failure, but a shutdown will be a lot safer and should be "comparable".

No guarantee that you won't damage the node with the power loss, though.

We did those "hard tests" and the Isilon survived...


-- sluetze

Re: Simulation of possible failure scenarios

Thank you for your hints.

Has anybody else done tests of the "Isilon high availability promise"?

I need to know which tests are safe to do, and which tests I had better avoid so as not to risk data loss or other failures.

If EMC claims certain properties and functions for its clusters, it should be able to describe test scenarios for verifying them.

