Equallogic single point of failure?

Question

Hello, if one SAN fails, I do not want any interruption of service. In other words, how can I make the SAN behave like DRBD master-master? Thanks

Joe S586 · Answer

Hi, I’m Joe from Dell EqualLogic. Dell EqualLogic supports replication to another PS Array group: you can copy a volume of data from one group to another, protecting the data from a variety of failures, ranging from the destruction of a volume to a complete site disaster, with no effect on data availability or performance. Essentially, this is a block-by-block copy of the changes in one volume to another location. However, the replicated volumes at the remote site are not directly accessible; they need to be brought “on line”. To do that, you either clone the replica to create a new volume on the secondary group, or promote the replica set to a recovery volume (and snapshots) on the secondary group, and then configure the iSCSI initiators to connect to that volume.

You can get a better description if you download the Group Administration Guide from the www.equallogic.com/support web site (support account required). Look for the Guide in the section for the firmware currently running on your array.

Regards,
Joe

Zeppi · Answer

Hello Joe,

Yes, I understand, and that's the problem. It means I cannot centralize everything on this SAN, because if something goes wrong it will take me a day or two to restart each machine and verify that all of them are healthy.

Imagine you set up a cluster across multiple machines for high availability, and when the SAN goes down the whole cluster goes down with it. That is not very useful.

Regards,
Giuseppe

Joe S586 · Answer

Giuseppe,

I’m sure you are aware that the EqualLogic array does offer enterprise-class protection, i.e., redundant power supplies, controllers, and disks; RAID protection of the data; hot-swappable components; enhanced RAID rebuild; fault isolation; etc. So the redundancy of the Peer Storage Array design does eliminate single points of failure, helping to provide greater than 99.999 percent availability. This is of course only true if the data center site is operational, i.e., switches, power, cooling, physical access, etc.
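For a sense of scale, the 99.999 percent figure can be turned into a yearly downtime budget. The sketch below is plain arithmetic, not anything from Dell's documentation; the function name is hypothetical and a 365-day year is assumed. Under those assumptions, 99.999 percent works out to roughly 5.26 minutes of downtime per year.

```python
# Rough arithmetic only: converts an availability percentage into the
# downtime it allows per year. A 365-day year is assumed.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes/year")
```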

So in the case of DRBD, if the availability of a single array group is less than you require, you might consider two array groups: one PS Group to handle the “Service A” volumes, then build the DRBD mirror to the other PS Group, “Service B”, to mirror the volumes.

If you have a specific instance (product link, etc.), I can look at it and may be able to offer other solutions or recommendations.

Regards,
Joe

Zeppi · Answer

Hello,

You're right, the EqualLogic is certainly reliable. As it happens, I built an Oracle cluster, and I realized that I need two EqualLogic arrays when I wanted to upgrade the EqualLogic firmware. In fact, I had to reboot the EqualLogic as well as the cluster.

So I wondered how I could ensure continuity of service for my Oracle cluster.

Now that I know the EqualLogic must restart after an update, I think that with a second EqualLogic I could replicate the data from SAN A to SAN B in order to update A. Then it is B's turn.

The problem is that I could then not use more than half the capacity of the SAN.
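The capacity cost mentioned above can be put into numbers. This is a minimal sketch that assumes two arrays with every volume kept in full on both; the function names and the 10 TB figures are made up for illustration and are not taken from any EqualLogic tool.

```python
# Illustrative arithmetic: usable capacity when every volume is kept
# in full on both of two arrays (a full mirror or full replica).
def mirrored_usable_tb(array_a_tb: float, array_b_tb: float) -> float:
    """Usable capacity is limited by the smaller of the two arrays."""
    return min(array_a_tb, array_b_tb)


def usable_fraction(array_a_tb: float, array_b_tb: float) -> float:
    """Share of the total raw capacity that holds unique data."""
    return mirrored_usable_tb(array_a_tb, array_b_tb) / (array_a_tb + array_b_tb)


if __name__ == "__main__":
    # Two hypothetical 10 TB arrays: 10 TB usable out of 20 TB raw.
    print(mirrored_usable_tb(10, 10), usable_fraction(10, 10))
```

With two equal arrays the usable fraction is exactly one half, which matches the concern raised in the post.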

Regards

Joe S586 · Answer

Giuseppe,

I understand your concern; however, the restart during the firmware update should only take about 15-20 seconds. If the host server iSCSI initiator “disk timeouts” are set properly, then there should be no disruption of service (from the server’s point of view). The latest release notes for the array’s firmware have the recommended iSCSI timeout values for each host type.
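One way to sanity-check a host setting against that restart window is sketched below. It compares a configured disk timeout with the 15-20 second restart plus a safety margin. The function name, the 10-second margin, and the example timeout values are illustrative assumptions, not Dell recommendations; the actual values for each host type are in the firmware release notes.

```python
# Hypothetical helper: does a host's disk timeout outlast the array restart?
# The margin and the example timeouts are assumptions for illustration only.
def rides_through_restart(disk_timeout_s: float,
                          restart_window_s: float = 20.0,
                          safety_margin_s: float = 10.0) -> bool:
    """True if I/O can wait out the restart instead of failing to the application."""
    return disk_timeout_s >= restart_window_s + safety_margin_s


if __name__ == "__main__":
    for timeout in (10, 30, 60):
        verdict = "ok" if rides_through_restart(timeout) else "too short"
        print(f"disk timeout {timeout}s: {verdict}")
```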

Regards,
Joe

seboulba · Answer

Hi Joe and Giuseppe,

I found your discussion while searching Google for HA solutions for our Citrix Xen-based cloud of servers.

The first solution that came up was the one using DRBD master:master on iSCSI targets, on separate SANs (FreeBSD + ZFS + iSCSI).
That way, if a SAN goes down or must be shut down for a hardware upgrade, the service is still available and the virtual machines (around 500) don't need to be restarted, checked, etc.

Then we heard a lot of good things about the EqualLogic storage arrays and their integration with Citrix Xen. So we thought about acquiring two of them.

After some research we also found out that it is not possible to have real-time replication between the arrays, bring one unit down, and have the service remain uninterrupted. And it's kind of a big issue for us too.

Isn't it possible to do this with EqualLogic arrays? I thought that high availability means that you can bring a whole array down without service interruption?

Any ideas?

Thanks for your help.

Joe S586 · Answer

seboulba,

From my understanding, with DRBD set up properly and connected to two separate array groups, you should be able to bring one half of the DRBD mirror down for maintenance without disruption of service. DRBD is configured on the servers, not the array.

Regards,
Joe

SirStan · Answer

Some terminology might be helpful here. Equallogic embeds the controllers, filer, and disk shelves into one unit. The controllers are active/passive, meaning only one controller is active at any time. The filer itself is tied directly to the disks. Other vendors handle this in different ways; this is the design Dell has chosen for the Equallogic system.

Some vendors implement "raid" across the filers themselves (HP LeftHand's network raid). Other vendors offer active/active controllers, or NetApp metrocluster functionality. Dell Equallogic does not.

We operate three Equallogic arrays in production use, and have never suffered a controller failure. When we perform firmware updates the unit reboots twice, taking it offline for 15 seconds. We do this during 'quiet' activity hours on our VMware, SQL Server, and Exchange clusters. They seem to handle the 15 second downtime without issue.

We have not seen the Active/Passive controller layout of the Dell Equallogic as a negative. The failure of an entire Equallogic filer (both controllers and both power supplies) is extremely rare. There are no shared components between the controllers; they are functionally separate filers. The unit is right-sized for our organization and provides enterprise functionality at a fraction of the cost of a similar product that would allow Active/Active enterprise-level controllers, or Metrocluster functionality.

In summary:
> Dell Equallogic does not allow Active/Active Controllers, or full 'raid' between discrete units.
> Dell does not offer 'Metrocluster' or 'Network Raid' functionality like DRBD.
> Reboots of the entire SAN take 15 seconds (yes, really, as a customer, not Dell marketing) and do not cause any issue for us.

bennice2002 · Answer

We have now implemented close to 30 Equallogic arrays in our environment and have been a customer for the past few years. The question about clustering comes up a lot when explaining the high availability nature of these arrays, and comparisons are often made with the HP/LeftHand SANs (which we also have a large number of). I've only had one issue with an Equallogic array going offline, and the issue was with an older firmware version. Other than that, the Equallogic arrays have been solid, more so than our LeftHand arrays. Despite the logic behind network RAID, it's not infallible, and we've experienced complete SAN outages due to bugs in their software too. And the fact that LeftHand runs on some of HP's lowest-end hardware all but guarantees that you will experience some significant hardware failures over time, so 3 arrays is pretty much a minimum requirement for HA.

I would like it if Dell/Equallogic would introduce some synchronous replication between arrays, as it would definitely help us in many situations. But with dual controllers, power supplies, etc., the units have adequate protection against failure for most situations.

As SirStan posted, the 15 second blip between controller restarts is about the maximum downtime you can expect, and having your iSCSI timeout values set correctly is very important, especially for MSSQL or Exchange.

Mattrst · Answer

I run an environment of 22 EqualLogic units and use storage foundation in various clusters when extra resilience is needed.

Obviously sync write software will cost twice the capacity, but I would recommend it to any sysadmin.

JOHNADCO · Answer

'Hello, You're right, the EqualLogic is certainly reliable. As it happens, I built an Oracle cluster, and I realized that I need two EqualLogic arrays when I wanted to upgrade the EqualLogic firmware. In fact, I had to reboot the EqualLogic as well as the cluster. So I wondered how I could ensure continuity of service for my Oracle cluster. Now that I know the EqualLogic must restart after an update, I think that with a second EqualLogic I could replicate the data from SAN A to SAN B in order to update A. Then it is B's turn. The problem is that I could then not use more than half the capacity of the SAN. Regards'

FalconStor, DataCore, and Reldata sell true storage virtualization products that can front your SANs and make this happen. It looks like we may actually purchase the Reldata product. They gave us a unit for a couple of months, and the testing showed they were indeed impressive units for the money. The biggie here with Reldata for us? Unlimited storage management/virtualization is included with the standard license. Third-party storage device support seemed incredible.

