Hello VxRAIL Experts:
Two weeks ago, the data center of one of our customers suffered a power outage. They have a VxRail with two appliances and four Quanta nodes. Of course, the entire VxRail solution (nodes, appliances, and switches) was abruptly powered off due to the power outage.
When electrical power was restored, the nodes booted up fine, but once in VMware, the vSAN was completely broken. After an SR was opened and a week of support assistance, the only option in the end was to reinstall the VxRail from scratch.
Could you please tell me whether there is any procedure or document on how to handle power outages with VxRail? Or is this simply something that could happen again?
This is a really scary scenario. Were you able to recover the data on the VSAN?
Rebuilding the cluster seems to be the default for any issues experienced on the VXRails.
We bought a 2-appliance, 3-node Quanta system which literally took 10 months to get physically working. This included replacing the entire hardware bill of materials and re-imaging the hardware almost 4 times, then rebuilding from scratch.
Eventually the configuration got to a point where the setup was complete and everything showed healthy. When we tried to integrate Active Directory, we were told the entire system had to be reset from scratch as AD needed to be set up from the initialization page and could not be changed later on. This was done, bringing the entire build up to 12 months now.
We have not migrated any systems to the VXRail yet, but once we do, the scenario above would not be acceptable and is a huge concern.
I would really appreciate it if you could let me know what was recovered and how it was done. If you do not want to post here, please let me know and I will give you an email address to send a private message.
Unfortunately, nothing could be recovered. Fortunately, the VxRail at our customer is the recovery site for a VMware production site with RP4VM, so what we lost was all the replication done so far (approx. 22 TB across 35 VMs). These days we are replicating again in order to get everything back to how it was before the power outage. The bad part is that our customer is very concerned about the "stability" of the VxRail, and so are we.
@OctavioGM - I can say from planned experience with loss of power that recovery for HCI is going to be painful, and I would not be surprised if it were fatal.
Prior to my VxRail deployment I led a proof of concept and bake-off between Nutanix and VxRail: full installations with test VMs. One of my critical POC tests was full loss of power, simply pulling the power cords out of the Nutanix and VxRail appliances.
Both brands suffered tremendously! Nutanix did not recover at all, total loss of use. VxRail or vSAN eventually recovered but with damage - hours to become functional again without intervention. My POC conclusion to management was "Plan for the worst, hope for the best". These data-center-worthy systems need more catastrophe-proof engineering.
Converged infrastructure technology has many years to go before it is fully mature and resilient to the unexpected, such as uncontrolled loss of power.
You are not alone, my friend. Plan well for the unexpected.
As an IT professional who is in the process of migrating into a 7 node VXRail E460F cluster as my primary data center, this is extremely concerning.
Would those troubleshooting steps apply 100% to an all-hosts-down scenario on a VxRail cluster?
Power conditioning and redundancy are very important for VxRail, and they are only a few of the many risks that should be assessed.
I don't know how many appliances you are deploying, but one is too risky. I have 2: one is production, the other is backup or fail-over. I have RecoverPoint for VMs deployed; it is helpful but not the end-game. A better option is to learn more about vSAN stretched clusters.
As many going with HCI do, piling dozens of VMs onto one appliance without considering the risk of a complete failure, and without a live secondary solution, is a huge mistake. Data centers do fail, planned and unplanned.
If you deploy at least 3 VxRails as 1 cluster, some of the anxiety is relieved. Having all Rails in the same physical space calls for a business continuity review.
Thank you, Keith, for your comments. I must say that a second VxRail as a recovery option should be considered. This week I have a meeting with our presales personnel to evaluate the best recovery options and avoid having only one solution at the customer for production environments. Thank you, everyone, for all your comments and answers.