Yesterday I had a good exercise in a customer environment: a complete power-down of the building.
UPS attached to PDU 1
Public power supply on PDU 2
EMC VNX DPE attached to PDU 1 and PDU 2
DAEs attached to PDU 1 and PDU 2
So we completely cut the power from the public supply:
- Each DAE lost power on one power supply
- The DPE lost power on one power supply
The system ran on UPS battery for 30 minutes. After 30 minutes the UPS battery was empty and the DAEs shut down immediately. The DPE ran on its own battery for something like 30 seconds, then it shut down as well.
All ESXi hosts and their VMs had already shut down before the EMC array powered down.
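To make the sequence above easier to follow, here is a rough timeline sketch in Python. It is only an illustration: the function and variable names are made up for this post, and the timings are the approximate values I observed (about 30 minutes of UPS, roughly 30 seconds of DPE battery).

```python
# Illustrative model only: names are hypothetical, timings are the approximate
# values reported in this post, not vendor specifications.
UPS_RUNTIME_S = 30 * 60   # UPS kept PDU 1 alive for about 30 minutes
DPE_BATTERY_S = 30        # DPE ran on its internal battery for roughly 30 seconds


def power_off_times(ups_runtime_s=UPS_RUNTIME_S, dpe_battery_s=DPE_BATTERY_S):
    """Return seconds after the public outage at which each enclosure type powers off."""
    # PDU 2 (public supply) fails at t=0; PDU 1 stays up until the UPS is empty.
    pdu1_off = ups_runtime_s
    return {
        "DAE": pdu1_off,                  # no internal battery: off when the last feed is lost
        "DPE": pdu1_off + dpe_battery_s,  # internal battery bridges a short window
    }


times = power_off_times()
for name, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{name} powers off at t={t}s")
```

With these numbers the DAEs drop about 30 seconds before the DPE, which is the order we saw.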
I always thought that when you need to power down, the DPE has to go down first so that no more data is accepted, and that on power-up the DAEs have to start first so that they are ready for data from the DPE later on.
What we saw in this setup was that the DAEs went down first and then the DPE. By my logic this is bad, because the DPE may still want to write data when the DAEs are already gone.
Here that did not happen because all the hosts were already down.
1) Is my first statement about the correct order for shutdown and power-up true?
2) Is our setup wrong or OK?
3) Is there any way to achieve my goal (DPE shuts down first, then the DAEs) with our setup or another setup?
4) Or is the way it went just the right way (DAEs shut down first, then the DPE)?
Looking at your description of the event, everything worked as expected. Keep in mind that the battery inside the VNX itself is not a UPS. Its only purpose is to provide enough power to save the contents of the write cache to the vault drives. Once this is done and power is not restored, the DPE powers off. From your description, the write cache had already been disabled due to the loss of power on one side, which means there were no cached I/Os left in memory.
Check this KB article: "Which events will cause VNX MCx to disable write caching?" (Article Number 000442868)
The DPE has battery backup to permit it to destage any uncommitted writes to areas on the vault drives. That is how data integrity is ensured...or at least given its best chance...
When you bring the array back up, power on all the DAEs, then the DPE. During its startup, FLARE will 'find' the uncommitted updates and attempt to complete them.
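If you keep the power-up order in a written runbook, a tiny check like the one below can catch a DPE listed before a DAE. This is just a sketch: the function name and the "DAE-1" / "DPE" labels are made up for illustration, and it only checks a list of names, it does not talk to the array.

```python
def power_on_order_ok(sequence):
    """Return True if every DAE in `sequence` is listed before the first DPE."""
    seen_dpe = False
    for enclosure in sequence:
        kind = enclosure.split("-")[0].upper()  # "DAE-1" -> "DAE" (hypothetical labels)
        if kind == "DPE":
            seen_dpe = True
        elif kind == "DAE" and seen_dpe:
            return False  # a DAE would be powered on after the DPE
    return True


print(power_on_order_ok(["DAE-1", "DAE-2", "DPE"]))  # DAEs first, then DPE
print(power_on_order_ok(["DPE", "DAE-1"]))           # DPE listed too early
```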