100 Posts

January 15th, 2016 15:00

I would post this in the VNX Support Forum, located at: VNX

Cheers~

53 Posts

January 17th, 2016 18:00

Hi Pablo,

It is unfortunate to hear about this scenario with your VNX.  As a field tech with EMC for 10 years, and a specialist in VNX products, I usually see site environmental factors cause this.  I'm not sure whether you have a temperature sensor in front of the VNX DAE to help monitor?  While it seems like it would take a lot of I/O, it is also possible that certain disks are running hot because they are overloaded.  Has anyone performed a performance analysis on your array to determine whether the workload is balanced?  Perhaps higher-I/O LUNs need to be moved to SSDs, etc.  You can also request that VNX engineering perform a failure analysis on components if needed.  I'm not aware of any code bugs for thermal reporting, though we do recommend upgrading to the latest target codes at least twice per year.  The current target code levels for the VNX5200 are File 8.1.8.121 and Block 05.33.008.5.119.

Also, are you set up to receive email alerts from the VNX system?  If so, you will also receive a weekly "array is alive" email notification to confirm that alerts can be sent out.  It could have been a VNX hardware issue that was unable to notify EMC or yourself, so proactive measures could not be taken.  When a system is deployed in ESRS, ESRS monitors the storage processors with a heartbeat signal (like a ping test) to ensure inbound remote connectivity.  However, the individual device alerts going out to EMC are set up in each product.  For VNX you can set up alerts to go through your SMTP server or through the ESRS server, so even if a product is deployed on ESRS, the device alerts still need to be defined separately.  I just wanted to clarify that, as it is often misunderstood.
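As a side note on the SMTP path: this is a generic sketch, not an EMC tool, and the relay hostname and addresses below are placeholders you would replace with your own. A few lines of Python can confirm that the SMTP relay the array's email alerts depend on will actually accept and deliver a test message, which helps separate "the relay is unreachable" from "the array never generated the alert".

    # Generic test of an SMTP relay used for device alerts (placeholder values).
    import smtplib
    from email.message import EmailMessage

    SMTP_RELAY = "smtp.example.com"          # relay configured in the array's alert settings (placeholder)
    FROM_ADDR = "vnx-alerts@example.com"     # sender address configured for alerts (placeholder)
    TO_ADDR = "storage-team@example.com"     # mailbox that should receive alerts (placeholder)

    msg = EmailMessage()
    msg["Subject"] = "VNX alert path test"
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    msg.set_content("Test message to confirm the alert relay is reachable and accepting mail.")

    # Connect on port 25 and hand the message to the relay; an exception here
    # points at a network/relay problem rather than the array itself.
    with smtplib.SMTP(SMTP_RELAY, 25, timeout=10) as smtp:
        smtp.send_message(msg)
    print("Relay accepted the message; check the destination mailbox.")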

Regarding the account administrative issues, I know that can be a problem for customers when it occurs.  We do have local Account Service Engineers (ASE) / Account Service Representatives (ASR) / Customer Service Engineers (CSE) assigned to your account who can assist with this process.  This is also something your EMC District Service Manager can help escalate to resolution.

Ultimately, we recommend working with your local sales and service account teams to update your records and ensure device alerting is set up. Usually this process starts with creating a service request to engage local support, though if there are issues we can also provide some assistance here.

Ryan Brancel

EMC Account Service Engineer

3 Posts

January 18th, 2016 07:00

Ryan, thanks for your response. Yes, we have a temperature sensor (a Netbotz unit) sitting 1 m in front of the rack; it sends alerts whenever we have an overtemperature event. We also have an analog thermostat wired to the EPO input of each of our UPSs, so if the ambient temperature rises above a threshold the UPSs are turned off. We also have dual online UPSs and dual precision AC cooling, all backed by dual diesel generators. The rack is an APC server rack, and the room is rarely accessed physically, as our servers have remote management cards. The high temperature was measured without load (it showed on one system disk and two data disks) while the adjacent drives were at 20°C; this was measured with an EMC representative on site, who also confirmed that the ambient temperature was fine.

What baffles me (apart from the temperature readings), and what I find most inexcusable, is that the VNX shut itself down when it could perfectly well have kept working (we had a spare disk that could be used to recover the failing drives) and only one system drive was showing problems (that drive is mirrored). Also, the other SP was fine, and instead of trespassing the LUNs to it, the array chose to turn off, taking the whole infrastructure down.

The administrative problems we had with EMC deserve a whole new chapter; there was not a single step in our relationship that was trouble free. I lost count of how many times I reported problems and was told they were fixed, only to find out they weren't. I've been contacted about equipment that doesn't belong to us, our contact data was handed to another, unrelated firm, the sites were mangled, and the icing on the cake is that the ESRS we were assigned belonged to another firm (so our VNX was never proactively monitored). Two weeks ago an EMC representative showed up to configure ESRS, but I was not told about the device alerts, nor do I receive the weekly "array is alive" emails (the test emails worked fine at the time).

Unfortunately, the local sales and service reps move at a speed that does not suit our needs; it has been nearly a month since we had this problem (and opened an SR) and two weeks since the parts were sent for failure analysis, and we don't even have an ETA on when we will get answers from EMC. Meanwhile we are forced to use alternative equipment which is not really suitable for our needs.

Pablo

3 Posts

January 26th, 2016 07:00

Hi Javier, we hope to receive a timely answer to all the concerns we voiced at the meeting, and fulfillment of all the commitments you made. As of now we are still waiting for a timetable for the failure analysis.

Yours

Pablo
