We have many Dell M1000e blade chassis at multiple locations that experience CMC failovers, as evidenced by this error: "CMC redundancy is lost." We have Brocade M5424 switches on C1 and C2, Cisco Catalyst WS-CBS3130X-S switches on A1, A2, B1, and B2, and redundant CMCs in each chassis. The CMCs are on firmware version 3.21, and there is a mix of Dell blade models in the chassis. We use 3PAR storage. These failovers seemingly happen at random and can cause storage or networking to go offline at the time of the failover. Has anyone else experienced this? We've been working with support and making little progress. Thanks!
We are seeing the exact same thing. We are on firmware 3.10 but all other symptoms are exactly the same. We are running KR pass-through with Brocade BR1741m-k mezzanine CNA cards to Brocade VDX switches. Any thoughts on this?
What type of storage do you use? We have several models of 3PAR arrays.
Do you have a case open with Dell for this issue? We are a large Dell customer and have had a case open with them for months. They state they have not seen this issue anywhere else (essentially blaming our environment without being able to find a specific cause). If you have a case open, or could open one and provide the case number, that would help us in further escalating this issue with Dell. Feel free to PM me if you are not comfortable posting publicly. I can also share our case number if you would like. Thanks!
I'll give a quick update. We did open a case, and they really just looked at the vCenter logs and CMC logs and didn't come up with a lot. I had this happen again over the weekend and noticed that only my ESXi 4.1 hosts lost storage/network; my ESXi 5 hosts were fine. I forced a failover to test and got the same results. I upgraded another cluster to vSphere 5 on blades that were in one of the chassis and tested again, and they did NOT lose network or storage. It may have been a driver issue, but I was able to repeat the process, and I am getting no storage/network loss on any version 5 hosts in a CMC failover/heartbeat failure. Support did tell me that CMC versions 3.x had a problem with false-positive heartbeat failures. It's not the root cause, but it does trigger the issue we have been seeing. I upgraded two of my chassis to CMC version 4.0 to see if we get any more false positives, even though I am now upgraded to ESXi 5 across the board.
Just another thing to compare notes on: what mezzanine cards are you using in fabrics B and C? We are using Brocade BR1741m-k CNAs in both fabrics.