Pull and review logs
Restart one of the servers when this happens to find out if it is an ACPI issue within the OS during the failover
Daniel My
10 Elder
6.2K Posts
May 7th, 2015 14:00
Hello
Have you tested the system running just the secondary CMC?
If the system does not have a functional CMC, several things happen. One effect is that the blades run in a low-power mode to conserve energy and reduce thermal output, since no CMC is present to manage and monitor power and cooling.
Your scenario sounds like the primary CMC is failing for some reason and the secondary CMC is not functioning properly. I would troubleshoot this as being either a CMC or CMC slot issue. Test the secondary CMC in the primary slot to see if it functions normally. If it functions normally then test the system running off just a CMC in the secondary slot. If the system does not function correctly with either CMC in the secondary slot then you likely have an issue with the secondary slot.
Thanks
wdtmatt
4 Posts
May 7th, 2015 15:00
When the CMC fails over to secondary, we are able to log in and manage everything. It appears both CMCs are working properly. Unfortunately our data center is 200 miles away from the office and so it's hard for us to troubleshoot with switching out slots.
So based on the information you gave me, that would mean the secondary CMC may be faulty in some way, and the servers are running in the low-power mode? Even though we can still log into the CMC, it could still be faulty? Is there any way to test the functionality of the CMC?
Daniel My
10 Elder
6.2K Posts
May 7th, 2015 15:00
If everything is registering fine with the CMC and no errors are reported, then no, there is no further test of the CMC itself. Check the CPU frequency within the OS instead. If the CPU frequency is low, the CPU is being throttled. This could be a CMC or ACPI issue, although the blades should not have an issue with ACPI, since the CMC failover shouldn't be detected at the OS level.
If throttling is occurring then CPU performance states should be throttled down. That could be the reason CPU usage is hitting 100%. You should be able to see that from within the OS.
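As a rough illustration of the CPU frequency check described above, here is a minimal Python sketch. It assumes the blades run Linux and expose per-core clocks as `cpu MHz` lines in `/proc/cpuinfo`, which the thread does not state. The function names, the `nominal_mhz` argument, and the 60% threshold are all hypothetical choices for the example. Keep in mind that idle cores can legitimately clock down, so a low reading is mainly meaningful while the server is under load.

```python
import re


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the raw cpuinfo text (assumes a Linux host)."""
    with open(path) as f:
        return f.read()


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core 'cpu MHz' values found in /proc/cpuinfo-style text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]


def throttled_cores(cpuinfo_text, nominal_mhz, threshold=0.6):
    """Return the indexes of cores running below threshold * nominal_mhz.

    nominal_mhz is the rated clock of the CPU model; the caller must supply
    it, and the default threshold is an arbitrary illustrative value.
    """
    limit = nominal_mhz * threshold
    return [i for i, mhz in enumerate(parse_cpu_mhz(cpuinfo_text)) if mhz < limit]
```

On an affected blade, one could call `throttled_cores(read_cpuinfo(), nominal_mhz=2600.0)` while load is high, once during normal operation and once after a failover, and compare the results.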
Thanks
wdtmatt
4 Posts
June 4th, 2015 20:00
So as an update, we had another failover tonight. We had THREE chassis (not in the same rack) that failed over within 10 minutes of each other. In two of those chassis, all servers had load spikes and tipped over. As soon as we failed the CMCs back over, load dropped. Is there some setting we aren't finding somewhere that will disable this? I don't care if the CMCs flop back and forth all day, as long as our MISSION CRITICAL boxes stay alive and well.
Daniel My
10 Elder
6.2K Posts
June 5th, 2015 11:00
You need to review the logs to get an idea of what is going on.
Disable what, disable throttling?
http://www.dell.com/support/Manuals/us/en/19/Topic/dell-cmc-v5.0-m1000e/CMCM1000e5.0_UG-v1/en-us/GUID-4C648550-470A-43B7-A18B-876B50ABA851
No, you cannot disable the throttling that occurs during a CMC failover or when no CMC is present. To resolve the issue, you need to troubleshoot its underlying cause.
You should also try to remember when this issue started occurring and what changes were made at that time. It sounds like you have multiple chassis across multiple racks that are having this issue. Are all of them running the same CMC firmware? What version is the firmware? When all of these chassis experienced the issue within 10 minutes of each other, what was going on? Were they performing a backup, was there a network issue at the site, or was there anything else that could help narrow down the cause?
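The questions above (same firmware everywhere? what happened in that 10-minute window?) can be organized with a small script. The following is only a sketch under stated assumptions: the chassis names, firmware strings, and timestamps are made-up placeholders that would have to be collected by hand from each CMC and its logs, and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_by_firmware(inventory):
    """Map each firmware version to the sorted list of chassis running it.

    inventory: dict of chassis name -> firmware version string (hand-collected).
    """
    groups = defaultdict(list)
    for chassis, version in inventory.items():
        groups[version].append(chassis)
    return {version: sorted(names) for version, names in groups.items()}


def cluster_events(events, window=timedelta(minutes=10)):
    """Group (chassis, timestamp) failover events into time clusters.

    An event joins the current cluster if it falls within `window` of that
    cluster's first event; otherwise it starts a new cluster.
    """
    clusters = []
    for chassis, ts in sorted(events, key=lambda e: e[1]):
        if clusters and ts - clusters[-1][0][1] <= window:
            clusters[-1].append((chassis, ts))
        else:
            clusters.append([(chassis, ts)])
    return clusters


# Placeholder data for illustration only; not taken from the thread.
example_inventory = {"chassis-a": "fw-1", "chassis-b": "fw-2", "chassis-c": "fw-1"}
example_events = [
    ("chassis-a", datetime(2015, 6, 4, 20, 0)),
    ("chassis-b", datetime(2015, 6, 4, 20, 4)),
    ("chassis-c", datetime(2015, 6, 4, 20, 9)),
    ("chassis-a", datetime(2015, 6, 4, 23, 0)),
]
```

More than one key in the `group_by_firmware` result means the chassis are not on a common firmware version, and a cluster from `cluster_events` that spans several chassis points to a shared external trigger worth checking against backup schedules or network events at the site.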
You have two issues to troubleshoot. The first is why the failover is taking place. The second is why throttling continues after the second CMC comes online, assuming that the performance issue on the servers is actually due to throttling. Have you checked the CPU frequency while the systems are having performance issues?
Thanks