Storage Controller reboot and CPU utilization

Question

I have a 2-Xbrick configuration. EMC support had to reboot X1-SC2, and when they did I saw a very large spike in CPU utilization on the other controller for that X-brick. Questions:

1) If X1-SC1 is simply taking over the workload for X1-SC2, I would expect its CPU to go up to maybe 30%. Why does it reach 90%?

2) When X1-SC2 is back online, why is its CPU at 50% for the next 20 minutes?

3) Could this big CPU spike cause an outage for systems connected to that X-brick?

4) What do E1 and E2 stand for?

Thank you

ChrisPy2 · Accepted Answer

I know this is a bit old, but I figured I would provide an answer in case someone else sees this.

E1 and E2 stand for the xenvs, the processes that run the data modules that handle the IO flow.

When a module fails over, whether because of a reboot or failure of another storage controller or a failure at the module level, the remaining modules have to proactively load the journals and metadata from the down modules. You see the spike to 90+% because we allow the module to use as much CPU as it needs to load them as quickly as possible. Once the load has finished, the CPU utilization levels off. When we fail back after the module comes back online, we have to go through the same load again, which causes another CPU spike.

These spikes do cause performance degradation on the array but should last 20 minutes at most. In 4.x code, this can be monitored with the 'show-data-protection-groups' command. Typically hosts will see only a small latency spike, but if the array is heavily utilized at the time of the failover or failback and does not have enough resources to cover the metadata loading, hosts will see increased latency.
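If you export CPU samples from your own monitoring, a small script can confirm whether a spike stayed inside the 20-minute window mentioned above. This is only a rough sketch: the sample data, the 90% threshold, and the `spike_windows` helper are made up for illustration and are not part of any XtremIO tool or its output format.

```python
from datetime import datetime, timedelta

def spike_windows(samples, threshold=90.0):
    """Return (start, end) pairs for each run of consecutive samples at or above threshold.

    samples: iterable of (timestamp, cpu_percent) tuples in time order (hypothetical format).
    """
    windows = []
    start = prev = None
    for ts, cpu in samples:
        if cpu >= threshold:
            if start is None:
                start = ts          # first sample of a new spike
            prev = ts               # last sample seen inside the spike
        elif start is not None:
            windows.append((start, prev))
            start = None
    if start is not None:           # spike still in progress at the end of the data
        windows.append((start, prev))
    return windows

# Made-up samples, one per minute
t0 = datetime(2024, 1, 1, 10, 0)
cpu = [25, 30, 92, 95, 97, 94, 91, 40, 28]
samples = [(t0 + timedelta(minutes=i), c) for i, c in enumerate(cpu)]

for start, end in spike_windows(samples):
    duration = end - start
    if duration > timedelta(minutes=20):
        note = "longer than expected"
    else:
        note = "within the expected window"
    print(start.time(), "to", end.time(), duration, note)
```

A spike that runs well past 20 minutes, or one that lines up with host latency complaints, would be worth raising with support.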

dynamox · Answer

Thank you Chris

XtremIO
