PowerStore 500T: CMI self-protect mechanism may lead to a short service disruption
Summary: On PowerStore 500T, the CMI self-protect mechanism may lead to an auto-recovered Data Unavailable (DU) condition.
Symptoms
This issue only affects PowerStore 500T running PowerStoreOS version 3.2.1 (or later).
If a PowerStore node stops responding, the CMI link is disabled. This leads to a node reboot, and in rare cases the peer node may also panic unexpectedly, leading to a short, auto-recovered DU.
Cause
PowerStore 500T uses RDMA for internal communication between its nodes. This communication runs over the CMI link (PCIe NTB links).
On releases prior to version 3.2.1, a node that encountered a CPU IERR (CATERR) could propagate the CPU IERR to the peer node, potentially leading to data loss (DL). For more information, see KB# 000213516 PowerStore 500T: Hardware failure may propagate to peer node leading to service disruption.
To fix that issue, a Watchdog is implemented in the BMC. When the Watchdog expires, the BMC cuts the CMI links between the two nodes. This causes one of the nodes to be fenced and rebooted.
The other node may sometimes experience starvation (an inability to access resources) because the remote memory on the peer node has been cut off.
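The Watchdog behavior described above can be illustrated with a minimal conceptual sketch. This is not the BMC firmware or any PowerStore interface. The class name, the heartbeat model, the 5-second timeout, and the cut_link callback are all hypothetical, chosen only to show the general pattern of a watchdog that cuts a link once a node stops responding.

```python
class CmiWatchdog:
    """Conceptual model only; names and timeout values are hypothetical."""

    def __init__(self, timeout_s, cut_link):
        self.timeout_s = timeout_s      # assumed expiry interval
        self.cut_link = cut_link        # callback standing in for "BMC cuts the CMI links"
        self.last_heartbeat = 0.0
        self.link_up = True

    def heartbeat(self, now):
        # A responsive node keeps resetting the timer.
        if self.link_up:
            self.last_heartbeat = now

    def poll(self, now):
        # On expiry, cut the link exactly once; it stays down afterwards.
        if self.link_up and now - self.last_heartbeat >= self.timeout_s:
            self.link_up = False
            self.cut_link()
        return self.link_up


def simulate():
    events = []
    wd = CmiWatchdog(timeout_s=5.0, cut_link=lambda: events.append("cmi_link_cut"))
    wd.heartbeat(now=0.0)
    wd.heartbeat(now=3.0)   # last sign of life from the node
    wd.poll(now=4.0)        # still within the timeout
    wd.poll(now=7.9)        # 4.9 s of silence, not yet expired
    wd.poll(now=8.0)        # 5.0 s of silence, watchdog expires
    wd.poll(now=9.0)        # link already cut, no second event
    return events


if __name__ == "__main__":
    print(simulate())
```

In this model, the link is cut a fixed interval after the last heartbeat. Any work on the surviving node that still depends on the peer's memory at that moment loses access to it, which corresponds to the starvation described above.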
Resolution
None. There is no workaround to avoid the service disruption.