PowerFlex VMware Replication Causes High CPU Utilization And IO Errors
Summary: During the initial replication of VMs with VMware Replication on a PowerFlex cluster using PowerFlex SDCs, the ESXi host experiences high CPU utilization and IO errors.
Symptoms
- VMware Replication 8.4 and below
- Initial replication on a VM or VMs
- Replicated VM has many VMDK disks (15+)
- High CPU utilization on the ESXi host where the VM is hosted when replication begins.
- Latency on volumes mapped from the PowerFlex cluster increases to 20-30 ms, possibly more.
- Other VMs on the same host that are not being replicated may see decreased performance and/or IO errors from the application perspective.
- A view of disk queues with "esxtop" shows that the host is queuing IO calls to the backend volumes.
- The backend components (MDM/SDS) are healthy and do not show any performance issues or errors.
- ESXi host with replicating VMs has these messages shortly after replication begins:
2021-05-19T17:58:08.413Z cpu70:2098596)WARNING: ScsiDeviceIO: 1564: Device eui.1309fbc714390806ba291d4e0000001b performance has deteriorated. I/O latency increased from average value of 796 microseconds to 25965 microseconds.
2021-05-19T17:58:10.048Z cpu70:2098596)WARNING: ScsiDeviceIO: 1564: Device eui.1309fbc714390806ba2944570000005d performance has deteriorated. I/O latency increased from average value of 799 microseconds to 26019 microseconds.
2021-05-19T17:58:12.060Z cpu70:2098596)WARNING: ScsiDeviceIO: 1564: Device eui.1309fbc714390806ba291d3d0000000a performance has deteriorated. I/O latency increased from average value of 676 microseconds to 23641 microseconds.
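To check how many devices are affected and by how much, the warnings above can be pulled out of a saved copy of the host log. The following is a minimal Python sketch, not a Dell or VMware tool: the function name parse_latency_warnings and the regular expression are illustrative assumptions based only on the message format shown above, and the pattern may need adjusting if other ESXi builds word the message differently.

```python
import re

# Assumed pattern, derived only from the ScsiDeviceIO warning format shown above.
LATENCY_WARNING = re.compile(
    r"^(?P<timestamp>\S+) .*ScsiDeviceIO: \d+: Device (?P<device>\S+) "
    r"performance has deteriorated\. I/O latency increased from average value of "
    r"(?P<before_us>\d+) microseconds to (?P<after_us>\d+) microseconds"
)


def parse_latency_warnings(lines):
    """Return one dict per latency warning found in the given log lines."""
    warnings = []
    for line in lines:
        match = LATENCY_WARNING.match(line.strip())
        if match is None:
            continue  # not a latency warning, skip it
        before_us = int(match.group("before_us"))
        after_us = int(match.group("after_us"))
        warnings.append(
            {
                "timestamp": match.group("timestamp"),
                "device": match.group("device"),
                "before_us": before_us,
                "after_us": after_us,
                # Convert to milliseconds to compare with the 20-30 ms symptom.
                "after_ms": after_us / 1000.0,
            }
        )
    return warnings


if __name__ == "__main__":
    # Sample line copied from the log excerpt above.
    sample = [
        "2021-05-19T17:58:08.413Z cpu70:2098596)WARNING: ScsiDeviceIO: 1564: "
        "Device eui.1309fbc714390806ba291d4e0000001b performance has deteriorated. "
        "I/O latency increased from average value of 796 microseconds to 25965 microseconds.",
    ]
    for w in parse_latency_warnings(sample):
        print(f'{w["timestamp"]} {w["device"]} {w["before_us"]} us -> {w["after_ms"]:.1f} ms')
```

Running the same function over log lines from a host that is not replicating gives a baseline for comparison.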
Impact
Performance degradation and IO errors from the application perspective
Cause
During the initial replication of a VM, VMware Replication computes a checksum of every block on each .vmdk disk configured for the VM. The checksum IO is sent through a single thread on the ESXi host, so it is serialized. Because this thread also handles other IO on the host, the checksum load causes abnormal CPU utilization and disk latency, which in turn slows down other VMs on the same host.
Resolution
VMware is fixing this in a later version of VMware Replication. The version is still TBD.