VPLEX: Severe performance issue due to aged lost pings causing a WAN-COM path to be discarded.
Summary: This article describes an issue in the new UDCOM feature: the lost-ping count that UDCOM tracks is not aged properly, so the affected path is not used for I/O traffic.
Symptoms
In GeoSynchrony 6.0.x, VPLEX introduced a new layer called UDCOM, which sends pings to peer directors every 100 ms. For every lost ping, an attribute called "loss" is incremented.
The lost-ping count (the "loss" attribute), as reported in the debugTowerDump logs, is used by the UDCOM path selection algorithm to calculate the path score. The path with the lowest score is chosen for data transmission.
Lost UDCOM pings are not aged correctly, so the affected path is not used for transmission and only half of the available WAN-COM paths are used.
With only half of the available WAN-COM paths in use, VPLEX can exceed the bandwidth limits of the remaining WAN-COM paths, causing severe latency that impacts remote writes for distributed devices.
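The effect of a loss counter that is never aged can be illustrated with a toy model. The sketch below is not the UDCOM implementation: the Path class, the score formula (score equals the loss count), the aging step, and the rule that traffic uses every path sharing the lowest score are all assumptions made for illustration. It shows only the general mechanism described above, where a path that lost pings in a brief burst keeps a higher score indefinitely unless the counter decays.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical stand-in for a WAN-COM path; not a VPLEX data structure."""
    name: str
    loss: int = 0  # lost-ping counter, loosely modelled on the "loss" attribute

    def score(self) -> int:
        # Assumption: the score is simply the loss count. The real formula
        # is not given in this article.
        return self.loss


def age(path: Path, decay: int = 1) -> None:
    """Reduce the loss counter over time (the step that fails in the defect)."""
    path.loss = max(0, path.loss - decay)


def usable_paths(paths: list[Path]) -> list[Path]:
    """Return every path that shares the lowest score."""
    best = min(p.score() for p in paths)
    return [p for p in paths if p.score() == best]


def ticks_path_b_is_used(aging: bool, ticks: int = 50) -> int:
    """Count the ticks in which path-B is eligible for traffic."""
    a, b = Path("path-A"), Path("path-B")
    b.loss = 10  # path-B lost 10 pings in a short burst, then recovered
    used = 0
    for _ in range(ticks):
        if aging:
            age(a)
            age(b)
        if b in usable_paths([a, b]):
            used += 1
    return used


if __name__ == "__main__":
    print("without aging:", ticks_path_b_is_used(aging=False))
    print("with aging:   ", ticks_path_b_is_used(aging=True))
```

In this model, path-B is never selected again when aging is disabled, even though the link has recovered, which leaves all traffic on path-A. With aging enabled, path-B becomes eligible again once its counter has decayed to zero.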
Cause
Resolution
Permanent Fix:
The issue is permanently fixed in GeoSynchrony 6.0 Service Pack 1 Patch 4 (6.0.1.04.00.09) and later.
How to identify whether this issue is occurring:
From the VPLEX GUI Monitor tab, click Performance and check whether the WAN tab is present. If it is not, click the '+' symbol after the last tab listed, then find and click 'Add WAN Dashboard' to add the WAN tab. Check whether the WAN latency graph shows spikes of 5 ms or more. Occasional spikes are acceptable. If the spikes are constant, or the graph reaches 5 ms or more and stays there, this indicates a problem.
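The distinction between occasional and sustained spikes can be sketched in code for latency samples that have been collected separately. This is only an illustration: the function name, the input format (a list of latency values in milliseconds), and the 50% cutoff used to decide that spikes are "constant" are assumptions, not values from VPLEX or Dell EMC. Only the 5 ms threshold comes from this article.

```python
def classify_wan_latency(samples_ms, threshold_ms=5.0, sustained_fraction=0.5):
    """Classify WAN latency samples as normal, occasional, or sustained.

    threshold_ms: the 5 ms level mentioned in this article.
    sustained_fraction: hypothetical cutoff; if at least this share of the
        samples is at or above the threshold, treat the latency as sustained.
    """
    if not samples_ms:
        return "no-data"

    # Count the samples at or above the threshold.
    over = sum(1 for s in samples_ms if s >= threshold_ms)

    if over == 0:
        return "normal"
    if over / len(samples_ms) >= sustained_fraction:
        return "sustained"
    return "occasional"


if __name__ == "__main__":
    print(classify_wan_latency([1.2, 1.4, 7.0, 1.3, 1.1, 1.2]))
    print(classify_wan_latency([6.1, 7.4, 5.0, 8.2, 2.0]))
```

A "sustained" result corresponds to the case described above where the graph stays at or above 5 ms. An "occasional" result corresponds to isolated spikes, which are considered acceptable.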
If WAN-COM latency is suspected to be the cause of high latency on hosts using distributed devices, open a live chat with Dell EMC support and reference this article.