VPLEX: Component failures in fabric or array controller lead to performance data unavailability

Summary: This article explains how to mitigate the performance impact of a single component failure in a VPLEX environment.


Instructions

Issue Summary
End users may experience severe impact on some or all hosts connected to VPLEX from issues such as slow drain, array target controller faults, CRC errors, switch ASIC faults, and switch reboots. The VPLEX back end uses a round-robin policy, so a problem on one fabric may impact all host paths on that fabric and may affect paths on the other fabric as well.
  
For switch and array teams   
If an end user reports widespread impact from a single component failure, slow drain, or a similar fault, ask the end user whether VPLEX is in the environment. If VPLEX is in the environment and the extent of the problem is known, request that the end user block the affected path(s) on the switch. If VPLEX is in the environment and the affected paths are not known, engage Dell EMC Customer Support, explain the issue, and mention this article.
 
For the VPLEX Team 
If there is an SR in which the end user reports ongoing impact and the suspected cause is poorly performing back-end paths, identify the poorly performing paths and block them in VPLEX. If the affected paths are not evident, engage a coach for assistance. Switch and array collaborations can be opened once the impact has ended.
 
Background
VPLEX to Array I/O Flow
VPLEX operates much like a clustered host environment. Each director that receives I/O from a host is responsible for completing that I/O. Each director has multiple paths across both fabrics to each LUN and is responsible for balancing the I/O across all available active paths.
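The round-robin balancing described in this article can be illustrated with a minimal sketch. This is not VPLEX code; the path labels and the helper name round_robin are hypothetical and exist only to show that every active path receives an equal share of the I/O, whether or not it is healthy.

```python
from itertools import cycle

def round_robin(paths):
    """Return an iterator that yields the given paths in a repeating, fixed order."""
    return cycle(paths)

# Hypothetical labels: two back-end paths on each of two fabrics.
paths = ["fabricA-path0", "fabricA-path1", "fabricB-path0", "fabricB-path1"]

selector = round_robin(paths)

# Dispatch eight I/Os; each path is chosen exactly twice.
first_eight = [next(selector) for _ in range(8)]
```

Because selection ignores path health, a degraded path keeps receiving its full share of I/O until it is removed from the active set.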
 
VPLEX Fault Detection and Mitigation
The primary method VPLEX uses for detecting and mitigating path faults is to monitor the ratio of timeouts on each path. If 90 percent of the I/O on a path times out in two consecutive 15-second periods, VPLEX banishes the affected path and no longer uses it. VPLEX then periodically probes the banished path and unbanishes it if I/O on that path succeeds again.
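The rule above can be sketched as a small state machine. This is an illustration under stated assumptions, not the VPLEX implementation: the names PathState, record_window, and record_probe are hypothetical, the threshold is assumed to be "at least 90 percent", and an idle period with no I/O is assumed to leave the state unchanged.

```python
from dataclasses import dataclass

BANISH_RATIO = 0.90    # from the article: 90 percent of I/O timing out
WINDOWS_REQUIRED = 2   # from the article: two consecutive 15-second periods

@dataclass
class PathState:
    consecutive_bad_windows: int = 0
    banished: bool = False

def record_window(state, total_ios, timed_out):
    """Update a path with the I/O counts from one 15-second period."""
    if state.banished or total_ios == 0:
        # Assumption: banished or idle paths are not re-evaluated here.
        return state
    if timed_out / total_ios >= BANISH_RATIO:
        state.consecutive_bad_windows += 1
    else:
        # Any period below the threshold resets the count.
        state.consecutive_bad_windows = 0
    if state.consecutive_bad_windows >= WINDOWS_REQUIRED:
        state.banished = True
    return state

def record_probe(state, probe_succeeded):
    """Unbanish a banished path as soon as one probe I/O succeeds."""
    if state.banished and probe_succeeded:
        state.banished = False
        state.consecutive_bad_windows = 0
    return state
```

In this sketch, a path that times out on 89 percent of its I/O in every period is never banished, and a banished path returns to service after a single successful probe.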
 
How Problems Can Arise
Because the threshold for banishing a path is high, probing is frequent, and the threshold for unbanishing is low, VPLEX may continue to use unhealthy paths. As a result, VPLEX may send a significant amount of I/O through poorly performing paths or paths that have experienced soft faults. This I/O either times out or takes an excessive amount of time to complete, which significantly elevates response times across all host paths. This may result in performance data unavailability for any or all hosts connected to the VPLEX.
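A short calculation shows how much one degraded path can affect the average response time. All numbers below are hypothetical (four paths, 1 ms healthy latency, a 10-second I/O timeout, half of the I/O on the bad path timing out), and the function name mean_response_ms is illustrative only. It assumes an even round-robin split across paths.

```python
def mean_response_ms(n_paths, healthy_ms, timeout_ms, bad_timeout_ratio):
    """Expected per-I/O response time when one of n_paths is degraded.

    Assumes each path receives an equal share of I/O and that I/O on the
    degraded path either times out or completes at the healthy latency.
    """
    bad_path_ms = (bad_timeout_ratio * timeout_ms
                   + (1 - bad_timeout_ratio) * healthy_ms)
    return ((n_paths - 1) * healthy_ms + bad_path_ms) / n_paths

# Hypothetical scenario: 50 percent timeouts is well under the 90 percent
# banish threshold, so the degraded path stays in use.
degraded = mean_response_ms(n_paths=4, healthy_ms=1.0,
                            timeout_ms=10_000.0, bad_timeout_ratio=0.5)
healthy = mean_response_ms(n_paths=4, healthy_ms=1.0,
                           timeout_ms=10_000.0, bad_timeout_ratio=0.0)
```

Under these assumptions the average rises from 1 ms to roughly 1.25 seconds, even though three of the four paths are healthy and the degraded path never meets the banish threshold.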


Recommendation
Upgrade to VPLEX GeoSynchrony target code 6.2 P3 or later for improved handling of the above conditions. Refer to the release notes for more details about back-end path management functionality.

Affected Products

VPLEX Series

Products

VPLEX for All Flash, VPLEX Series, VPLEX VS2, VPLEX VS6
Article Properties
Article Number: 000157795
Article Type: How To
Last Modified: 03 Jun 2025
Version:  4