VPLEX: How to diagnose and monitor Back-End issues with the improved Back-End Path Management (BEPM) in GeoSynchrony 6.2
Summary: This article discusses the Back-End (BE) Path Management function introduced in GeoSynchrony 6.2 and how its changed functionality handles Back-End network congestion. ...
Symptoms
Before GeoSynchrony 6.2, VPLEX failed to isolate some paths to Back-End storage volumes that experienced high latency due to fabric congestion or storage array issues. In version 6.2, VPLEX automatically isolates poorly performing paths to storage volumes (paths experiencing high latency) and automatically restores them when performance returns to acceptable levels.
New Call Homes introduced in GeoSynchrony 6.2:
0x8a6b6001 - bepm/1 - The performance of a back-end IT nexus has fallen below acceptable levels.
0x8a6b6004 - bepm/4 - Repeated periods of poor performance have been detected on a back-end IT Nexus.
0x8a6b6007 - bepm/7 - All Initiator-Target-LUNs (ITLs) to a logical unit on this director experience poor performance, so the logical unit is marked degraded.
Back-End paths are made up of an Initiator (VPLEX BE port) and a Target (port on the array) and are referred to as an IT Nexus.
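When reviewing collected log text, it can help to flag lines that mention one of the three call home codes listed above. The following Python sketch is an illustration only: the function name find_bepm_events is hypothetical, and the sketch assumes nothing about the firmware log layout beyond the hex code appearing somewhere in a line.

```python
import re

# Call home events introduced in GeoSynchrony 6.2, as listed in this article.
BEPM_EVENTS = {
    "0x8a6b6001": ("bepm/1", "IT nexus performance fell below acceptable levels"),
    "0x8a6b6004": ("bepm/4", "Repeated periods of poor performance on an IT nexus"),
    "0x8a6b6007": ("bepm/7", "All ITLs to a logical unit perform poorly; logical unit marked degraded"),
}

# Matches only the three codes above, in either letter case.
_EVENT_RE = re.compile(r"0x8a6b600[147]", re.IGNORECASE)


def find_bepm_events(lines):
    """Return (line_number, code, short_name) for every BEPM code found.

    `lines` is any iterable of text lines. The log format itself is not
    assumed; the function only searches each line for the hex codes.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in _EVENT_RE.finditer(line):
            code = match.group(0).lower()
            hits.append((number, code, BEPM_EVENTS[code][0]))
    return hits
```

The function can be fed the lines of any text file that contains the firmware log messages, and the description for each hit is available through BEPM_EVENTS[code][1].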
New CLI commands to check which BE paths may be in a degraded state: back-end degraded list and back-end degraded recover
Refer to the VPLEX 6.2 CLI Guide for more information about the use of these commands.
The back-end degraded list command shows any I-Ts that VPLEX has degraded due to high latency.
The help (-h) option shows how the command can be used:
VPlexcli:/> back-end degraded list -h
synopsis: list [<options>]
This displays a list of degraded I-Ts:
options (* = required):
-h | --help
Displays the usage for this command.
--verbose
Provides more output during command execution. This may not have any effect for some commands.
-g | --group-by= <group_by>
Group degraded I-Ts by the specified field. Supported fields: array, director
Example of I-Ts marked degraded, displayed with a Degradation Reason of Degraded performance:
VPlexcli:/> back-end degraded list
Degraded I-Ts:
Director        Director Port  Initiator           Target              Array                        Degradation Reason
--------------  -------------  ------------------  ------------------  ---------------------------  --------------------
director-1-1-A  A1-FC00        0xc00144878f110800  0x50060160086429bb  EMC-CLARiiON-APM00140624008  Degraded performance
                A1-FC00        0xc00144878f110800  0xc00144878f3d0000  EMC-Invista-LABRATS4900007   Degraded performance
                A1-FC00        0xc00144878f110800  0xc00144878f3d0200  EMC-Invista-LABRATS4900007   Degraded performance
                A1-FC01        0xc00144878f110900  0xc00144878f3d0100  EMC-Invista-LABRATS4900007   Degraded performance
                A1-FC01        0xc00144878f110900  0xc00144878f3d0300  EMC-Invista-LABRATS4900007   Degraded performance
director-1-1-B  B1-FC00        0xc00144878f118800  0xc00144878f3d0000  EMC-Invista-LABRATS4900007   Degraded performance
                B1-FC00        0xc00144878f118800  0xc00144878f3d0200  EMC-Invista-LABRATS4900007   Degraded performance
                B1-FC01        0xc00144878f118900  0xc00144878f3d0100  EMC-Invista-LABRATS4900007   Degraded performance
                B1-FC01        0xc00144878f118900  0xc00144878f3d0300  EMC-Invista-LABRATS4900007   Degraded performance
If a back-end IT path cycles between degraded and un-degraded (flapping) three times within a 30-minute period, the IT Nexus is considered unstable. VPLEX automatically stops using the IT Nexus for host-based I/O and reports the call home event 0x8a6b6004 (bepm/4) in the firmware logs. In this state, the back-end degraded list command lists the Degradation Reason as Isolated due to unstable performance.
The IT Nexus remains degraded until either the user manually restores it with the CLI command back-end degraded recover, or the four-hour default threshold is reached. Once that threshold is reached, the IT Nexus is marked Performance degraded while the recovery process checks its health. If the performance tests pass, the IT Nexus is un-degraded and the path is automatically re-enabled to serve host-based I/O again.
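The flapping rule described above can be modeled as a sliding-window counter. The Python sketch below is a simplified illustration of that rule, not the VPLEX implementation: the class name NexusTracker and its methods are hypothetical, timestamps are plain seconds, and treating a flap exactly 30 minutes old as still inside the window is an assumption.

```python
from collections import deque

FLAP_LIMIT = 3                   # flaps that mark an IT Nexus unstable (from the article)
FLAP_WINDOW_S = 30 * 60          # 30-minute observation window (from the article)
ISOLATION_TIMEOUT_S = 4 * 3600   # four-hour default threshold (from the article)


class NexusTracker:
    """Illustrative model of flap counting for one IT Nexus."""

    def __init__(self):
        self.flaps = deque()      # timestamps of recent flaps, oldest first
        self.isolated_at = None   # time the nexus was isolated, or None

    def record_flap(self, now):
        """Record one degraded/un-degraded cycle; return True if isolated."""
        self.flaps.append(now)
        # Drop flaps that have fallen out of the 30-minute window.
        while self.flaps and now - self.flaps[0] > FLAP_WINDOW_S:
            self.flaps.popleft()
        if len(self.flaps) >= FLAP_LIMIT and self.isolated_at is None:
            self.isolated_at = now
        return self.isolated_at is not None

    def due_for_recheck(self, now):
        """True once the four-hour threshold has passed since isolation."""
        return (self.isolated_at is not None
                and now - self.isolated_at >= ISOLATION_TIMEOUT_S)

    def manual_recover(self):
        """Model of a manual recovery: clear the history and the isolation."""
        self.flaps.clear()
        self.isolated_at = None
```

In this model, three flaps at 0, 600, and 1200 seconds isolate the nexus on the third flap, while flaps at 0, 1000, and 2000 seconds do not, because the first flap has left the window by the time the third arrives.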
Example of the unstable state (intermittent performance degradation), where the Degradation Reason is Isolated due to unstable performance:
VPlexcli:/> back-end degraded list
Degraded I-Ts:
Director        Director      Initiator           Target              Array                            Degradation Reason
--------------  Port          ------------------  ------------------  -------------------------------  -----------------------
--------------  ------------  ------------------  ------------------  -------------------------------  -----------------------
director-1-1-A  A1-FC00       0xc001448798b90800  0x5000097398037804  EMC-SYMMETRIX-197600222          Isolated due to unstable performance
                A1-FC00       0xc001448798b90800  0x5000097398037805  EMC-SYMMETRIX-197600222          Isolated due to unstable performance
If no paths are degraded, then the back-end degraded list command reports this:
VPlexcli:/> back-end degraded list
No paths are currently degraded.
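For reporting or scripting, the text printed by back-end degraded list can be turned into structured records. The Python sketch below is a minimal example based only on the sample outputs in this article. The function name parse_degraded_list is hypothetical, and the parser assumes that director names begin with "director-" and that initiator and target port names begin with "0x", as they do in the examples.

```python
def parse_degraded_list(output):
    """Parse `back-end degraded list` text into a list of dicts.

    Rows that omit the director column inherit the director from the
    most recent row that named one, matching the sample output layout.
    Header, separator, prompt, and "No paths ..." lines are skipped.
    """
    rows = []
    director = None
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Assumption: director names start with "director-", as in the samples.
        if tokens[0].startswith("director-"):
            director = tokens[0]
            tokens = tokens[1:]
        # A data row is: port, initiator, target, array, then the reason words.
        if len(tokens) < 5:
            continue
        if not (tokens[1].startswith("0x") and tokens[2].startswith("0x")):
            continue
        port, initiator, target, array = tokens[:4]
        rows.append({
            "director": director,
            "port": port,
            "initiator": initiator,
            "target": target,
            "array": array,
            "reason": " ".join(tokens[4:]),
        })
    return rows
```

Because the parser splits on any run of whitespace, it does not depend on the exact column widths of the output. An empty result corresponds to the "No paths are currently degraded." case.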
The other new CLI command, back-end degraded recover, recovers degraded back-end paths. The help (-h) option shows how the command can be used:
VPlexcli:/> back-end degraded recover -h
synopsis: recover [<options>]
Recovers the specified degraded I-Ts:
options (* = required):
-h | --help
Displays the usage for this command.
--verbose
Provides more output during command execution. This may not have any effect for some commands.
-p | --paths= <paths>
The degraded I-Ts to recover. Each I-T must be expressed as a pair in the form "(<initiator>,<target>)".
--all
Recover all currently degraded I-Ts.
Example of recovering a single I-T, displayed with a Degradation Reason of Isolated due to unstable performance:
VPlexcli:/> back-end degraded recover -p (0xc00144878bda0900,0x5006016547e01af9)
Recovered I-Ts:
Director        Director Port  Initiator           Target              Array                        Degradation Reason
--------------  -------------  ------------------  ------------------  ---------------------------  ------------------
director-1-1-A  A1-FC01        0xc00144878bda0900  0x5006016547e01af9  EMC-CLARiiON-APM00164919257  Isolated due to unstable performance
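When the recover command is generated from a script, a small helper can build the "(<initiator>,<target>)" pair and catch mistyped port names before the command is run. The Python sketch below only formats the command string. The function name recover_command is hypothetical, and the check for "0x" followed by 16 hexadecimal digits is an assumption drawn from the port names shown in this article's examples.

```python
import re

# Assumption: port names are "0x" plus 16 hex digits, as in the article's examples.
_PORT_NAME_RE = re.compile(r"^0x[0-9a-fA-F]{16}$")


def recover_command(initiator, target):
    """Return the CLI text that recovers a single degraded I-T.

    Raises ValueError if either port name does not look like the
    port names shown in the sample output.
    """
    for label, value in (("initiator", initiator), ("target", target)):
        if not _PORT_NAME_RE.match(value):
            raise ValueError(f"{label} {value!r} is not 0x followed by 16 hex digits")
    return f"back-end degraded recover -p ({initiator},{target})"
```

For the I-T shown above, recover_command("0xc00144878bda0900", "0x5006016547e01af9") produces the same command line as the example.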
Example of recovering all degraded I-Ts:
VPlexcli:/> back-end degraded recover --all
Recovered I-Ts:
Director        Director Port  Initiator           Target              Array                        Degradation Reason
--------------  -------------  ------------------  ------------------  ---------------------------  ------------------
director-1-1-A  A1-FC00        0xc00144878bda0800  0x5000144260321e00  EMC-Invista-rc-surry-1       Isolated due to unstable performance
director-1-1-B  B1-FC01        0xc00144878bda8900  0x5006016547e01af9  EMC-CLARiiON-APM00164919257  Isolated due to unstable performance
If the intermittent latency issue continues for the impacted IT Nexus and the root cause cannot be addressed quickly, engage VPLEX Customer Support through Live Chat to manually mark the IT Nexus degraded. This removes the path from use until the underlying issue can be resolved.
Cause
Issues external to the VPLEX, such as fabric congestion or array issues, can lead to back-end issues for the VPLEX. While GeoSynchrony 6.2 is designed to better handle this kind of BE congestion, it is recommended that the congestion be resolved as soon as possible.
To detect the source of the congestion, Dell has an FC port monitoring feature that monitors for errors in the fabric of any BE FC port. The results can help to narrow down network issues in the fabric. As of GeoSynchrony 6.2, the FC Port Monitor is on by default.
If you are still running any version of GeoSynchrony 6.0.x or 6.1.x, are not yet ready to upgrade to 6.2.x, and would like the FC Port Monitoring script loaded on your VPLEX, contact Dell Customer Support to load the script.
Resolution
GeoSynchrony 6.2 was designed to better handle this type of network congestion. When the BEPM feature of the VPLEX code is triggered, it indicates issues external to VPLEX. The cause of the network congestion or storage array issues should be repaired immediately. The data provided by the VPLEX logs can be used to help narrow down where the issues occur. Once the problem is repaired, VPLEX automatically restores the affected I-Ts after they become healthy again.