VPLEX: For VS6 Hardware Possible Total Cluster Outage (TCO) during NDU I/O Forwarding Phase due to FCID differences
Summary: During the I/O forwarding phase of an NDU, VPLEX might experience a total cluster outage if an initiator logs into the same target port with different nPortIds. A GenU can also be impacted if an initiator logs into the same target port with a different nPortId. Users running NPIV should enable persistent FCID on their SAN switches so that host HBAs log back in with the same FCID they had previously, both during the NDU and when a host logs out and back in during a GenU. ...
Symptoms
Impacted VPLEX Hardware:
EMC Hardware: VPLEX VS6
EMC Hardware: VPLEX-Local
EMC Hardware: VPLEX-Metro
NOTE:
This KB applies only to VS6 hardware.
The issue described in this article does not affect VS2 hardware.
Impacted VPLEX GeoSynchrony code versions:
EMC Software: GeoSynchrony 6.0 Patch 1
EMC Software: GeoSynchrony 6.0 Patch 2
EMC Software: GeoSynchrony 6.0 Service Pack 1
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 1
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 2
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 3
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 4
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 5
Impact: This issue is limited to VS6 systems running 6.0 Patch 1 through 6.0 SP1 P5. The risk is potentially greater in AIX environments, but VMware environments may also be susceptible.
This issue does not affect VS2.
During the I/O forwarding phase of an NDU, VPLEX might experience a total cluster outage if an initiator logs into the same target port with a different nPortId. Although a GenU has no I/O forwarding phase, one or more hosts could log out and then log back in with a different nPortId, and the VS6 is susceptible to this issue when that happens.
This issue can cause an NDU failure and rollback, or in the case of a GenU can cause one or more directors to fail.
This issue can also cause all second upgrader directors to fail, which could result in a Data Unavailable (DU) event and potentially a Total Cluster Outage (TCO).
Symptoms:
Output from failed NDU session
================================================================================
[Tue Jul 18 17:57:36 2017] Post NDU tasks
================================================================================
Please verify system health by issuing 'health-check' and 'ndu pre-check' commands.
Warning: Unable to restart perpetual monitors. Please restart manually: 'java.lang.RuntimeException' object has no attribute 'get'
NDU roll back of 1st upgraders failed. Investigate automatic recovery failure before attempting manual recovery
/engines/engine-2-1/directors/director-2-1-A has missing peers: {u'numSits': u'12', u'reachableSameVersionNodes': [u's2_01fc_spa', u's2_0293_spa', u's2_0294_spa'], u'fullState': u'0x83', u'reachableCrossVersionNodes': [], u'comUuid': u's2_0238_spa', u'maxIOsize': u'1028', u'configuredNumPeers': u'2', u'numOutstandingIOsForwardToPeers': u'0', u'peerNodes': [], u'fullStateFlags': u'00000000000000000000000010000011', u'site': u'2', u'enabledState': u'enabled', u'numOutstandingIOsReceivedFromForwarders': u'0', u'numVSits': u'0', u'runningState': u'running'}
Timeout after waiting 600 seconds for directors to be stable
================================================================================
The output for 'ndu start' has been captured in /var/log/VPlex/cli/capture/ndu-start-session.txt
ndu start: Evaluation of <<ndu start -u /tmp/VPlex-6.0.1.04.00.07-director-firmware-package.tar --skip-cluster-status-check --skip-view-health-check
--skip-view-config-check --skip-meta-volume-backup-check --io-fwd-ask-for-confirmation on-missing-logins>> failed.
cause: Command execution failed.
cause: NDU roll back of 1st upgraders failed. Investigate automatic recovery failure before attempting manual recovery
cause: /engines/engine-2-1/directors/director-2-1-A has missing peers: {u'numSits': u'12', u'reachableSameVersionNodes': [u's2_01fc_spa', u's2_0293_spa',
u's2_0294_spa'], u'fullState': u'0x83', u'reachableCrossVersionNodes': [], u'comUuid': u's2_0238_spa', u'maxIOsize': u'1028', u'configuredNumPeers': u'2',
u'numOutstandingIOsForwardToPeers': u'0', u'peerNodes': [], u'fullStateFlags': u'00000000000000000000000010000011', u'site': u'2', u'enabledState':
u'enabled', u'numOutstandingIOsReceivedFromForwarders': u'0', u'numVSits': u'0', u'runningState': u'running'}
cause: Timeout after waiting 600 seconds for directors to be stable
In this example, director-2-1-A reports missing peers: 'configuredNumPeers' is 2, but 'peerNodes' is empty.
Firmware logs contain director failure messages (ASSERT) from stpl-iofwd.c:
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fbf92144854-2":67300:<0>2017/07/18 17:18:23.61: utl/0 ASSERT: /export/local1/jenkins/clone_D50.10/nsfw/snac/stdf/ext/stpl-iofwd.c:stdfStpl_destroyITNexus/245: failed to detach it nexus
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fbf92144854-2":67301:<4>2017/07/18 17:18:23.61: floor/4 tower halting with status 1
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fcaa6144854-2":67416:<0>2017/07/18 17:18:23.61: utl/0 ASSERT: /export/local1/jenkins/clone_D50.10/nsfw/snac/stdf/ext/stpl-iofwd.c:stdfStpl_destroyITNexus/245: failed to detach it nexus
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fcaa6144854-2":67417:<4>2017/07/18 17:18:23.61: floor/4 tower halting with status 1
Firmware logs show the same initiator logging in to the same target port with different nPortIds:
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:27: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:27: stdf/17 FCP connection established. IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:40: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:40: stdf/17 FCP connection established. IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:40: io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is closing.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:40: stdf/18 FCP connection lost. IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:18:24: io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is closing.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:19:35: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:19:35: stdf/17 FCP connection established. IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:34:36: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:34:36: stdf/17 FCP connection established. IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
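As an illustration only, the check described above (one initiator WWN logging in to the same target port with more than one nPortId) can be sketched as a small Python script. The regular expression is an assumption based solely on the login lines in this excerpt, the function name find_fcid_changes is hypothetical, and the sample lines are trimmed copies of the log lines above; this is not a VPLEX tool.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the login lines shown in this article, e.g.
# "... fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is ready."
LOGIN_RE = re.compile(
    r"\((?P<port>[^)]+)\): login with (?P<wwn>0x[0-9a-fA-F]+) "
    r"\(nPortId (?P<npid>0x[0-9a-fA-F]+)\) type target is (?P<state>\w+)"
)


def find_fcid_changes(lines):
    """Return {(target_port, initiator_wwn): [nPortIds in order first seen]}
    for every pair that reached the 'ready' state with more than one nPortId."""
    seen = defaultdict(list)
    for line in lines:
        match = LOGIN_RE.search(line)
        if match is None or match.group("state") != "ready":
            continue  # skip non-login lines and 'closing' events
        key = (match.group("port"), match.group("wwn"))
        npid = match.group("npid").lower()
        if npid not in seen[key]:
            seen[key].append(npid)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Trimmed copies of the login lines above (log prefix removed for brevity).
SAMPLE = [
    "io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is ready.",
    "io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is ready.",
    "io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is closing.",
    "io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is closing.",
    "io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.",
    "io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.",
]

if __name__ == "__main__":
    for (port, wwn), ids in find_fcid_changes(SAMPLE).items():
        print(f"{wwn} logged in to {port} with multiple nPortIds: {', '.join(ids)}")
```

Run against the sample, the script reports initiator 0x200000cc05bb004e on A0-FC01 with nPortIds 0x33ca48, 0x33ca49 and 0x33ca4a, which matches the pattern visible in the excerpt.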
Cause
Resolution
Permanent Fix:
The issue is addressed in GeoSynchrony 6.0 SP1 P6 and later. Upgrades to this and later VPLEX code levels are not vulnerable to this issue.
Workaround:
Fabrics assign an nPortId to each initiator. The following switch behaviors have been noted with respect to nPortId:
- Brocade switches running in FC-SW mode, without using NPIV will assign the same nPortId to an initiator port.
- Brocade switches running in FC-SW mode, using NPIV, can assign a different nPortId when an initiator re-connects to the fabric.
- Brocade switches running in AG mode can assign a different nPortId when an initiator re-connects to the fabric.
- Cisco switches running in FC-SW mode with the persistent FCID feature enabled will assign the same nPortId to an initiator port.
- Cisco switches running in NPV mode can assign a different nPortId when an initiator re-connects to the fabric.
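For quick reference, the behaviors listed above can be captured in a small lookup table. This is a sketch only: the key names ("npiv", "no-npiv", "persistent-fcid") and the function nportid_behavior are hypothetical labels chosen for this example, and any combination not listed above is reported as "unknown" rather than guessed.

```python
# Encodes only the five behaviors listed above. Keys are
# (vendor, switch mode, feature); the feature labels are hypothetical.
NPORTID_BEHAVIOR = {
    ("brocade", "fc-sw", "no-npiv"): "same",
    ("brocade", "fc-sw", "npiv"): "may-change",
    ("brocade", "ag", None): "may-change",
    ("cisco", "fc-sw", "persistent-fcid"): "same",
    ("cisco", "npv", None): "may-change",
}


def nportid_behavior(vendor, mode, feature=None):
    """Return 'same', 'may-change', or 'unknown' for an initiator re-login."""
    key = (vendor.lower(), mode.lower(), feature)
    return NPORTID_BEHAVIOR.get(key, "unknown")


if __name__ == "__main__":
    print(nportid_behavior("Brocade", "AG"))
    print(nportid_behavior("Cisco", "FC-SW", "persistent-fcid"))
```

Any configuration that returns "may-change" is one where an initiator could come back with a different nPortId after logging out and back in.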
The following are VPLEX Engineering recommendations to avoid this issue during an NDU or GenU:
- Use fabric modes in which an initiator keeps the same nPortId after logging off and logging back in.
- Brocade switches running in FC-SW mode with NPIV in use should follow the recommendations to activate WWN-based persistent PID assignment in the Brocade Fabric OS Administration Guide http://www.brocade.com/content/html/en/administration-guide/fos-741-webtools/GUID-45FCE596-5D54-41D8-8E57-D3CD5FA85E44.html
- Brocade switches running in AG mode should follow the recommendation to enable the persistent ALPA policy in the Brocade Access Gateway Administrator's Guide http://www.brocade.com/content/html/en/administration-guide/fos-741-accessgateway/GUID-B71C975E-49FF-49CF-A32C-0AC4DA2A938C.html
- Cisco switches running in NPV mode should follow KB 320946 to enable the persistent FC ID feature.
- For the duration of the NDU, cable maintenance and fabric re-zoning should be avoided.
- In the case where AIX NPIV hosts are in the environment, shorter I/O forwarding times should be used.