VPLEX: For VS6 Hardware Possible Total Cluster Outage (TCO) during NDU I/O Forwarding Phase due to FCID differences

Summary: During the I/O forwarding phase of an NDU, VPLEX might experience a total cluster outage if an initiator logs in to the same target port with different nPortIds. During a GenU, an initiator that logs in to the same target port with a different nPortId can also disrupt the GenU. Users running NPIV should enable persistent FCID on their SAN switches so that host HBAs logging back in during the NDU, or a host logging out and back in during a GenU, keep the FCID they had previously. ...

Symptoms



Impacted VPLEX Hardware: 
EMC Hardware: VPLEX VS6
EMC Hardware: VPLEX-Local
EMC Hardware: VPLEX-Metro

NOTE:
This KB applies only to the VS6 Hardware.
The issue described in this article does not affect the VS2 Hardware.


Impacted VPLEX GeoSynchrony code versions:
EMC Software: GeoSynchrony 6.0 Patch 1
EMC Software: GeoSynchrony 6.0 Patch 2
EMC Software: GeoSynchrony 6.0 Service Pack 1
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 1
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 2
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 3
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 4
EMC Software: GeoSynchrony 6.0 Service Pack 1 Patch 5

Impact: This issue is limited to VS6 systems running 6.0 Patch 1 through 6.0 SP1 P5. The risk is potentially greater in an AIX environment, but VMware environments may also be susceptible.
This issue does not affect VS2.
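
The exposure criteria above can be expressed as a simple lookup. The following is a minimal sketch, assuming the hardware type and release label are already known as strings; the function name `is_exposed` and the exact label spelling are illustrative choices, not part of any VPLEX tool.

```python
# Release labels copied from the "Impacted VPLEX GeoSynchrony code versions" list.
AFFECTED_RELEASES = {
    "6.0 Patch 1",
    "6.0 Patch 2",
    "6.0 Service Pack 1",
    "6.0 Service Pack 1 Patch 1",
    "6.0 Service Pack 1 Patch 2",
    "6.0 Service Pack 1 Patch 3",
    "6.0 Service Pack 1 Patch 4",
    "6.0 Service Pack 1 Patch 5",
}


def is_exposed(hardware: str, release: str) -> bool:
    """Return True when the hardware/release pair is in the impacted set.

    Hypothetical helper: it only compares strings against the list above.
    """
    normalized = " ".join(release.split())  # collapse repeated whitespace
    return hardware.strip().upper() == "VS6" and normalized in AFFECTED_RELEASES
```

A VS2 system, or a VS6 system on 6.0 Service Pack 1 Patch 6, would return False from this check.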

During the I/O forwarding phase of an NDU, VPLEX might experience a total cluster outage if an initiator logs in to the same target port with a different nPortId. Although a GenU has no I/O forwarding phase, one or more hosts could log out and then log back in with a different nPortId, and the VS6 is susceptible to this issue in that case as well.

This issue can cause an NDU failure and rollback or, in the case of a GenU, can cause one or more directors to fail.

This issue can also cause all second upgrader directors to fail, which could result in a Data Unavailable (DU) event and potentially a Total Cluster Outage (TCO).

Symptoms:
Output from a failed NDU session:
================================================================================
[Tue Jul 18 17:57:36 2017] Post NDU tasks
================================================================================
Please verify system health by issuing 'health-check' and 'ndu pre-check' commands.
Warning: Unable to restart perpetual monitors. Please restart manually: 'java.lang.RuntimeException' object has no attribute 'get'
NDU roll back of 1st upgraders failed. Investigate automatic recovery failure before attempting manual recovery
/engines/engine-2-1/directors/director-2-1-A has missing peers: {u'numSits': u'12', u'reachableSameVersionNodes': [u's2_01fc_spa', u's2_0293_spa', u's2_0294_spa'], u'fullState': u'0x83', u'reachableCrossVersionNodes': [], u'comUuid': u's2_0238_spa', u'maxIOsize': u'1028', u'configuredNumPeers': u'2', u'numOutstandingIOsForwardToPeers': u'0', u'peerNodes': [], u'fullStateFlags': u'00000000000000000000000010000011', u'site': u'2', u'enabledState': u'enabled', u'numOutstandingIOsReceivedFromForwarders': u'0', u'numVSits': u'0', u'runningState': u'running'}
Timeout after waiting 600 seconds for directors to be stable
================================================================================
The output for 'ndu start' has been captured in /var/log/VPlex/cli/capture/ndu-start-session.txt
ndu start:  Evaluation of <<ndu start -u /tmp/VPlex-6.0.1.04.00.07-director-firmware-package.tar --skip-cluster-status-check --skip-view-health-check
            --skip-view-config-check --skip-meta-volume-backup-check --io-fwd-ask-for-confirmation on-missing-logins>> failed.
cause:      Command execution failed.
cause:      NDU roll back of 1st upgraders failed. Investigate automatic recovery failure before attempting manual recovery
cause:      /engines/engine-2-1/directors/director-2-1-A has missing peers: {u'numSits': u'12', u'reachableSameVersionNodes': [u's2_01fc_spa', u's2_0293_spa',
            u's2_0294_spa'], u'fullState': u'0x83', u'reachableCrossVersionNodes': [], u'comUuid': u's2_0238_spa', u'maxIOsize': u'1028', u'configuredNumPeers': u'2',
            u'numOutstandingIOsForwardToPeers': u'0', u'peerNodes': [], u'fullStateFlags': u'00000000000000000000000010000011', u'site': u'2', u'enabledState':
            u'enabled', u'numOutstandingIOsReceivedFromForwarders': u'0', u'numVSits': u'0', u'runningState': u'running'}
cause:      Timeout after waiting 600 seconds for directors to be stable

The output shows that director-2-1-A has missing peers: configuredNumPeers is 2, but peerNodes is empty.
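
The "has missing peers" line in the NDU output embeds the director state as a dictionary literal, so the configured and observed peer counts can be compared programmatically. The following is a minimal sketch, assuming the line is supplied unwrapped (as in its first occurrence in the output, not the wrapped "cause:" copy); the function name `missing_peer_summary` is hypothetical.

```python
import ast
import re

# Captures the director path and the state dictionary printed by 'ndu start'.
MISSING_PEERS = re.compile(
    r"(?P<director>/engines/\S+) has missing peers: (?P<state>\{.*\})"
)


def missing_peer_summary(line):
    """Return (director, configured_peers, seen_peers), or None if no match."""
    match = MISSING_PEERS.search(line)
    if match is None:
        return None
    # The state is printed with u'...' string prefixes; ast.literal_eval
    # parses that literal safely without executing any code.
    state = ast.literal_eval(match.group("state"))
    return (
        match.group("director"),
        int(state["configuredNumPeers"]),
        len(state["peerNodes"]),
    )
```

A result where the third value is lower than the second indicates that the director did not see all of its configured peers.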

Firmware logs contain director failure (ASSERT) messages from stpl-iofwd.c:

128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fbf92144854-2":67300:<0>2017/07/18 17:18:23.61: utl/0 ASSERT: /export/local1/jenkins/clone_D50.10/nsfw/snac/stdf/ext/stpl-iofwd.c:stdfStpl_destroyITNexus/245: failed to detach it nexus
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fbf92144854-2":67301:<4>2017/07/18 17:18:23.61: floor/4 tower halting with status 1
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fcaa6144854-2":67416:<0>2017/07/18 17:18:23.61: utl/0 ASSERT: /export/local1/jenkins/clone_D50.10/nsfw/snac/stdf/ext/stpl-iofwd.c:stdfStpl_destroyITNexus/245: failed to detach it nexus
128.xxx.xxx.xx/cpu0/log:5988:W/"0060166fcaa6144854-2":67417:<4>2017/07/18 17:18:23.61: floor/4 tower halting with status 1
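
To search a collected firmware log for this signature, a pattern match on the ASSERT text is usually enough. The following is a minimal sketch, assuming log lines in the format shown above; the function name `find_iofwd_asserts` is hypothetical, and the quoted identifier in the log prefix is captured as `tag` because the article does not state what it represents.

```python
import re

# Matches the stpl-iofwd.c ASSERT shown above and captures the quoted
# prefix identifier and the timestamp.
ASSERT_LINE = re.compile(
    r'W/"(?P<tag>[^"]+)":\d+:<\d>'
    r"(?P<when>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?): "
    r"utl/0 ASSERT: \S*stpl-iofwd\.c:stdfStpl_destroyITNexus/\d+: "
    r"failed to detach it nexus"
)


def find_iofwd_asserts(lines):
    """Return a list of (tag, timestamp) for each matching ASSERT line."""
    hits = []
    for line in lines:
        match = ASSERT_LINE.search(line)
        if match:
            hits.append((match.group("tag"), match.group("when")))
    return hits
```

More than one distinct `tag` value in the result, as in the excerpt above, suggests the assert was logged from more than one source.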


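Firmware logs show the same initiator logging in to the same target port with different nPortIds. In the example below, initiator 0x200000cc05bb004e logs in to A0-FC01 with nPortId 0x33ca48, then 0x33ca49, and then 0x33ca4a: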

128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:27: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:27: stdf/17 FCP connection established.  IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:40: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:40: stdf/17 FCP connection established.  IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:17:40: io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca48) type target is closing.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:17:40: stdf/18 FCP connection lost.  IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:18:24: io-port/38 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca49) type target is closing.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:19:35: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:19:35: stdf/17 FCP connection established.  IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<6>2017/07/18 17:34:36: io-port/37 fc-01 (A0-FC01): login with 0x200000cc05bb004e (nPortId 0x33ca4a) type target is ready.
128.xxx.xxx.xx/cpu0/log:5988:W/"1234567891234567-2":12345:<4>2017/07/18 17:34:36: stdf/17 FCP connection established.  IT: dmbut170_C2_CHM_0 (0x200000cc05bb004e) A0-FC01 (0xc001448782380100)
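
The pattern in the log excerpt above (one initiator WWN, one target port, several nPortIds) can be detected by grouping the "is ready" login lines. The following is a minimal sketch, assuming log lines in the format shown above; the function names are hypothetical and this is not a VPLEX-supplied tool.

```python
import re
from collections import defaultdict

# Captures the target port name, initiator WWN and nPortId from "is ready" logins.
LOGIN_READY = re.compile(
    r"\((?P<port>[^)]+)\): login with (?P<wwn>0x[0-9a-fA-F]+) "
    r"\(nPortId (?P<nportid>0x[0-9a-fA-F]+)\) type target is ready"
)


def nportids_by_nexus(lines):
    """Map (initiator WWN, target port) to the distinct nPortIds seen, in order."""
    seen = defaultdict(list)
    for line in lines:
        match = LOGIN_READY.search(line)
        if match:
            key = (match.group("wwn"), match.group("port"))
            nportid = match.group("nportid")
            if nportid not in seen[key]:
                seen[key].append(nportid)
    return dict(seen)


def changed_nportids(lines):
    """Keep only the initiator/port pairs that logged in with more than one nPortId."""
    return {key: ids for key, ids in nportids_by_nexus(lines).items() if len(ids) > 1}
```

Any entry returned by `changed_nportids` identifies an initiator whose FCID did not persist across logins, which is the condition described in this article.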



 

Cause

The current VPLEX VS6 Emulex driver (ELX) does not support the same initiator logging in to the same target with two different nPortIds. This causes the I/O forwarding (IOFWD) peer director to suffer a software exception during an NDU or a GenU. The result is a Data Unavailable (DU) event and either an NDU rollback or a GenU failure.

Resolution

Permanent Fix:
The issue is addressed in GeoSynchrony 6.0 SP1 P6 and later.  Upgrades to this and later VPLEX code levels are not vulnerable to this issue.

Workaround:

Fabrics assign an nPortId to each initiator. Switches behave as follows with respect to nPortId:

  • Brocade switches running in FC-SW mode without NPIV will assign the same nPortId to an initiator port.
  • Brocade switches running in FC-SW mode with NPIV can assign a different nPortId when an initiator re-connects to the fabric.
  • Brocade switches running in AG mode can assign a different nPortId when an initiator re-connects to the fabric.
  • Cisco switches running in FC-SW mode with the persistent FCID feature enabled will assign the same nPortId to an initiator port.
  • Cisco switches running in NPV mode can assign a different nPortId when an initiator re-connects to the fabric.
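
The switch behaviors listed above can be captured as a small lookup for use in a pre-upgrade checklist. The following is a minimal sketch; the key labels (for example "no-npiv" and "persistent-fcid") and the function name `nportid_may_change` are illustrative assumptions, and the table covers only the five configurations this article describes.

```python
# (vendor, mode, feature) -> True if the article states the nPortId stays the same.
# The labels are hypothetical; only the five listed configurations are covered.
NPORTID_STABLE = {
    ("brocade", "fc-sw", "no-npiv"): True,
    ("brocade", "fc-sw", "npiv"): False,
    ("brocade", "ag", None): False,
    ("cisco", "fc-sw", "persistent-fcid"): True,
    ("cisco", "npv", None): False,
}


def nportid_may_change(vendor, mode, feature=None):
    """Return True or False per the list above, or None when the article is silent."""
    stable = NPORTID_STABLE.get((vendor.lower(), mode.lower(), feature))
    return None if stable is None else not stable
```

A None result means the configuration is not covered by the list and should be checked with the switch vendor rather than assumed safe.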

The following are VPLEX Engineering recommendations to avoid this issue during an NDU or GenU:

Affected Products

VPLEX Series

Products

VPLEX Series, VPLEX VS6
Article Properties
Article Number: 000167355
Article Type: Solution
Last Modified: 04 May 2026
Version:  5