Connectrix: Cisco MDS9700 DS-X9448-768K9: Link failure Link Reset failed nonempty recv queue errors seen after port hwfailure
Summary: Cisco MDS9700 DS-X9448-768K9: Link failure Link Reset failed nonempty recv queue errors seen after port hwfailure.
Symptoms
- Ports 9-12 on the second ASIC of a line card (LC) failed with hwfailure errors. Ports 13-16, which are on the same ASIC, were dropping packets and causing congestion on the switch, but these ports were not reported as faulted.
- The affected Line Card is a 48 port 16 Gbps Advanced FC Module (DS-X9448-768K9)
Cause
This issue is caused by the following Cisco defect:
CSCuw59045 > MDS9700 DS-X9448-768K9 - xbar sync loss must fail all eight ports.
Symptom:
After an internal hardware failure, frame corruption, frame drops, or both occur on a block of four ports. The following syslog message indicates the hardware failure:
Example of port ASIC/fabric link hardware failure:
MODULE-4-MOD_WARNING: Module 4 (Serial number: JAE180605XF) reported warning fc4/9-12 due to SAC sync lost in device DEV_LOCAL_SAC_ASIC (device error 0xc9101200)
CALLHOME-2-EVENT: MODULE_WARNING
MODULE-2-MOD_SOMEPORTS_FAILED: Module 4 (Serial number: JAE180605XF) reported failure on ports fc4/9-12 (Fibre Channel) due to Local serial link syncing exception in device DEV_LOCAL_SAC_ASIC (device error 0xc9101204)
The following port hardware failure errors are logged:
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/12 is down (Hardware Failure) vmax
CALLHOME-2-EVENT: PORT_FAILURE
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/11 is down (Hardware Failure) server1
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/10 is down (Hardware Failure) server2
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 1%$ Interface fc4/9 is down (Hardware Failure) ISL
When this failure occurs, NX-OS sets only four ports, instead of all eight ports on the affected port ASIC, to the 'hwFailure' state. The remaining four affected ports are left enabled but behave as slow-drain ports. When this occurs, the output of show logging onboard records the following counters incrementing:
fc1/5 |F16_TMM_TOLB_TIMEOUT_DROP_CNT |13025 |01/01/16
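When reviewing a large amount of onboard log output, a small script can help pick out the rows of interest. The following is a minimal Python sketch, assuming each row has the same '|'-separated layout as the sample line above (interface, counter name, count, date). The function names are hypothetical and are not part of NX-OS or any Cisco tool.

```python
# Counter name taken from the sample row in this article.
TIMEOUT_DROP_COUNTER = "F16_TMM_TOLB_TIMEOUT_DROP_CNT"


def parse_obfl_row(line):
    """Split one '|'-separated row into named fields.

    Assumes the four-column layout shown in this article:
    interface | counter name | count | date
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 4:
        raise ValueError(f"unexpected row layout: {line!r}")
    interface, counter, count, date = parts[:4]
    return {
        "interface": interface,
        "counter": counter,
        "count": int(count),
        "date": date,
    }


def is_timeout_drop(row):
    """True when the row reports a non-zero timeout-drop count."""
    return row["counter"] == TIMEOUT_DROP_COUNTER and row["count"] > 0


row = parse_obfl_row("fc1/5 |F16_TMM_TOLB_TIMEOUT_DROP_CNT |13025 |01/01/16")
print(row["interface"], row["count"], is_timeout_drop(row))
```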
Other symptoms include link resets and link reset failures on unrelated interfaces. These errors are caused by the traffic destined to the four ports that were not disabled.
PORT-5-IF_DOWN_LINK_FAILURE: %$VSAN 101%$ Interface fc8/47 is down (Link failure Link Reset failed nonempty recv queue) server3
VSAN 101%$ Interface fc8/32 is down (Link failure Link Reset failed nonempty recv queue) server4
Conditions:
This issue only occurs on the MDS 9700 DS-X9448-768K9 linecard after an internal fabric link failure.
Resolution
- Verify whether only four of the eight ports on the ASIC are showing hwfailure.
- Verify whether "Link failure Link Reset failed nonempty recv queue" errors are being logged repeatedly for other interfaces. These port errors are a symptom of congestion in the switch and not the root cause.
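The first check can be sketched in Python as follows. This is an illustrative example only: it assumes the syslog text has been saved to a string, that hardware failures appear in the "Interface fcX/Y is down (Hardware Failure)" form shown in this article, and that ports are grouped eight per ASIC. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the hardware-failure message format quoted in this article.
HW_FAILURE = re.compile(r"Interface fc(\d+)/(\d+) is down \(Hardware Failure\)")


def partially_failed_groups(log_text, group_size=8):
    """Return {(slot, first_port_of_group): [ports not reported failed]}.

    Only groups where some, but not all, ports logged a hardware
    failure are returned.
    """
    failed = defaultdict(set)
    for slot, port in HW_FAILURE.findall(log_text):
        slot, port = int(slot), int(port)
        first = (port - 1) // group_size * group_size + 1
        failed[(slot, first)].add(port)

    partial = {}
    for (slot, first), ports in failed.items():
        remaining = sorted(set(range(first, first + group_size)) - ports)
        if remaining:
            partial[(slot, first)] = remaining
    return partial


sample = """
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/12 is down (Hardware Failure) vmax
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/11 is down (Hardware Failure) server1
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 101%$ Interface fc4/10 is down (Hardware Failure) server2
PORT-5-IF_DOWN_HW_FAILURE: %$VSAN 1%$ Interface fc4/9 is down (Hardware Failure) ISL
"""
print(partially_failed_groups(sample))  # {(4, 9): [13, 14, 15, 16]}
```

A non-empty result points at ports that share an ASIC with failed ports but were left enabled, which matches the condition described in this article.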
Workaround:
Step 1 - Manually shut down the remaining four ports on the faulty ASIC to prevent data loss or corruption.
Gen 5 port groupings (x is the affected LC slot number):
fcx/1-8
fcx/9-16
fcx/17-24
fcx/25-32
fcx/33-40
fcx/41-48
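As an illustration of Step 1, the Python sketch below maps failed ports to their eight-port group from the list above and prints the commands to shut down the ports in that group that are still enabled. It is a helper for preparing the change, not an official tool: the function names are hypothetical, and the generated lines use only the standard NX-OS configure terminal, interface, and shutdown commands, one interface at a time. Review the output before applying it to a switch.

```python
def asic_port_group(port, group_size=8, ports_per_card=48):
    """Return the list of ports sharing an ASIC group with `port`.

    Groups follow the list in this article: 1-8, 9-16, ... 41-48.
    """
    if not 1 <= port <= ports_per_card:
        raise ValueError(f"port {port} is outside 1-{ports_per_card}")
    first = (port - 1) // group_size * group_size + 1
    return list(range(first, first + group_size))


def shutdown_commands(slot, failed_ports):
    """Build CLI lines that shut the still-enabled ports of each affected group."""
    failed = set(failed_ports)
    affected = set()
    for port in failed:
        affected.update(asic_port_group(port))
    remaining = sorted(affected - failed)

    commands = ["configure terminal"]
    for port in remaining:
        commands.append(f"interface fc{slot}/{port}")
        commands.append("shutdown")
    return commands


# Example from this article: fc4/9-12 are in hwFailure, so fc4/13-16 remain.
for line in shutdown_commands(4, [9, 10, 11, 12]):
    print(line)
```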
Step 2 - Replace the affected line card.
Additional Information
Known Affected Releases:
- 6.2(1)
- 6.2(11)
- 6.2(11a)
- 6.2(11b)
- 6.2(11c)
- 6.2(11d)
- 6.2(13)
- 6.2(13a)
- 6.2(3)
- 6.2(5)
- 6.2(7)
- 6.2(9)
- 6.2(9a)
- 6.2(9b)
- 6.2(9c)
Contact Dell Support for a preventative workaround to the issue.