Connectrix B-Series Switch Panic or HA Out of Sync Due to Switch Running Out of Resources
Summary: After a High Availability (HA) failover, the Control Processors (CPs) are not synchronized, and rebooting the standby CP does not resolve the issue.
Symptoms
Impact:
- HA is not synchronized after failover. Rebooting the standby CP does not resolve the issue (see the verification sketch after this list).
- Common Access Layer daemon (CALD) process stops responding (manageability applications use CALD)
- Switch out of resources
- Switch panic
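To confirm the HA symptom, the HA state can be checked from an admin session on the active CP; hashow is the standard Fabric OS command for this (shown here only as a usage sketch):
hashow
On a healthy switch, hashow typically reports the heartbeat as up and the HA state as synchronized; in this condition it continues to report HA out of sync even after the standby CP is rebooted.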
Environment:
- Dell Hardware: Connectrix ED-DCX7-4B
- Dell Hardware: Connectrix ED-DCX7-8B
- Dell Hardware: Connectrix ED-DCX6-4B
- Dell Hardware: Connectrix ED-DCX6-8B
- Dell Hardware: Connectrix ED-8510-8B
- Dell Hardware: Connectrix ED-8510-4B
- Dell Hardware: Connectrix DS-7730B
- Dell Hardware: Connectrix DS-7720B
- Dell Hardware: Connectrix DS-6630B
- Dell Hardware: Connectrix DS-6620B
- Dell Hardware: Connectrix DS-6610B
- Dell Hardware: Connectrix DS-6520B
- Dell Hardware: Connectrix DS-6510B
- Dell Hardware: Connectrix DS-6505B
- Dell Hardware: Connectrix MP-7810
- Dell Software: Secure Connect Gateway
- Dell Software: Secure Remote Services
- Dell Software: CloudIQ
- Brocade Software: Fabric OS 8.x
- Brocade Software: Fabric OS 9.x
Issue:
- The CALD daemon terminates or becomes unavailable, and a switch panic is possible, due to a flood of critical or high-level alerts.
- HA is out of sync if the switch is unable to recover the CALD daemon.
- CloudIQ stops monitoring the switch
Errors:
Errdump: The symptom is a Fabric OS CALD panic:
[KSWD-1002], 36479, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:2395.
[KSWD-1002], 36774, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:3063.
[KSWD-1002], 36855, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:3868.
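These RASlog messages can be reviewed directly on the switch; errdump is the standard Fabric OS command for dumping the error log (a usage sketch, run from an admin session):
errdump
Repeated KSWD-1002 entries reporting the termination of process cald, as shown above, indicate that the software watchdog keeps detecting CALD terminations.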
Examples of PDshow:
^EUnable to handle kernel paging request for unknown fault^M
^EFaulting instruction address: 0x401b4ad8^M
^EOops taken on: 2021-02-04 at 13:57:09:090194^M
^EOops: Kernel access of bad area, sig: 7 [#1]^M
^EPREEMPT ^ESMP NR_CPUS=4 ^ELTT NESTING LEVEL : 0 ^E^M
SWD: SWD:swd_close_proc:Detected termination of cald:2150 (1)
SWD: SWD:swd_close_proc:exit code:11, exit sig:17, parent sig:0
Service instances out of sync
cald: unable to initialize ipc: -11
cal: ASP init failure (-4)
/bin/cat: write error: No space left on device
/bin/cat: write error: No space left on device
HADUMP output:
== State ==
fcsw:0:0(2)    IMG_INCOMP[A:S]  IMG_COMP(1)
fcsw0(M22)     IMG_COMP         IMG_COMP
diagfss(M22)   IMG_COMP         IMG_COMP
fc(M22)        IMG_COMP         IMG_COMP
rt(M22)        IMG_COMP         IMG_COMP
swc(M22)       IMG_COMP         IMG_COMP
web(M22)       IMG_COMP         IMG_COMP
md(M22)        IMG_COMP         IMG_COMP
cal(M22)       IMG_INCOMP       IMG_COMP
ps exfcl output in the supportshow:
The original cald daemon is left in a defunct (zombie) state, which prevents FOS from initializing a replacement CALD daemon because the stale process status indicates CALD is still alive:
0 0 2150 1824 18 0 0 0 - Z ? 25919:54 \_ cald <defunct>
Specific Condition:
Secure Remote Services and/or Secure Connect Gateway is monitoring the switch.
Cause
This issue was observed in FOS 8.2.3c1.
A thread out-of-resource condition occurs as the result of a resource leak in the Secure Remote Services thread in CALD, which is spawned to send the supportshow output to the Secure Remote Services server.
The subsequent failure to restart CALD is caused by a separate defect.
Root cause:
CALD failed to restart because the original daemon entered a defunct (zombie) state. When FOS tried to initialize a new cald daemon, it could not, because the stale process status indicated that CALD was still alive. As a result, FOS was unable to bring the new CALD daemon into a working state.
0 0 2150 1824 18 0 0 0 - Z ? 25919:54 \_ cald <defunct>
SWD: SWD:swd_close_proc:Detected termination of cald:2150 (1)
SWD: SWD:swd_close_proc:exit code:11, exit sig:17, parent sig:0
Service instances out of sync
cald: unable to initialize ipc: -11
cal: ASP init failure (-4)
/bin/cat: write error: No space left on device
/bin/cat: write error: No space left on device
Engineering has backported both fixes into FOS 8.2.3e.
Resolution
Fix:
Upgrade to:
- Fabric OS v8.2.3e or later
- Fabric OS v9.1.1d or later
- Fabric OS v9.2.0b or later
- Fabric OS v9.2.1 or later
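The firmware version running on both CPs can be verified before and after the upgrade; firmwareshow and firmwaredownload are the standard Fabric OS commands (a usage sketch; firmwaredownload run without arguments prompts interactively for the server, path, credentials, and protocol):
firmwareshow
firmwaredownload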
Workaround:
The switch must go through a cold boot to recover and resynchronize the CPs. On the switch, issue the command below and then pull the power cables (see the sequence sketch after the command).
sysshutdown
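A sketch of the full cold-boot sequence (the hashow verification step is an added suggestion, not part of the original procedure):
- sysshutdown (issue on the active CP and wait for the shutdown to complete)
- Pull the power cables, wait briefly, and then reconnect them.
- hashow (after both CPs finish booting, verify that the CPs report an in-sync HA state)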
Closely monitor switches for Critical alerts and promptly address the conditions causing them, or unmonitor the switch from Secure Remote Services or Secure Connect Gateway.
Additional Information
- If there is a secondary CALD process running, the switch must still go through the recovery procedure of attempting an hafailover (preferably in a maintenance window); if HA goes out of sync, the cold reboot is required (see the sketch below).
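A sketch of that recovery attempt, assuming an admin session during a maintenance window (the hashow checks are added verification steps):
- hashow (confirm the standby CP is healthy before failing over)
- hafailover (fail over to the standby CP)
- hashow (verify whether the CPs resynchronize; if HA remains out of sync, perform the cold boot described in the Workaround section)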
Brocade DEFECT FOS-854095