Connectrix B-Series: Switch Panic or HA Out of Sync Due to Switch Running Out of Resources

Summary: After a High Availability (HA) failover, the Control Processors (CPs) are not synchronized, and rebooting the standby CP does not resolve the issue.

Symptoms

Impact:

  • HA is not in sync after failover. Rebooting the standby CP does not resolve the issue.
  • Common Access Layer daemon (CALD) process stops responding (manageability applications use CALD)
  • Switch out of resources
  • Switch panic

Environment:

  • Dell Hardware: Connectrix ED-DCX7-4B
  • Dell Hardware: Connectrix ED-DCX7-8B
  • Dell Hardware: Connectrix ED-DCX6-4B
  • Dell Hardware: Connectrix ED-DCX6-8B
  • Dell Hardware: Connectrix ED-8510-8B
  • Dell Hardware: Connectrix ED-8510-4B
  • Dell Hardware: Connectrix DS-7730B
  • Dell Hardware: Connectrix DS-7720B
  • Dell Hardware: Connectrix DS-6630B
  • Dell Hardware: Connectrix DS-6620B
  • Dell Hardware: Connectrix DS-6610B
  • Dell Hardware: Connectrix DS-6520B
  • Dell Hardware: Connectrix DS-6510B
  • Dell Hardware: Connectrix DS-6505B
  • Dell Hardware: Connectrix MP-7810
  • Dell Software: Secure Connect Gateway
  • Dell Software: Secure Remote Services
  • Dell Software: CloudIQ
  • Brocade Software: Fabric OS 8.x
  • Brocade Software: Fabric OS 9.x

Issue:

  • The CALD daemon terminates or becomes unavailable, and the switch may panic, due to a flood of critical or high-level alerts.
  • HA goes out of sync if the switch is unable to recover the CALD daemon.
  • CloudIQ stops monitoring the switch

Errors:
Errdump: The symptom is a Fabric OS CALD panic, logged as repeated KSWD-1002 messages:

[KSWD-1002], 36479, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:2395.
[KSWD-1002], 36774, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:3063.
[KSWD-1002], 36855, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:3868.
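
As a rough illustration, entries like the ones above can be picked out of a saved errdump capture with a short script. This is a minimal sketch in Python, not a Dell or Brocade tool; the function name find_cald_terminations and the assumption that each entry sits on a single line are illustrative only.

```python
import re

# Matches KSWD-1002 entries that report a cald termination and captures the PID.
CALD_TERMINATION = re.compile(
    r"\[KSWD-1002\].*Detected termination of process cald:(\d+)"
)

def find_cald_terminations(errdump_text):
    """Return the PIDs of cald processes reported as terminated, in log order."""
    pids = []
    for line in errdump_text.splitlines():
        match = CALD_TERMINATION.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids

# Sample lines copied from the errdump excerpt in this article.
sample = """\
[KSWD-1002], 36479, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:2395.
[KSWD-1002], 36774, SLOT 1 | FFDC | CHASSIS, WARNING, SWITCH_A, Detected termination of process cald:3063.
"""
print(find_cald_terminations(sample))
```

More than one PID in the result suggests that CALD was restarted and terminated again, which matches the repeated entries shown above.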

Examples of PDshow:

Unable to handle kernel paging request for unknown fault
Faulting instruction address: 0x401b4ad8
Oops taken on: 2021-02-04 at 13:57:09:090194
Oops: Kernel access of bad area, sig: 7 [#1]
PREEMPT SMP NR_CPUS=4 LTT NESTING LEVEL : 0


SWD: SWD:swd_close_proc:Detected termination of cald:2150 (1)
SWD: SWD:swd_close_proc:exit code:11, exit sig:17, parent sig:0
Service instances out of sync
cald: unable to initialize ipc: -11
cal: ASP init failure (-4)
/bin/cat: write error: No space left on device
/bin/cat: write error: No space left on device
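
The messages above can serve as simple text signatures when reviewing a saved capture. The following Python sketch only does substring matching against strings quoted in this article; the label names and the function match_signatures are hypothetical and not part of Fabric OS.

```python
# Substrings taken from the output quoted in this article.
# The dictionary keys are illustrative labels, not Fabric OS terms.
SIGNATURES = {
    "cald_terminated": "Detected termination of cald",
    "instances_out_of_sync": "Service instances out of sync",
    "cald_ipc_failure": "cald: unable to initialize ipc",
    "no_space_left": "No space left on device",
}

def match_signatures(capture_text):
    """Return the set of signature labels found anywhere in the capture."""
    found = set()
    for line in capture_text.splitlines():
        for label, needle in SIGNATURES.items():
            if needle in line:
                found.add(label)
    return found

sample = """\
SWD: SWD:swd_close_proc:Detected termination of cald:2150 (1)
Service instances out of sync
cald: unable to initialize ipc: -11
/bin/cat: write error: No space left on device
"""
print(sorted(match_signatures(sample)))
```

A match is only a hint that the capture resembles this issue; it does not confirm the root cause.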

HADUMP output:

== State ==
   fcsw:0:0(2) IMG_INCOMP[A:S]    IMG_COMP(1)
     fcsw0(M22)    IMG_COMP    IMG_COMP    
   diagfss(M22)    IMG_COMP    IMG_COMP    
        fc(M22)    IMG_COMP    IMG_COMP    
        rt(M22)    IMG_COMP    IMG_COMP    
       swc(M22)    IMG_COMP    IMG_COMP    
       web(M22)    IMG_COMP    IMG_COMP    
        md(M22)    IMG_COMP    IMG_COMP    
       cal(M22)    IMG_INCOMP    IMG_COMP
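
To spot the out-of-sync service quickly in a longer State table, the rows can be filtered for IMG_INCOMP. This is a minimal Python sketch that assumes the whitespace-separated layout shown above; it does not interpret what each column means, and the function name incompatible_services is illustrative.

```python
def incompatible_services(state_text):
    """Return the names of rows that contain at least one IMG_INCOMP state."""
    flagged = []
    for line in state_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines or lines without state columns
        name, states = fields[0], fields[1:]
        # startswith also catches decorated values such as IMG_INCOMP[A:S].
        if any(state.startswith("IMG_INCOMP") for state in states):
            flagged.append(name)
    return flagged

# Rows copied from the HADUMP excerpt in this article.
sample = """\
== State ==
   fcsw:0:0(2) IMG_INCOMP[A:S]    IMG_COMP(1)
     fcsw0(M22)    IMG_COMP    IMG_COMP
        md(M22)    IMG_COMP    IMG_COMP
       cal(M22)    IMG_INCOMP    IMG_COMP
"""
print(incompatible_services(sample))
```

In the excerpt above, the only component row flagged besides the fcsw summary row is cal, which is consistent with CALD being the service that failed to recover.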

ps exfcl output in the support show:
CALD failed to restart because the original daemon went into a defunct state. When FOS tried to initialize a new CALD daemon, it could not, because the original CALD still had a status indicating that it was alive.

0     0  2150  1824  18   0      0     0 -      Z    ?        25919:54  \_ cald <defunct>
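
A defunct cald entry like the one above can be located in saved ps output with a short script. This Python sketch assumes the column order visible in the sample line (PID in the third field, process state in the tenth, command from the thirteenth onward); that layout and the function name find_defunct_cald are assumptions for illustration.

```python
def find_defunct_cald(ps_text):
    """Return PIDs of cald processes whose state column marks them as zombies."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # too short to be a full process row
        pid, stat, command = fields[2], fields[9], fields[12:]
        # State "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and "cald" in command:
            pids.append(int(pid))
    return pids

# Raw string so the backslash in the process-tree marker is kept as-is.
sample = r"""
0     0  2150  1824  18   0      0     0 -      Z    ?        25919:54  \_ cald <defunct>
"""
print(find_defunct_cald(sample))
```

The PID found this way (2150 in the sample) matches the cald PID reported by SWD in the pdshow output.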

Specific Condition:
Secure Remote Services and/or Secure Connect Gateway is monitoring the switch.

Cause

This issue was seen in FOS 8.2.3c1.
A thread out-of-resource condition occurs as the result of a resource leak in the Secure Remote Support thread in CALD, which is spawned to send the support show output to the Secure Remote Services server.

The subsequent failure to restart CALD is due to a separate defect.

Root cause:
CALD failed to restart because the original daemon went into a defunct state. When FOS tried to initialize a new CALD daemon, it could not, because the original CALD still had a status indicating that it was alive. As a result, FOS was unable to put the new CALD daemon into a working state.

0     0  2150  1824  18   0      0     0 -      Z    ?        25919:54  \_ cald <defunct>
SWD: SWD:swd_close_proc:Detected termination of cald:2150 (1)
SWD: SWD:swd_close_proc:exit code:11, exit sig:17, parent sig:0
Service instances out of sync
cald: unable to initialize ipc: -11
cal: ASP init failure (-4)
/bin/cat: write error: No space left on device
/bin/cat: write error: No space left on device

Engineering backported both fixes into FOS 8.2.3e.

Resolution

Fix:
Upgrade to:

  • Fabric OS v8.2.3e or later
  • Fabric OS v9.1.1d or later
  • Fabric OS v9.2.0b or later
  • Fabric OS v9.2.1 or later
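
The version list above can be turned into a quick check of whether a given Fabric OS string is at or above the listed minimum. This is a minimal Python sketch under stated assumptions: "or later" is read as "later within the same major.minor train", the version format is taken to be digits with an optional letter and build number (as in 8.2.3c1), and trains not listed in this article return None rather than a guess. The function names are illustrative.

```python
import re

# Lowest fixed release per major.minor train, from the list in this article.
# 9.2.1 is not listed separately because it sorts above 9.2.0b.
FIXED_MINIMUMS = ["8.2.3e", "9.1.1d", "9.2.0b"]

def parse_fos(version):
    """Split a string such as 'v8.2.3c1' into a comparable tuple."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)([a-z]?)(\d*)", version.strip().lower())
    if m is None:
        raise ValueError(f"unrecognized FOS version: {version!r}")
    major, minor, patch, letter, build = m.groups()
    return (int(major), int(minor), int(patch), letter, int(build or 0))

def has_fix(version):
    """Return True or False for listed trains, None when the train is not listed."""
    parsed = parse_fos(version)
    for minimum in FIXED_MINIMUMS:
        floor = parse_fos(minimum)
        if parsed[:2] == floor[:2]:  # same major.minor train
            return parsed >= floor
    return None

print(has_fix("8.2.3c1"), has_fix("v9.1.1d"), has_fix("9.0.1"))
```

Confirm the target release against the official release notes before planning an upgrade; the comparison here is only as good as the assumptions listed.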

Workaround:
The switch must go through a cold boot to recover and get the CPs synchronized. On the switch, run the command below, and then pull the power cable.

sysshutdown

Closely monitor switches for critical alerts and promptly address the conditions causing them, or remove the switch from monitoring in Secure Remote Services or Secure Connect Gateway.

Additional Information

  • If a secondary CALD process is running, the switch still has to go through the recovery procedure: attempt hafailover (preferably in a maintenance window), and if HA goes out of sync, a cold reboot is needed.
Brocade DEFECT FOS-853249
Brocade DEFECT FOS-854095

Affected Products

Connectrix B-Series, Secure Connect Gateway, CloudIQ, EMC Secure Remote Services
Article Properties
Article Number: 000220385
Article Type: Solution
Last Modified: 05 Apr 2024
Version:  7