Troubleshooting the Switch Fabric Module


Table of Contents:
  1. Introduction
  2. Determining the Last Power-Down Reason
  3. Troubleshooting Symptoms
  4. Information to Collect if You Open a TAC Case


1. Introduction


This document is primarily for troubleshooting the Switch Fabric Module (SFM) on an E-Series system, but it can also be applied to C-Series SFMs.
In the E-Series, the SFM is a discrete component, called a field replaceable unit (FRU). In the C-Series, the switch fabric is integrated into the RPM. Nevertheless, FTOS commands for managing the SFM, including all those described in this document, except where noted, are useful on the C-Series.
In rare cases, an SFM fails to initialize at bootup or after an upgrade, or it may power down unexpectedly during operation. This document addresses those cases.


2. Determining the Last Power-Down Reason


The system trace function, as shown in the show trace command output, reports when an SFM has been powered down or power-cycled. Look for log messages containing "Found SFM #, last power-cycle reason:", as in the following sample of show trace output.
Force10#show trace 100 | grep SFM
[2/19 13:18:59] RAM-(RpmAvailMgr):Send data sync msg (42) to task 4 SFM Config State ).
[2/19 13:22:47] TSM-(tsm):Receive SFM 7 SFM_DETECT REMOVE event.
[2/19 13:22:47] TSM-(tsm):tsmSfmRemove: Remove SFM 7
[2/19 13:22:47] TSM-(tsm):tsmSfmRemove: SFM 7 is powered off.
[2/19 13:22:48] TSM-(tsm):tsmSfmRemove: SFM 7 is powered on.
[2/19 13:22:49] TSM-(tsm):Set SFM minor alarm
[2/19 13:22:49] TSM-(tsm):tsmSfmRemove:8: SW FAB is good after removing SFM 7
[2/19 13:22:50] TSM-(tsm):Receive SFM 7 SFM_DETECT INSERT event.
[2/19 13:22:50] TSM-(tsm):SFM 7 is reset with SFM Card insert event, bring up the card
[2/19 13:22:50] TSM-(tsm):Found SFM 7, last power-cycle reason: power on with cause of DEFAULT
[2/19 13:22:50] TSM-(tsm):TSM initilizes SFM 7...
[2/19 13:22:51] ****** ERROR CHMGR-(chmgr):SFM 7 not present or bad slot id
[2/19 13:22:52] TSM-(tsm):Clear SFM minor alarm
[2/19 13:22:52] TSM-(tsm):tsmSfmAdd:8: LC is in service, no PP test. SFM 7 standby. numSfmFound = 9
[2/19 13:22:52] TSM-(tsm):Receive SFM 7 RESET_DETECT ASSERT event.
[2/19 13:22:52] TSM-(tsm):SFM 7 reset is cleared, no action

Generally, the system trace will display three reasons for an SFM reset:
  1. remote-power-off – Reported most often since the SFM is powered off and on when the system reboots, both prior to rebooting and again at system initialization. A "remote-power-off" reason also is reported when the reset sfm slot number command is issued, as this command actually power-cycles the SFM.
    Note: This command is only available in FTOS 6.5.4.0 and later, and on the E-Series.
  2. card-removed – If you remove and then reinsert an SFM, the show trace output will report card-removed as the last power-cycle reason. This status is also reported when the software cannot read certain information over an internal bus and interprets this state as the SFM being removed.
  3. spurious reset
In addition, if you remotely reset the standby card from the CLI, the trace will display a reason of "remote reset".
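
As a rough illustration of the lookup described above, the following Python sketch extracts the last power-cycle reason from saved show trace output. The regular expression is based only on the sample line shown in this document, and the function name power_cycle_reasons is hypothetical; other FTOS releases may format the line differently.

```python
import re

# Pattern inferred from the sample trace line in this document (an assumption):
# [2/19 13:22:50] TSM-(tsm):Found SFM 7, last power-cycle reason: power on with cause of DEFAULT
REASON_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+.*Found SFM (?P<slot>\d+), "
    r"last power-cycle reason:\s*(?P<reason>.+?)\s*$"
)

def power_cycle_reasons(trace_text):
    """Return (timestamp, slot, reason) tuples found in show trace output."""
    results = []
    for line in trace_text.splitlines():
        match = REASON_RE.match(line)
        if match:
            results.append((match["stamp"], int(match["slot"]), match["reason"]))
    return results

# Lines copied from the sample output above.
sample = """\
[2/19 13:22:50] TSM-(tsm):SFM 7 is reset with SFM Card insert event, bring up the card
[2/19 13:22:50] TSM-(tsm):Found SFM 7, last power-cycle reason: power on with cause of DEFAULT
[2/19 13:22:50] TSM-(tsm):TSM initilizes SFM 7...
"""

for stamp, slot, reason in power_cycle_reasons(sample):
    print(f"{stamp}: SFM {slot} -> {reason}")
```

Run against a full capture, this gives a compact per-slot history that can be compared with the reset reasons listed above.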


3. Troubleshooting Symptoms


The FTOS Chassis Manager (CHMGR) process monitors the health and status of the SFM. When the process detects a problem with the SFM, RPM0 reports a minor alarm and resets the card in an attempt to restore the SFM. The TSM process reports that an SFM has been found, and the minor alarm condition is cleared.
When the RPM reports "No working standby SFM", the switch is running without a standby SFM. One possible reason is that an SFM in a particular slot is not yet online after a reset. Once that SFM comes online, the minor alarm is cleared, the chassis manager detects the new SFM, and, depending on the chassis and the number of SFMs, the "Found X SFMs" message is displayed.
In general, to troubleshoot a problem with the SFM, start by capturing the following output:
  • show trace
  • show logging

    Dec 30 11:12:20 PST: %RPM0:CP %CHMGR-2-MINOR_SFM: Minor alarm: No working standby SFM
    Dec 30 11:12:20 PST: %RPM0:CP %TSM-2-SFM_RESET_PRESENT: SFM 2 reset unexpectedly
    Dec 30 11:12:22 PST: %RPM0:CP %TSM-6-SFM_DISCOVERY: Found SFM 2
    Dec 30 11:12:23 PST: %RPM0:CP %CHMGR-5-MINOR_SFM_CLR: Minor alarm cleared: Working standby SFM present
    Dec 30 11:12:23 PST: %RPM0:CP %TSM-6-SFM_DISCOVERY: Found 9 SFMs
  • show sfm all
If an SFM flaps or cycles through the minor alarm condition, the system may not be getting sufficient power. Under this condition, the system brings down the SFM first. Each SFM is configured with a voltage threshold, and that value determines which SFM goes down first. The SFM continues to flap until the voltage to the system stabilizes. To determine whether there is sufficient power, physically check whether any Valere power rectifiers are experiencing a brick failure. See also the separate document, Troubleshooting Low Power Conditions.
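
One way to spot a flapping SFM in saved show logging output is to count the unexpected-reset messages per slot. The Python sketch below assumes the message format shown in the sample above. The function name flapping_sfms and the min_resets cutoff of 3 are hypothetical choices, not FTOS values, and the sample input is synthetic (the one message from this document repeated).

```python
import re
from collections import Counter

# Message format taken from the show logging sample in this document.
RESET_RE = re.compile(r"%TSM-2-SFM_RESET_PRESENT: SFM (\d+) reset unexpectedly")

def flapping_sfms(log_text, min_resets=3):
    """Return {slot: reset_count} for SFMs with at least min_resets resets.

    min_resets is an arbitrary illustrative cutoff, not a vendor threshold.
    """
    counts = Counter(int(slot) for slot in RESET_RE.findall(log_text))
    return {slot: n for slot, n in counts.items() if n >= min_resets}

# Synthetic input: the single sample message repeated three times.
line = "Dec 30 11:12:20 PST: %RPM0:CP %TSM-2-SFM_RESET_PRESENT: SFM 2 reset unexpectedly"
sample = "\n".join([line] * 3)

print(flapping_sfms(sample))
```

A slot that appears in the result is a candidate for the power checks described above.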
The following sections explain how to troubleshoot specific errors on the SFM.

General Access Errors
There are two types of SFM general access errors:
  • "m" - MDIO error
  • "I" - I2C access error
These access errors typically point to a hardware issue.

To determine whether your SFM is experiencing a general access error, look for a relevant syslog message, such as "SFM 3 found general access error."
Feb 19 04:44:02: %RPM0:CP %TSM-6-SFM_SWITCHFAB_STATE: Switch Fabric: DOWN
Feb 19 04:44:02: %RPM0:CP %TSM-2-SFM_GENERAL_ACCESS_M: SFM 3 found general access error (type m) 
Feb 19 04:44:05: %RPM0:CP %TSM-6-SFM_DISCOVERY: Found SFM 3 
Feb 19 04:44:06: %RPM0:CP %TSM-6-SFM_SWITCHFAB_STATE: Switch Fabric: UP 
Feb 19 04:44:36: %RPM0:CP %TSM-6-SFM_SWITCHFAB_STATE: Switch Fabric: DOWN 
Feb 19 04:44:37: %RPM0:CP %CHMGR-0-MAJOR_SFM: Major alarm: Switch fabric down 
Feb 19 04:44:38: %RPM0:CP %TSM-2-SFM_UNDER_VOLT: SFM 3 powered off due to under voltage
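
The following Python sketch shows one possible way to pull general access errors out of saved syslog text and label them by type. It assumes the "(type m)" wording from the sample above and further assumes that I2C errors are logged analogously as "(type I)", which this document does not show. The name access_errors is hypothetical.

```python
import re

# "(type m)" appears in the sample above; "(type I)" is an assumed analog.
ACCESS_RE = re.compile(r"SFM (\d+) found general access error \(type ([mI])\)")
ERROR_TYPES = {"m": "MDIO error", "I": "I2C access error"}

def access_errors(log_text):
    """Return (slot, error description) pairs found in syslog text."""
    return [(int(slot), ERROR_TYPES[kind]) for slot, kind in ACCESS_RE.findall(log_text)]

# Line copied from the sample output above.
sample = (
    "Feb 19 04:44:02: %RPM0:CP %TSM-2-SFM_GENERAL_ACCESS_M: "
    "SFM 3 found general access error (type m)"
)

print(access_errors(sample))
```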

SFM Simba PSI Access Error

A "Simba PSI" error on the SFM generally points to a hardware issue. (Simba refers to a hardware chip on the SFM.)
  • show trace Output
[6/4 2:13:13] TSM-(tsm):Receive SFM 1 ERR_DETECT event 
[6/4 2:13:13] TSM-(tsm):tsmSfmRemove: Remove SFM 1 
[6/4 2:13:13] TSM-(tsm):tsmSfmRemove: SFM 1 is powered off. 
[6/4 2:13:13] POLLER-(PM):doSfmSaSanErr: eventId=17, slotId=1, state=1, value[0]=0x1fd, value[1]=0x0 
[6/4 2:13:14] TSM-(tsm):tsmSfmRemove: SFM 1 is powered on. 
[6/4 2:13:14] CHMGR-(chmgr):add min alrm 12 UNKNOWN 0 0 
[6/4 2:13:14] CHMGR-(tsm):0x1382 log alrm 12 to chmgr (rc=84) 
[6/4 2:13:14] TSM-(tsm):Set SFM minor alarm 
[6/4 2:13:14] TSM-(tsm):Change SW FAB state from SW_FAB_UP_9 to SW_FAB_UP_8
!--- The EtherScale supports one SFM in standby mode. The TeraScale requires all 9 SFMs to be operationally active.
[5/4 2:13:14] ***** WARNING TSM-(tsm):Turn off SFM 1 active LED fail.
[5/4 2:13:14] ***** WARNING TSM-(tsm):Turn on SFM 1 Status LED Amber fail.
!--- During a failure, check the Status LED.
[5/4 2:13:15] ****** ERROR TSM-(tsm):tsmIsSfmPowerOn: f10SysRpmSfmCardInfoGet() failed for SFM 1 power status
[5/4 2:13:15] ****** ERROR TSM-(tsm):CheckSFMCardPower: tsmIsSfmPowerOn() failed for SFM 1 power status
[5/4 2:13:15] ****** ERROR TSM-(tsm):tsmHandleSfmError: Different error detected on SFM 1 (erro = 262163). SFM already in SFM_ERROR state
[6/4 2:13:15] TSM-(tsm):SFM 1 ERR_DETECT event is confirmed
[6/4 2:13:15] TSM-(tsm):Receive SFM 1 SIMAB_DETECT event
[5/4 2:13:15] ****** ERROR TSM-(tsm):tsmIsSFMReset: SFM 1 is not accessible via scratch pad (SFM_FAITH_CR = 0)
[6/4 2:13:15] TSM-(tsm):tsmSfmRemove: Remove SFM 1 
[6/4 2:13:15] TSM-(tsm):tsmSfmRemove: SFM 1 is powered off. 
[6/4 2:13:16] TSM-(tsm):tsmSfmRemove: SFM 1 is powered on. 
[5/4 2:13:16] ***** WARNING TSM-(tsm):Turn off SFM 1 active LED fail. 
[5/4 2:13:16] ***** WARNING TSM-(tsm):Turn on SFM 1 Status LED Amber fail. 
[5/4 2:13:17] ****** ERROR TSM-(tsm):tsmIsSfmPowerOn: f10SysRpmSfmCardInfoGet() failed for SFM 1 power status
  • show sfm all
Force10#sh sfm all 
Switch Fabric State: up 
-- Switch Fabric Modules --
Slot Status
--------------------------------------------------------------------------- 
0 card problem (SFM Simba PSI access error) 
1 active 
2 active 
3 active 
4 active 
5 active 
6 active 
7 active 
8 active 
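
To scan show sfm all output for slots that need attention, a small parser along the following lines may help. This Python sketch assumes each slot row begins with the slot number followed by the status text, as in the sample above. It treats "active" and "standby" as healthy, which is an assumption, and the names sfm_status and problem_slots are hypothetical.

```python
import re

# A slot row is assumed to look like "<slot number> <status text>".
SLOT_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")

# Statuses treated as healthy here (an assumption for illustration).
HEALTHY = {"active", "standby"}

def sfm_status(output):
    """Map slot number -> status string from show sfm all output."""
    status = {}
    for line in output.splitlines():
        match = SLOT_RE.match(line)
        if match:
            status[int(match.group(1))] = match.group(2)
    return status

def problem_slots(output):
    """Return only the slots whose status is not considered healthy."""
    return {slot: s for slot, s in sfm_status(output).items() if s not in HEALTHY}

# Abbreviated copy of the sample output above.
sample = """\
Switch Fabric State: up
-- Switch Fabric Modules --
Slot Status
---------------------------------------------------------------------------
0 card problem (SFM Simba PSI access error)
1 active
2 active
"""

print(problem_slots(sample))
```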


“SFM failed SW FAB portpipe diags”

Typically, this status points to a hardware issue. Contact Force10 Networks TAC for troubleshooting assistance before requesting an RMA.

Force10#show chassis brief

Chassis Type : E300

Chassis Mode : TeraScale

Chassis Epoch : 10.4 micro-seconds

-- Line cards --

Slot Status NxtBoot ReqTyp CurTyp Version Ports
---------------------------------------------------------------------------
0 online online EX1YE3 EX1YE3 5.3.1.2b 1
1 online online EX1YE3 EX1YE3 5.3.1.2b 1
2 online online EX1YE3 EX1YE3 5.3.1.2b 1
3 online online EX1YE3 EX1YE3 5.3.1.2b 1
4 online online E12PE3 E12PE3 5.3.1.2b 12
5 not present

-- Route Processor Modules --

Slot Status NxtBoot Version
---------------------------------------------------------------------------
0 active online 5.3.1.2b
1 not present

Switch Fabric State: up

-- Switch Fabric Modules --

Slot Status
---------------------------------------------------------------------------
0 SW FAB diags failed (Multiple SFMs failed SW FAB portpipe diags)
1 active

[output omitted]

SFM Is Shut Down Due to High Temperature

A major alarm is reported under several conditions. One such condition is exceeding the SFM safe operating temperature, as detected by environmental-monitoring hardware and software. The show environment command may capture the high temperature condition in addition to the error messages:

Feb 27 04:52:16 UTC: %RPM0:CP %CHMGR-2-TEMP_SHUTDOWN_WARN: WARNING! SFM 6 temperature is 85C; approaching shutdown threshold of 80C)

Feb 27 04:52:16 UTC: %RPM0:CP %CHMGR-2-MAJOR_TEMP: Major alarm: chassis temperature high (SFM temperature reaches or exceeds threshold of 75C)

Feb 27 04:52:21 UTC: %RPM0:CP %CHMGR-2-MAJOR_TEMP_CLR: Major alarm cleared: chassis temperature lower (SFM 6 temperature is within threshold of 70C)

When this condition occurs, either the SFM genuinely is too hot, or a sensor has malfunctioned. If the directly adjacent SFMs are at normal temperature, suspect a faulty sensor. If the directly adjacent SFMs are also above normal temperature, suspect a genuine overheating condition.
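
The neighbor comparison above can be sketched in Python as follows. This is only an illustration under stated assumptions: adjacency is taken to mean slot numbers one lower and one higher, the normal_max cutoff of 60C is a hypothetical value, any hot neighbor is treated as evidence of genuine overheating, and the readings in the example are invented. The name diagnose_hot_sfm is hypothetical.

```python
def diagnose_hot_sfm(temps, slot, normal_max=60):
    """Suggest a likely cause for a high-temperature report on one SFM.

    temps maps slot number -> temperature in degrees C.
    normal_max is a hypothetical cutoff for "normal", not an FTOS value.
    """
    # Assumption: physically adjacent SFMs have adjacent slot numbers.
    neighbors = [temps[s] for s in (slot - 1, slot + 1) if s in temps]
    if not neighbors:
        return "no adjacent readings"
    if all(t <= normal_max for t in neighbors):
        return "suspect faulty sensor"
    return "suspect genuine overheating"

# Invented readings for illustration only.
print(diagnose_hot_sfm({5: 33, 6: 85, 7: 35}, 6))
print(diagnose_hot_sfm({5: 70, 6: 85, 7: 72}, 6))
```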

When the system detects a genuine over-temperature condition, it powers off the SFM until it cools down and until software determines it’s safe to re-power. Upon re-power, the SFM reset reason will be reported as "over-temperature" by the hardware. If software detects the over-temperature event and manually shuts down the SFM, the system will report an SFM reset reason of "remote power-off".

To view the programmed alarm threshold levels, execute the show alarms threshold command:

E600-TAC-3#show alarms threshold

-- Temperature Limits (deg C) --
-----------------------------------------------------------
           Minor  Minor Off  Major  Major Off  Shutdown
Linecard   75     70         80     77         85
RPM        65     60         75     70         80
SFM        65     60         75     70         80
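
For offline analysis, the thresholds above can be encoded in a lookup table and used to classify a temperature reading. The Python sketch below copies the values from the sample output. It treats a reading equal to a threshold as having reached it, which matches the "reaches or exceeds" wording of the major alarm message but is an assumption for the other levels. It ignores the "Off" values, which appear to be the levels at which an alarm clears. The name alarm_level is hypothetical.

```python
# Values copied from the show alarms threshold sample (deg C).
THRESHOLDS = {
    "Linecard": {"minor": 75, "minor_off": 70, "major": 80, "major_off": 77, "shutdown": 85},
    "RPM": {"minor": 65, "minor_off": 60, "major": 75, "major_off": 70, "shutdown": 80},
    "SFM": {"minor": 65, "minor_off": 60, "major": 75, "major_off": 70, "shutdown": 80},
}

def alarm_level(card_type, temp_c):
    """Classify a reading as normal, minor, major, or shutdown.

    Simplification: the *_off (alarm clear) values are not modeled, so this
    describes a rising temperature only.
    """
    limits = THRESHOLDS[card_type]
    if temp_c >= limits["shutdown"]:
        return "shutdown"
    if temp_c >= limits["major"]:
        return "major"
    if temp_c >= limits["minor"]:
        return "minor"
    return "normal"

print(alarm_level("SFM", 76))
print(alarm_level("Linecard", 76))
```

The same reading lands in different alarm levels for different card types, which is why the card type is a required argument.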

Use the following steps to troubleshoot this condition:

  1. Verify that a face plate is covering all slots without a line card. Without such plates, a high-temperature condition can occur within five minutes. Spare blanks are available from Force10 Networks.
  2. Ensure that the chassis is not placed on the floor.
  3. Verify that there are sufficient cooling tiles close to the chassis.
  4. If a faulty sensor is suspected, reset the SFM remotely with the reset sfm slot number command. If the temperature really is high, the SFM will probably not power on. In that case, pull the SFM out just a few inches, so that the card no longer connects to the backplane but still allows proper airflow for the rest of the chassis.
    NOTE: This command is only available in FTOS 6.5.4.0 and later, and on the E-Series.
    NOTE: Exercise care when removing the SFM; at 85C, it could be hot to the touch.
Resetting the active SFM with the reset sfm command can disrupt traffic, so the system prompts for confirmation:
Force10#reset sfm 0 
SFM 0 is active. Resetting it might temporarily impact traffic. 
Proceed with reset? Confirm [yes/no]:

SFM Is Powered Off Due to Under-Voltage Condition

In the case of a power sag, the SFM typically powers off first. See the separate document, Troubleshooting Low Power Conditions, for more details.
Force10>show sfm 3 
Switch Fabric State: up 
-- SFM card 3 -- 
Status : power off - SFM powered off due to under-voltage 
Card Type : SFM - Switch Fabric Module 
Up Time : 0 sec 
Temperature : 33C 
Power Status : PEM0: up PEM1: up 
Serial Number : 0012632 
Part Number : 7520003706 Rev A 
Vendor Id : 01 
Date Code : 01442003
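
When checking many cards, it can help to turn show sfm output into key/value pairs and test the Status field. The Python sketch below assumes the "Field : value" layout of the sample above, and the names parse_sfm_detail and is_under_voltage are hypothetical.

```python
def parse_sfm_detail(output):
    """Parse "Field : value" lines from show sfm <slot> output into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first " : " only, so values such as
        # "PEM0: up PEM1: up" stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def is_under_voltage(fields):
    """True if the Status field reports an under-voltage power-off."""
    return "under-voltage" in fields.get("Status", "")

# Abbreviated copy of the sample output above.
sample = """\
Switch Fabric State: up
-- SFM card 3 --
Status : power off - SFM powered off due to under-voltage
Card Type : SFM - Switch Fabric Module
Temperature : 33C
Power Status : PEM0: up PEM1: up
"""

fields = parse_sfm_detail(sample)
print(fields["Status"])
print(is_under_voltage(fields))
```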

4. Information to Collect if You Open a TAC Case


The level of information provided to Force10 Networks’ Technical Assistance Center (TAC) determines the troubleshooting detail that TAC can provide. With limited information, TAC normally recommends reseating an SFM reported in an error message and closely monitoring the SFM. If the SFM fails again, contact TAC to request further troubleshooting assistance. Please use the Create Service Request form on the isupport page and include the following information if available:
  • Console captures showing the error messages
  • Console captures showing the troubleshooting steps taken and the boot sequence during each step
  • Messages saved to a syslog server, if one is used
  • Output from the show trace command
  • Output from the show tech-support command

 


Quick Tips content is self-published by the Dell Support Professionals who resolve issues daily. In order to achieve a speedy publication, Quick Tips may represent only partial solutions or work-arounds that are still in development or pending further proof of successfully resolving an issue. As such Quick Tips have not been reviewed, validated or approved by Dell and should be used with appropriate caution. Dell shall not be liable for any loss, including but not limited to loss of data, loss of profit or loss of revenue, which customers may incur by following any procedure or advice set out in the Quick Tips.

Article ID: SLN286972

Last Date Modified: 02/20/2014 12:00 AM

