ECS: The system has detected a switch issue
Summary: What to check when you receive an email alert reporting that the system has detected a switch issue.
Instructions
If the switch reported in the alert is a default Dell switch that has been replaced with a custom switch, respond to the form in the email stating that assistance is required to filter the replaced switch out of xDoctor alerting.
Gen2 default switches are Turtle, Rabbit, and Hare.
Gen3 default switches are Rabbit, Hare, Fox, and Hound.
Otherwise, proceed with the following four checks.
-
Attempt to ping the switch reported in the alert. The ping should succeed. In the example below, however, the ping fails.
admin@node1:~> ping -c 1 rabbit.rack
PING rabbit.rack (xxx.xxx.xxx.xxx) 56(84) bytes of data.
From provo.rack (xxx.xxx.xxx.xxx) icmp_seq=1 Destination Host Unreachable
--- rabbit.rack ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
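When the ping output is captured to a file or collected by a script, the pass/fail decision can be read from the statistics line. The following is a minimal sketch, assuming the summary line format shown in the example above; the function name `ping_succeeded` is hypothetical and not part of any ECS tooling.

```python
import re


def ping_succeeded(output: str) -> bool:
    """Return True if the ping summary reports at least one reply received."""
    match = re.search(r"(\d+) packets transmitted, (\d+) received", output)
    if match is None:
        # No summary line found: treat this as a failed check.
        return False
    return int(match.group(2)) > 0


# Summary lines copied from the failing example in this article.
sample = (
    "--- rabbit.rack ping statistics ---\n"
    "1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms\n"
)
print(ping_succeeded(sample))  # False
```

A result of False corresponds to the "Ping does not work" failure state.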
-
Attempt to ssh to the switch reported in the alert. If ssh works, a password prompt appears. In the example below, however, ssh fails.
admin@node1:~> ssh rabbit.rack
ssh: connect to host rabbit.rack port 22: No route to host
-
Check that the switch appears as a neighbor in the Link Layer Discovery Protocol (LLDP) output.
Assuming there are no custom switches:
A Gen2 system should have Turtle, Rabbit, and Hare switches.
A Gen3 system should have Rabbit, Hare, Fox, and Hound switches.
The example below is from a Gen2 system where the rabbit switch is missing.
admin@node1:~> sudo lldpcli show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    private, via: LLDP, RID: 1, Time: 35 days, 16:09:52
  Chassis:
    ChassisID:    mac xx:xx:xx:xx:xx:xx
    SysName:      turtle
    SysDescr:     Arista Networks EOS version 4.15.6M running on an Arista Networks DCS-7048T-A
    MgmtIP:       xxx.xxx.xxx.xxx
    Capability:   Bridge, on
    Capability:   Router, off
  Port:
    PortID:       ifname Ethernet1
    PortDescr:    Nile Node01 (Data)
    TTL:          120
-------------------------------------------------------------------------------
Interface:    slave-1, via: LLDP, RID: 2, Time: 35 days, 16:09:48
  Chassis:
    ChassisID:    mac xx:xx:xx:xx:xx:xx
    SysName:      hare
    SysDescr:     Arista Networks EOS version 4.16.6M running on an Arista Networks DCS-7150S-24
    MgmtIP:       xxx.xxx.xxx.xxx
    Capability:   Bridge, on
    Capability:   Router, off
  Port:
    PortID:       ifname Ethernet9
    PortDescr:    MLAG group 1
    TTL:          120
-------------------------------------------------------------------------------
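The comparison between the expected default switches and the LLDP neighbors can also be done programmatically. The sketch below assumes there are no custom switches and that each neighbor is identified by a `SysName:` field as in the output above; the function name `missing_switches` and the `gen2`/`gen3` keys are hypothetical names chosen for illustration.

```python
import re

# Default switch names per generation, as listed in this article.
DEFAULT_SWITCHES = {
    "gen2": {"turtle", "rabbit", "hare"},
    "gen3": {"rabbit", "hare", "fox", "hound"},
}


def missing_switches(lldp_output: str, generation: str) -> set:
    """Return the default switches that do not appear as LLDP neighbors."""
    seen = {
        # Assumption: a domain suffix, if present, is not part of the name.
        name.split(".")[0].lower()
        for name in re.findall(r"SysName:\s*(\S+)", lldp_output)
    }
    return DEFAULT_SWITCHES[generation] - seen


# SysName fields taken from the Gen2 example in this article.
sample = "SysName: turtle\nSysName: hare\n"
print(sorted(missing_switches(sample, "gen2")))  # ['rabbit']
```

Any name in the returned set corresponds to the "switch is missing from LLDP" failure state.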
-
On Gen2 systems, turtle is the management switch. If it is possible to ssh to turtle, check the connection status of the rabbit and hare switches by running the following three commands.
# ssh turtle.rack
# en
# show interfaces status | grep Mgmt
Both switches should be marked as connected. In the example below, however, one of the connections is marked as notconnect.
admin@node1:~> ssh turtle.rack
Password:
Last login: Wed Nov 27 23:08:48 2019 from xxx.xxx.xxx.xxx
turtle>en
turtle#show interfaces status | grep Mgmt
Et49 Mgmt Port-Secondary 10Ge switch connected 2 a-full a-1G 1000BASE-T
Et50 Mgmt Port-Primary 10Gbe switch notconnect 2 auto auto 1000BASE-T
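If the turtle output is saved as text, the ports reporting notconnect can be picked out with a few lines of code. This is a minimal sketch, assuming one interface per line with the port name as the first field, as in the example above; `not_connected_ports` is a hypothetical helper name.

```python
def not_connected_ports(status_output: str) -> list:
    """Return the port names of lines whose status field is 'notconnect'."""
    ports = []
    for line in status_output.splitlines():
        fields = line.split()
        # Match the exact field so that 'connected' lines are never selected.
        if "notconnect" in fields:
            ports.append(fields[0])
    return ports


# Interface lines copied from the Gen2 example in this article.
sample = (
    "Et49 Mgmt Port-Secondary 10Ge switch connected 2 a-full a-1G 1000BASE-T\n"
    "Et50 Mgmt Port-Primary 10Gbe switch notconnect 2 auto auto 1000BASE-T\n"
)
print(not_connected_ports(sample))  # ['Et50']
```

A non-empty list corresponds to the management switch reporting a notconnect connection.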
-
On Gen3 systems, fox and hound are both management switches, but fox handles the management links to rabbit and hare. If it is possible to ssh to fox, check the connection status of the rabbit and hare switches by running the following two commands.
# ssh fox.rack
# show interface status | grep MGMT
Both switches should be marked as up. In the example below, however, the hare connection is down.
admin@node1:~> ssh fox.rack
fox# show interface status | grep MGMT
Eth 1/1/33 Rabbit MGMT up 1000M full A 2 -
Eth 1/1/35 Hare MGMT down 0 full A 2 -
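The fox output can be reduced to a per-switch link state in a similar way. The sketch below assumes each management link line contains the switch name followed by `MGMT` and then `up` or `down`, as in the example above; `mgmt_link_states` is a hypothetical helper name.

```python
import re


def mgmt_link_states(status_output: str) -> dict:
    """Map each switch name (lowercased) to its MGMT link state."""
    pairs = re.findall(r"(\S+)\s+MGMT\s+(up|down)\b", status_output)
    return {name.lower(): state for name, state in pairs}


# Interface lines copied from the Gen3 example in this article.
sample = (
    "Eth 1/1/33 Rabbit MGMT up 1000M full A 2 -\n"
    "Eth 1/1/35 Hare MGMT down 0 full A 2 -\n"
)
print(mgmt_link_states(sample))  # {'rabbit': 'up', 'hare': 'down'}
```

Any switch mapped to `down` corresponds to the management switch reporting a down connection.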
-
If any of the above checks fails, respond to the form in the email stating that assistance is required, and include the outputs gathered above.
Failure states for these checks are:
- Ping does not work.
- ssh does not work.
- The switch is missing from LLDP.
- Management switch reports a notconnect/down connection.
If all checks pass, the alert may be false or may have been caused by an event such as expected site maintenance. If the alert repeats and all checks still pass, respond to the form in the email stating that assistance is required with an intermittent switch alert.
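The decision logic described above can be summarized in code. This is only an illustrative sketch of the article's rules: the function name `next_action`, its parameters, and the returned wording are hypothetical, and the four boolean inputs are assumed to come from performing the checks manually.

```python
def next_action(ping_ok: bool, ssh_ok: bool, in_lldp: bool,
                mgmt_link_ok: bool, alert_repeats: bool = False) -> str:
    """Return the follow-up suggested by this article for the check results."""
    failures = []
    if not ping_ok:
        failures.append("ping does not work")
    if not ssh_ok:
        failures.append("ssh does not work")
    if not in_lldp:
        failures.append("switch is missing from LLDP")
    if not mgmt_link_ok:
        failures.append("management switch reports a notconnect/down connection")

    if failures:
        # Any failed check: request assistance and attach the gathered outputs.
        return "request assistance with outputs: " + "; ".join(failures)
    if alert_repeats:
        return "request assistance with an intermittent switch alert"
    return "possible false alert or expected site maintenance"


# All four checks failed, as in the rabbit example earlier in this article.
print(next_action(ping_ok=False, ssh_ok=False, in_lldp=False, mgmt_link_ok=False))
```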