12 Posts

August 24th, 2019 00:00

How to replace a switch in a VLT setup?

I have two S4048-ON (called r12-core1 and r12-core2) with DNOS Firmware 9.10(0.0) in a VLT setup.

Some background info:

All traffic on r12-core1 has stopped. On the physical switch, none of the port LEDs blink; every connected port shows a solid light.
The ssh login that normally works now fails with a timeout (network is unreachable).

I think r12-core1 might have hit the clock signal bug described here:

<ADMIN NOTE: Broken link has been removed from this post by Dell> (By the way, has anybody else experienced the clock signal bug?)

That seems plausible given this uptime (show system):
Up Time : 2 yr, 51 wk, 2 day, 19 hr, 4 min
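
To compare uptimes across switches, the "Up Time" line can be converted to a single number. This is only a minimal sketch: the function name is made up for illustration, and it assumes a year is counted as 365 days and a week as 7 days, which may not match how the firmware counts.

```python
import re

# Seconds per unit as printed in the "Up Time" line.
# Assumption: 1 yr = 365 days, 1 wk = 7 days (leap days ignored).
UNIT_SECONDS = {
    "yr": 365 * 86400,
    "wk": 7 * 86400,
    "day": 86400,
    "hr": 3600,
    "min": 60,
}


def uptime_seconds(line):
    """Convert a line like 'Up Time : 2 yr, 51 wk, ...' to seconds."""
    # Everything after the first colon holds the value/unit pairs.
    value = line.split(":", 1)[1]
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*(yr|wk|day|hr|min)", value):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


line = "Up Time : 2 yr, 51 wk, 2 day, 19 hr, 4 min"
print(f"{uptime_seconds(line) / 86400:.1f} days")
```

With the line above this comes out at roughly 1090 days of continuous uptime.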

On r12-core2, "show vlt brief" outputs:
VLT Domain Brief
------------------
Domain ID: 1
Role: Primary
Role Priority: 8192
ICL Link Status: Down
HeartBeat Status: Down
VLT Peer Status: Down
Local Unit Id: 1
Version: 6(7)
Local System MAC address: 34:17:eb:fc:c1:c4
Remote System MAC address: 00:00:00:00:00:00
Configured System MAC address: 02:00:00:00:00:01
Remote system version: 6(7)
Delay-Restore timer: 90 seconds
Delay-Restore Abort Threshold: 60 seconds
Peer-Routing : Disabled
Peer-Routing-Timeout timer: 0 seconds
Multicast peer-routing timeout: 150 seconds
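
For anyone who wants to check this output from a script, here is a rough sketch that turns the "show vlt brief" text into key/value pairs and lists the status fields that are not up. The function names are hypothetical, and it assumes a healthy field reads "Up", which is not shown in the output above.

```python
SAMPLE = """\
VLT Domain Brief
------------------
Domain ID: 1
Role: Primary
ICL Link Status: Down
HeartBeat Status: Down
VLT Peer Status: Down
Local System MAC address: 34:17:eb:fc:c1:c4
Peer-Routing : Disabled
"""

# Status fields taken from the output above.
STATUS_FIELDS = ("ICL Link Status", "HeartBeat Status", "VLT Peer Status")


def parse_vlt_brief(text):
    """Split each 'key: value' line on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def vlt_problems(fields):
    """Return the status fields that are missing or not 'Up'.

    Assumption: a healthy field is reported as 'Up'.
    """
    return [name for name in STATUS_FIELDS if fields.get(name) != "Up"]


fields = parse_vlt_brief(SAMPLE)
for name in vlt_problems(fields):
    print(f"{name}: {fields.get(name, 'missing')}")
```

Splitting on the first colon only keeps the MAC address values intact.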

Options to get VLT back in shape:

1. Power cycle r12-core1 (the non-responsive switch) and hope it did not hit the clock signal bug.
1.1 Then update the firmware to the latest version and hope a fix prevents this situation from happening again.

2. Replace r12-core1 with another working switch running the same firmware and configuration.

Now my question is: how do I correctly replace a switch in a VLT configuration?

I have a replacement switch for r12-core1, and have loaded the same firmware version and restored the configuration from r12-core1 on it (so it's identical in firmware and configuration).
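
To double-check that the two configurations really are identical, the saved config text from both switches can be diffed. This is just a sketch using Python's difflib: the two config snippets are made-up placeholders, not my real configuration, and skipping lines that start with "!" is my assumption about which lines are only separators or comments.

```python
import difflib


def config_diff(old, new):
    """Return unified-diff lines between two config texts.

    Assumption: blank lines and lines starting with '!' carry no
    configuration and can be ignored.
    """
    def meaningful(text):
        return [
            line.rstrip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("!")
        ]

    return list(
        difflib.unified_diff(
            meaningful(old), meaningful(new),
            "failed", "replacement", lineterm="",
        )
    )


# Placeholder snippets for illustration only.
FAILED = "hostname r12-core1\n!\nvlt domain 1\n unit-id 0\n"
REPLACEMENT = "hostname r12-core1\n!\nvlt domain 1\n unit-id 1\n"

diff = config_diff(FAILED, REPLACEMENT)
print("\n".join(diff) if diff else "configurations match")
```

An empty result means the two texts match after the filtering.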

Can I just move the network cables over to the new r12-core1, power it on, and expect VLT to return to a working state with no disruption to traffic?

Moderator

9.4K Posts

August 26th, 2019 09:00

Hi,

If everything is the same, it should establish the heartbeat and sync up fine without a reboot. Page 1085 of the guide has some failure information: https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_networking/esuprt_net_fxd_prt_swtchs/force10-s4048-on_setup-guide8_en-us.pdf

12 Posts

August 30th, 2019 06:00

Thanks for the reply.

I've finally got a serial console physically connected to the switch. The prompt below shows that it has somehow gone into "debugger" mode:

db{1}>

Running dmesg shows:

WARNING: 3 errors while detecting hardware; check system log.
boot device: wd0
root on md0a dumps on wd0l
dump_misc_init: max_paddr = 0x7f800000
WARNING: clock lost 5578 days
WARNING: using filesystem time
WARNING: CHECK AND RESET THE DATE!
NMI ... going to debugger
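
If the console output is captured to a file, the warning lines and the "clock lost" value can be pulled out with a few lines of Python. This is only a sketch with made-up function names, and it assumes the message format is exactly as shown above.

```python
import re

DMESG = """\
WARNING: 3 errors while detecting hardware; check system log.
boot device: wd0
root on md0a dumps on wd0l
dump_misc_init: max_paddr = 0x7f800000
WARNING: clock lost 5578 days
WARNING: using filesystem time
WARNING: CHECK AND RESET THE DATE!
NMI ... going to debugger
"""


def warning_lines(dmesg):
    """Return every line that starts with 'WARNING:'."""
    return [line.strip() for line in dmesg.splitlines()
            if line.startswith("WARNING:")]


def clock_lost_days(dmesg):
    """Return the number of days from 'clock lost N days', or None."""
    match = re.search(r"clock lost (\d+) days", dmesg)
    return int(match.group(1)) if match else None


for line in warning_lines(DMESG):
    print(line)
print("clock lost days:", clock_lost_days(DMESG))
```

This only extracts the messages; it does not tell whether they are caused by the CPU bug.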

The dmesg output makes it seem that the switch might have hit the networking clock signal bug in the Atom CPU component. I'm not sure, though, since I only see the phrase "clock lost..." and am connecting it to "clock signal bug".

"Once the component has failed, the system CPU will stop functioning but traffic may continue to flow. Once encountered it is likely that the unit will not boot, and will not be recoverable. Typically the system or card will stop functioning and will hang or reboot continuously. The issue may not be observed until a reboot or power cycle occurs." <ADMIN NOTE: Broken link has been removed from this post by Dell>

Can the above warnings happen without the CPU bug? (Nobody I've talked to has experienced the bug or heard of anyone who has.)

Thanks

Moderator

9.4K Posts

August 30th, 2019 07:00

It could be that bug. Can you private message me the service tag?

12 Posts

August 30th, 2019 11:00

Hi Josh

Done. I sent you the log as well. It would be nice to know if it is that bug.

12 Posts

September 1st, 2019 10:00

An update.

First I ran a "reboot" command in the debugger command-line interface, which froze the switch instantly.

After that, I power cycled the switch (waited a few minutes), but nothing appeared on the serial console and all port lights went off. The switch is dead, and I believe the cause is the networking clock signal bug in the Atom CPU, since all signs of the bug are present.

I had a spare switch loaded with the same configuration (configured for a VLT setup). I powered off both the failed switch and the replacement, moved the cables over to the replacement, and powered it on. Everything returned to a good state. Uptime on the other switch, which has been working the whole time, is: Up Time : 3 yr, 0 wk, 4 day, 5 hr, 29 min

So I expect this switch will fail at some point too. Luckily VLT works, so there was no interruption.

 
