dell-richard g
605 Posts
0
June 20th, 2017 12:00
In an earlier comment, it was mentioned that there was an unusually high number of 802.3x pause frames.
If this issue occurs again, try to do the following:
1. Determine the rate of the pause frames. From the switch CLI, refresh the statistics screen every couple of seconds and check the difference in count between each refresh (an exact figure is not needed, just an estimate).
2. On the switch port connected to the failed server, check whether the switch sees the high number of pause frames on the Rx or the Tx side (in other words, whether the switch is receiving pause frames from the server or sending them to the failed server). The Rx and Tx counters will show that.
3. During the outage, if the NIC is flooding your switch with pause frames at a very high rate, it may cause unwanted congestion across the entire network, just as you experienced. Unfortunately, the broadcast storm features in that switch will not prevent such outages. Some of the newer Dell Force10 switches have an extra storm control feature specific to priority flow control and link-level flow control frames. When that feature is activated, the switch blocks offending ports in situations like yours (802.3x flooding), so the rest of the network is protected.
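The rate estimate in step 1 is just the difference between two counter snapshots divided by the refresh interval. A minimal sketch of that arithmetic (the counter values below are hypothetical placeholders; read the real ones off the switch statistics screen):

```python
def pause_frame_rate(count_before, count_after, interval_seconds):
    """Pause frames per second between two counter snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (count_after - count_before) / interval_seconds

# Hypothetical example: the Rx pause counter climbed from 10,000,000
# to 10,600,000 over a 5-second refresh interval.
rate = pause_frame_rate(10_000_000, 10_600_000, 5)
print(f"~{rate:.0f} pause frames/sec")  # ~120000 pause frames/sec
```

For step 2, the same statistics screen answers the direction question: a rapidly climbing Rx pause count on that switch port means the switch is receiving the pause frames from the server.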
Finally, you can also try disabling 802.3x flow control on the switch port connected to the offending server. That way, if the server BSODs again, there will be no pause frame flooding. Flow control may or may not be needed in your environment, so consider disabling it as an option.
dell-richard g
605 Posts
0
June 20th, 2017 15:00
As an administrator, you will gauge packet drops on the switch, server I/O latency and response times, whether users are complaining about a slow network, etc. The need for flow control is highly dependent on the network environment, such as the number of servers, I/O transfer size, oversubscription, and many other factors, which is why an admin should always monitor switch/network performance.
Some items for you to consider to resolve this issue, depending on time and expense:
1. Disabling/removing the offending NIC and installing a different vendor's NIC (e.g., Broadcom)
2. Disabling Tx and Rx flow control on the NIC (though the switch port is the best place, and/or both)
There are two possible sequences here:
1. The system BSODs for some reason, which causes the NIC to get into a bad state and send pause frames,
or
2. The NIC gets into a bad state, which then causes the BSOD, and the NIC sends out all the pause frames.
Regardless of which caused which, it "appears" that the NIC is unexpectedly sending out endless pause frames and bringing down the network. As you know, this is a catastrophic condition which needs to be neutralized. I would start by disabling flow control on the switch port AND the NIC, then gauge network performance as well as the state of the server.
But first, clear the counters and get an idea of the number of pause frames during normal daily operations. This will let you know when/where/if there is congestion on a switch.
That would be the most efficient way to isolate and resolve the problem. Otherwise, you may spend a long time on tech support calls doing debugging. Once this is all isolated, you can present your findings to tech support and be one step ahead of them.
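The "clear the counters, then baseline normal operation" advice can be automated with a small poller. `read_counter` below is a hypothetical stand-in for however you actually collect the cumulative Rx pause count (CLI scrape, SNMP, etc.); the sketch only shows the rate bookkeeping:

```python
import time

def sample_pause_rates(read_counter, interval=5, samples=12, sleep=time.sleep):
    """Take periodic counter snapshots and return frames/sec per interval.

    read_counter: callable returning the current cumulative pause-frame count
    interval:     seconds between snapshots
    samples:      number of rate samples to collect
    """
    prev = read_counter()
    rates = []
    for _ in range(samples):
        sleep(interval)
        cur = read_counter()
        rates.append((cur - prev) / interval)
        prev = cur
    return rates

# Hypothetical example with a canned sequence of counter readings:
# a healthy port ticking along, then a sudden flood.
readings = iter([0, 50, 100, 500_000])
rates = sample_pause_rates(lambda: next(readings), interval=5,
                           samples=3, sleep=lambda s: None)
print(rates)  # [10.0, 10.0, 99980.0]
```

Any sample far above the rates you see during normal daily operation points at congestion (or a flood) on that port.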
aobrien5
25 Posts
0
June 5th, 2017 13:00
The switch that the offending machine was connected to was a PC6248.
I have a capture from another switch that was being affected during the issue, and another capture from the switch that was originating the issue, taken during the issue but after I pulled the uplink.
How should I look for logged messages?
aobrien5
25 Posts
0
June 5th, 2017 14:00
So, the port in question had a significantly higher number of Multicast packets and 802.3x Pause Frames Received.
Unfortunately, the date and time are wrong, which makes reading the logs difficult, but all I see is stuff like this:
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7865 %% 1/0/43 is transitioned from the Learning state to the Forwarding state in instance 0
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[126257760]: traputil.c(610) 7866 %% Spanning Tree Topology Change: 0, Unit: 1
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7867 %% Link Down: 1/0/43
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7868 %% Link on 1/0/43 is failed
aobrien5
25 Posts
0
June 6th, 2017 07:00
Switch is up to date - 3.3.16.1
Multicast storm control was not enabled, only broadcast. The default storm control threshold is 5%, right?
I've enabled multicast storm control now. Hopefully that helps next time.
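For a sense of scale, a 5% storm-control threshold on a gigabit port still admits a very high rate of minimum-size frames. A back-of-envelope calculation, assuming 64-byte frames plus the standard 20 bytes of preamble and inter-frame gap on the wire (the threshold semantics may differ by switch model, so treat this as an estimate):

```python
LINE_RATE_BPS = 1_000_000_000      # gigabit port
THRESHOLD = 0.05                   # assumed 5% storm-control threshold
WIRE_BYTES = 64 + 8 + 12           # min frame + preamble + inter-frame gap

# Frames/sec that fit under the threshold at minimum frame size.
frames_per_sec = LINE_RATE_BPS * THRESHOLD / (WIRE_BYTES * 8)
print(f"~{frames_per_sec:,.0f} frames/sec under the threshold")
```

So roughly 74,000 minimum-size frames per second would still get through before storm control reacts, and as noted above, 802.3x pause frames are typically not counted by broadcast/multicast storm control at all.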
Anything else I should check?
Thank you for your help.
dell-richard g
605 Posts
0
June 15th, 2017 13:00
Curious as to what NIC vendor and model was in the offending PC.
aobrien5
25 Posts
0
June 20th, 2017 12:00
Intel(R) 82579LM Gigabit Network Connection
aobrien5
25 Posts
0
June 20th, 2017 12:00
Good info, thank you. Doing a show statistics on the port returns:
802.3x Pause Frames Received................... 10815356
802.3x Pause Frames Transmitted................ 2
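(Those two counters also answer the earlier Rx/Tx question: Received vastly exceeding Transmitted means the switch was receiving the pause frames, i.e. the server on that port was the source. A trivial check using the numbers above:)

```python
rx_pause = 10815356  # 802.3x Pause Frames Received (from show statistics)
tx_pause = 2         # 802.3x Pause Frames Transmitted

if rx_pause > tx_pause:
    print("switch is RECEIVING pause frames -> the attached server is the source")
else:
    print("switch is SENDING pause frames -> congestion is on the switch side")
```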
Granted, that counter probably hasn't been cleared for a while, but that number is way, way beyond anything else I see, and it hasn't incremented lately.
Obviously, I'll have to watch this next time something like this happens to determine the rate, like you said.
I appreciate the insight. Do you have a reference I can use to determine if I should be disabling flow control?
aobrien5
25 Posts
0
June 21st, 2017 07:00
This is all very helpful, thank you.