
June 5th, 2017 09:00

network brought down by crashed pc

Twice in the last few months, two different machines have BSODed and, while in that state, brought down half of our network.  The crashed machine is somehow flooding its switch, and the switch it's uplinked to, preventing either of them from passing any traffic.  I don't see anything out of the ordinary in Wireshark, but I don't really know what I'm looking for.  Spanning Tree and Storm Control (with their current configuration) failed to make any difference.  What is causing this, and how can I prevent it in the future?

June 20th, 2017 12:00

In an earlier comment, it was mentioned that there was an unusually high number of 802.3x pause frames.

If this issue occurs again, try, if possible, to do the following:

1.  Determine the rate of the pause frames.  From the switch CLI, refresh the statistics screen every couple of seconds and check the difference in the counter between refreshes.  (An exact figure isn't needed, just an estimate; see the sketch after this list.)

2.  On the switch port connected to the failed server, determine whether the switch sees the high number of pause frames on the Rx or the Tx side (in other words, is the switch receiving pause frames from the server, or is the switch sending them to the failed server?).  The Rx and Tx counters will show that.

3.  During the outage, if the NIC is flooding your switch with pause frames at a very high rate, it can cause unwanted congestion across the entire network, just as you experienced.  Unfortunately, the broadcast storm control features in that switch will not prevent such outages.  Some of the newer Dell Force10 switches have an additional storm control feature specific to priority flow control and link-level flow control frames; when it is activated, the switch blocks offending ports in situations like yours (802.3x flooding), protecting the rest of the network.
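As a rough illustration of step 1, watching the counter might look like the following sketch (the interface ID and the counts are made up for the example; adjust for your port):

console# show statistics ethernet 1/g1
802.3x Pause Frames Received................... 1000000

(wait roughly 10 seconds, then run the command again)

console# show statistics ethernet 1/g1
802.3x Pause Frames Received................... 1150000

A jump of about 150,000 frames in 10 seconds works out to roughly 15,000 pause frames per second, which is far beyond normal flow control activity.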

Finally, you can also try disabling 802.3x flow control on the switch port connected to the offending server.  That way, if it BSODs again, there will be no pause frame flooding.  Flow control may or may not be needed in your environment, so weigh that as an option.

June 20th, 2017 15:00

As an administrator, you would gauge packet drops on the switch, server I/O latency and response times, whether users are complaining about a slow network, etc.  Whether flow control is needed depends heavily on the network environment: the number of servers, I/O transfer sizes, oversubscription, and many other factors, which is why an admin should always monitor switch and network performance.

Some items to consider to resolve this issue, depending on your time and expense constraints:

1.  Disabling/removing the offending NIC and installing a NIC from a different vendor (e.g., Broadcom)

2.  Disabling TX and RX flow control on the NIC (though the switch port is the best place to do this, or do both)

There are two possibilities here:

1.  The system BSODs for some reason, which puts the NIC into a bad state where it sends out pause frames

or

2.  The NIC gets into a bad state, which then causes the BSOD, and the NIC sends out all the pause frames

Regardless of which caused which, it "appears" that the NIC is unexpectedly sending out a continuous stream of pause frames and bringing down the network.  As you know, this is a catastrophic condition that needs to be neutralized.  I would start by disabling flow control on the switch port AND the NIC, then gauge network performance as well as the state of the server.

But first, clear the counters and get an idea of the number of pause frames during normal daily operations.  This will let you know when/where/if there is congestion on a switch; a minimal routine is sketched below.
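A minimal baseline routine might look like the following sketch (I'm assuming the firmware supports clearing interface counters from privileged exec mode; check the CLI reference if the exact command differs on your switch):

console# clear counters
console# show statistics ethernet 1/g1

Repeat the show command periodically during a normal day and note how the 802.3x pause frame counters grow.  Near-zero growth during normal operation makes a sudden multi-million-frame jump during an outage easy to spot.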

The above would be the most efficient way to isolate and resolve the problem; otherwise, you may spend a long time on tech support calls debugging.  Once this is all isolated, you can present your findings to tech support and be one step ahead of them.


June 5th, 2017 12:00

What model switch are you working with? Did the switch log any messages around the time of the issue? A packet capture would be beneficial if it was taken during the issue, but won't be helpful if taken after the issue.


June 5th, 2017 13:00

The switch that the offending machine was connected to was a PC6248.

I have a capture from another switch that was being affected during the issue, and another capture from the switch that was originating the issue, taken during the issue but after I pulled the uplink.

How should I look for logged messages?


June 5th, 2017 13:00

To view the logs, run the command # show logging. Look for anything logged around the time of the issue.

Comparing the counters on the interfaces can help show what type of packets might have been flooding the switch.

Run the following command on the interface that had the problematic client, and on an interface that is normal.

# show statistics ethernet 1/g1

Then compare the counters with each other; they won't be exactly the same, but the comparison should give you an indicator.
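For example, the problem port versus a healthy port might look something like this (interface IDs and counter values are made up for illustration):

console# show statistics ethernet 1/g1
802.3x Pause Frames Received................... 8421907

console# show statistics ethernet 1/g24
802.3x Pause Frames Received................... 3

A counter that is several orders of magnitude higher on one port than on its peers is the kind of difference to look for.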


June 5th, 2017 14:00

So, the port in question had a significantly higher number of Multicast packets and 802.3x Pause Frames Received.

Unfortunately, the switch's date and time are wrong, which makes the logs difficult to correlate, but all I see is entries like this:

<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7865 %% 1/0/43 is transitioned from the Learning state to the Forwarding state in instance 0
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[126257760]: traputil.c(610) 7866 %% Spanning Tree Topology Change: 0, Unit: 1
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7867 %% Link Down: 1/0/43
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7868 %% Link on 1/0/43 is failed


June 6th, 2017 06:00

I would look at tuning storm control to be more restrictive on the multicast traffic.

Example:

console(config-if-1/g1)#storm-control multicast level 2

This would limit multicast traffic to 2% of the link speed. You can also use the rate option for kbps.

console(config-if-1/g1)#storm-control multicast rate 50

You can use the command # show storm-control to see what the current settings are.
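Putting those commands together, a complete session might look like the following (a sketch only; the interface ID and the 2% level are just the examples from this post, not a recommendation for every environment):

console# configure
console(config)# interface ethernet 1/g1
console(config-if-1/g1)# storm-control multicast level 2
console(config-if-1/g1)# exit
console(config)# exit
console# show storm-control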

What firmware is the switch currently at? Having the switch up to date could potentially help with how it handles these situations.

http://dell.to/2qwkOnX


June 6th, 2017 07:00

Switch is up to date - 3.3.16.1

Multicast storm control was not enabled, only broadcast.  The default storm control is 5%, right?

I've enabled multicast storm control now.  Hopefully that helps next time.

Anything else I should check?

Thank you for your help.


June 6th, 2017 07:00

The default is 5%, and that is probably a safe setting to keep. I cannot think of anything else right now. We have identified the traffic type being flooded and implemented multicast storm control to try to keep this from affecting the switch as much.

On the client side, you may want to double-check that the client is fully up to date, and further troubleshoot why the blue screen is occurring.

June 15th, 2017 13:00

Curious as to what NIC vendor and model was in the offending PC.  


June 20th, 2017 12:00

Intel(R) 82579LM Gigabit Network Connection


June 20th, 2017 12:00

Good info, thank you.  Doing a show statistics on the port returns:

802.3x Pause Frames Received................... 10815356

802.3x Pause Frames Transmitted................ 2

Granted, that counter probably hasn't been cleared for a while, but that number is way, way beyond anything else I see, and it hasn't incremented lately. 

Obviously, I'll have to watch this next time something like this happens to determine the rate, like you said.

I appreciate the insight. Do you have a reference I can use to determine if I should be disabling flow control?


June 21st, 2017 07:00

This is all very helpful, thank you.
