dell-richard g
605 Posts
0
June 20th, 2017 12:00
In an earlier comment, it was mentioned that there was an unusually high number of 802.3x pause frames.
If this issue occurs again, try to do the following:
1. Determine the rate of the pause frames. From the switch CLI, refresh the statistics screen every couple of seconds and check the difference in count between each refresh (an exact figure is not needed, just an estimate).
2. On the switch port connected to the failed server, check whether the switch sees the high number of pause frames on the Rx or the Tx side (in other words, whether the switch is receiving pause frames from the server or sending them to the failed server). The Rx and Tx counters will show that.
3. During the outage, if the NIC is flooding your switch with pause frames at a very high rate, it may cause unwanted congestion across the entire network, just as you experienced. Unfortunately, the broadcast storm features in that switch will not prevent such outages. Some of the newer Dell Force10 switches have an extra storm control feature specific to priority flow control and link-level flow control frames. When that feature is activated, the switch blocks offending ports in situations like yours (802.3x flooding), so the rest of the network is protected.
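The rate estimate in step 1 is just the difference between two counter snapshots divided by the refresh interval. A minimal sketch of that arithmetic (the counter values below are hypothetical placeholders; read the real ones off the switch statistics screen):

```python
def pause_frame_rate(count_before, count_after, interval_seconds):
    """Pause frames per second between two counter snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (count_after - count_before) / interval_seconds

# Hypothetical example: the Rx pause counter climbed from 10,000,000
# to 10,600,000 over a 5-second refresh interval.
rate = pause_frame_rate(10_000_000, 10_600_000, 5)
print(f"~{rate:.0f} pause frames/sec")  # ~120000 pause frames/sec
```

For step 2, the same statistics screen answers the direction question: a rapidly climbing Rx pause count on that switch port means the switch is receiving the pause frames from the server.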
Finally, you can also try disabling 802.3x flow control on the switch port connected to the offending server. That way, if the server BSODs again, there will be no pause frame flooding. Flow control may or may not be needed in your environment, so consider disabling it as an option.
dell-richard g
605 Posts
0
June 20th, 2017 15:00
As an administrator, you will gauge packet drops on the switch, server I/O latency and response times, whether users are complaining about a slow network, etc. The need for flow control is highly dependent on the network environment, such as the number of servers, I/O transfer size, oversubscription, and many other factors, which is why an admin should always monitor switch/network performance.
Some items for you to consider to resolve this issue, depending on time and expense:
1. Disabling/removing the offending NIC and installing a different vendor's NIC (e.g., Broadcom)
2. Disabling Tx and Rx flow control on the NIC (though the switch port is the best place, and/or both)
There are two possible sequences here:
1. The system BSODs for some reason, which causes the NIC to get into a bad state and send pause frames,
or
2. The NIC gets into a bad state, which then causes the BSOD, and the NIC sends out all the pause frames.
Regardless of which caused which, it "appears" that the NIC is unexpectedly sending out endless pause frames and bringing down the network. As you know, this is a catastrophic condition which needs to be neutralized. I would start by disabling flow control on the switch port AND the NIC, then gauge network performance as well as the state of the server.
But first, clear the counters and get an idea of the number of pause frames during normal daily operations. This will let you know when/where/if there is congestion on a switch.
That would be the most efficient way to isolate and resolve the problem. Otherwise, you may spend a long time on tech support calls doing debugging. Once this is all isolated, you can present your findings to tech support and be one step ahead of them.
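The "clear the counters, then baseline normal operation" advice can be automated with a small poller. `read_counter` below is a hypothetical stand-in for however you actually collect the cumulative Rx pause count (CLI scrape, SNMP, etc.); the sketch only shows the rate bookkeeping:

```python
import time

def sample_pause_rates(read_counter, interval=5, samples=12, sleep=time.sleep):
    """Take periodic counter snapshots and return frames/sec per interval.

    read_counter: callable returning the current cumulative pause-frame count
    interval:     seconds between snapshots
    samples:      number of rate samples to collect
    """
    prev = read_counter()
    rates = []
    for _ in range(samples):
        sleep(interval)
        cur = read_counter()
        rates.append((cur - prev) / interval)
        prev = cur
    return rates

# Hypothetical example with a canned sequence of counter readings:
# a healthy port ticking along, then a sudden flood.
readings = iter([0, 50, 100, 500_000])
rates = sample_pause_rates(lambda: next(readings), interval=5,
                           samples=3, sleep=lambda s: None)
print(rates)  # [10.0, 10.0, 99980.0]
```

Any sample far above the rates you see during normal daily operation points at congestion (or a flood) on that port.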
aobrien5
25 Posts
0
June 5th, 2017 13:00
The switch that the offending machine was connected to was a PC6248.
I have a capture from another switch that was being affected during the issue, and another capture from the switch that was originating the issue, taken during the issue but after I pulled the uplink.
How should I look for logged messages?
aobrien5
25 Posts
0
June 5th, 2017 14:00
So, the port in question had a significantly higher number of Multicast packets and 802.3x Pause Frames Received.
Unfortunately, the date and time are wrong, which makes reading the logs difficult, but all I see is stuff like this:
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7865 %% 1/0/43 is transitioned from the Learning state to the Forwarding state in instance 0
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[126257760]: traputil.c(610) 7866 %% Spanning Tree Topology Change: 0, Unit: 1
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7867 %% Link Down: 1/0/43
<189> NOV 10 05:47:30 10.1.0.46-1 TRAPMGR[153276384]: traputil.c(610) 7868 %% Link on 1/0/43 is failed
aobrien5
25 Posts
0
June 6th, 2017 07:00
Switch is up to date - 3.3.16.1
Multicast storm control was not enabled, only broadcast. The default storm control threshold is 5%, right?
I've enabled multicast storm control now. Hopefully that helps next time.
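For a sense of scale, a 5% storm-control threshold on a gigabit port still admits a very high rate of minimum-size frames. A back-of-envelope calculation, assuming 64-byte frames plus the standard 20 bytes of preamble and inter-frame gap on the wire (the threshold semantics may differ by switch model, so treat this as an estimate):

```python
LINE_RATE_BPS = 1_000_000_000      # gigabit port
THRESHOLD = 0.05                   # assumed 5% storm-control threshold
WIRE_BYTES = 64 + 8 + 12           # min frame + preamble + inter-frame gap

# Frames/sec that fit under the threshold at minimum frame size.
frames_per_sec = LINE_RATE_BPS * THRESHOLD / (WIRE_BYTES * 8)
print(f"~{frames_per_sec:,.0f} frames/sec under the threshold")
```

So roughly 74,000 minimum-size frames per second would still get through before storm control reacts, and as noted above, 802.3x pause frames are typically not counted by broadcast/multicast storm control at all.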
Anything else I should check?
Thank you for your help.
dell-richard g
605 Posts
0
June 15th, 2017 13:00
Curious as to what NIC vendor and model was in the offending PC.
aobrien5
25 Posts
0
June 20th, 2017 12:00
Intel(R) 82579LM Gigabit Network Connection
aobrien5
25 Posts
0
June 20th, 2017 12:00
Good info, thank you. Doing a show statistics on the port returns:
802.3x Pause Frames Received................... 10815356
802.3x Pause Frames Transmitted................ 2
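(Those two counters also answer the earlier Rx/Tx question: Received vastly exceeding Transmitted means the switch was receiving the pause frames, i.e. the server on that port was the source. A trivial check using the numbers above:)

```python
rx_pause = 10815356  # 802.3x Pause Frames Received (from show statistics)
tx_pause = 2         # 802.3x Pause Frames Transmitted

if rx_pause > tx_pause:
    print("switch is RECEIVING pause frames -> the attached server is the source")
else:
    print("switch is SENDING pause frames -> congestion is on the switch side")
```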
Granted, that counter probably hasn't been cleared for a while, but that number is way, way beyond anything else I see, and it hasn't incremented lately.
Obviously, I'll have to watch this next time something like this happens to determine the rate, like you said.
I appreciate the insight. Do you have a reference I can use to determine if I should be disabling flow control?
aobrien5
25 Posts
0
June 21st, 2017 07:00
This is all very helpful, thank you.