Unsolved


1 Rookie • 35 Posts

November 12th, 2013 03:00

Optimizing failover time on a 6224 stack

Last week I installed a new stacking unit in our network. Before that, we had only one core switch, which is obviously a very large SPOF. This was the first time I've configured a stack in a network. It all works great, but I think our failover time could be faster in case one switch goes down.

It takes about 30 seconds before a client is connected again to all the network resources. 30 seconds typically sounds like spanning tree may be causing this, but I am not sure of that. I tested the failover time by unplugging the power from one switch, which simulates a UPS or other power failure (of course, each switch is connected to a different UPS).
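For what it's worth, that 30-second figure lines up exactly with classic 802.1D spanning tree timing. If a port were going through the legacy Listening and Learning states, the arithmetic (using the standard defaults, which match the Hello/Max Age/Forward Delay values this stack reports) works out like this:

```python
# Classic 802.1D spanning tree defaults, in seconds (the same values
# this stack reports: Hello 2, Max Age 20, Forward Delay 15).
HELLO_TIME = 2
MAX_AGE = 20
FORWARD_DELAY = 15

# After a topology change, a legacy STP port spends Forward Delay in
# Listening plus Forward Delay in Learning before it forwards traffic:
transition = 2 * FORWARD_DELAY
print(transition)   # 30 seconds

# Worst case, when stale root information must first age out:
worst_case = MAX_AGE + 2 * FORWARD_DELAY
print(worst_case)   # 50 seconds
```

RSTP normally converges much faster than this, so a 30-second outage is a hint worth checking that every link in the path is really negotiating RSTP rather than falling back to legacy STP behaviour.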

Our configuration / network looks like this:

  • Two 6224s, stacked
  • Most of our access switches are 35xxP series

The access switches are connected directly to the core stack over two links, one to each switch of the stack. Those two links are configured in port-channel mode to increase bandwidth and provide automatic failover.

For example, on access switch one, g3 is connected to 1/g6 of our stack and g4 is connected to 2/g6.

The partial configuration:
6224 stack:
stack
member 1 1
member 2 1
exit
switch 1 priority 12
switch 2 priority 10
~~~~~
interface ethernet 1/g6
channel-group 6 mode auto
exit
interface ethernet 2/g11
channel-group 6 mode auto
exit
~~~~~
interface port-channel 6
service-policy in
spanning-tree guard root
switchport mode general
switchport general allowed vlan add 12,20 tagged
switchport general allowed vlan add 1 tagged
exit

Spanning tree:

Spanning tree :Enabled - BPDU Flooding :Disabled - Portfast BPDU filtering :Disabled - mode :rstp
CST Regional Root: 10:00:D0:67:E5:75:6D:D9
Regional Root Path Cost: 0
ROOT ID
Address D0:67:E5:75:6D:D9
This Switch is the Root.
Hello Time 2 Sec Max Age 20 sec Forward Delay 15 sec TxHoldCount 6 sec

Access switches:
spanning-tree mode rstp
interface range ethernet e(1-24)
spanning-tree portfast
exit
interface range ethernet e(1-24)
spanning-tree bpduguard
exit
interface port-channel 1
switchport mode general
switchport general allowed vlan add 12,20
exit
~~~~
interface range ethernet g(3-4)
channel-group 1 mode auto
exit

One idea I had was to disable spanning tree, but spanning tree should never be disabled entirely, because there is always the possibility of creating a loop by accident. For testing purposes I disabled STP on the port-channel on both sides, but that didn't speed up the failover time.

Does anyone see optimizations to speed up the failover time? In my test environment the failover time was almost zero, or one lost ping transmit (I thought I had the same config, but apparently not...). It would be great if I could achieve that time again.

1 Rookie • 35 Posts

November 12th, 2013 23:00

There are no specific commands to tune failover performance. But we can look things over and make sure configurations are optimal.

 

I recommend leaving spanning tree enabled. However, portfast and bpduguard should only be used on ports that connect to end devices like a server or client; those settings should not be configured on switch-to-switch connections.

Agree, that's what we use currently.
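Using only the commands already shown in this thread, that split looks roughly like this on an access switch (a sketch; the port ranges are examples, not our exact layout):

```
! End-device ports: portfast + bpduguard
interface range ethernet e(1-24)
spanning-tree portfast
spanning-tree bpduguard
exit
! Uplink port-channel to the stack: plain RSTP, no portfast/bpduguard
interface port-channel 1
switchport mode general
exit
```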

 

It sounds like you have the right idea with the split LAG on the switch to switch connection. When testing are you just pinging from one client to another and timing how long the destination is unreachable?

That's indeed the way I am testing.
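To take the stopwatch out of that test, the ping loop can be scripted so the outage window is computed from timestamps instead of eyeballed. A minimal sketch (Python calling the Linux iputils ping; the host address and helper names are placeholders, not from this thread):

```python
import subprocess
import time

def probe(host: str, timeout_s: float = 1.0) -> bool:
    """Send one ICMP echo via the system ping (Linux iputils syntax)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(int(timeout_s)), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def longest_outage(samples):
    """samples: list of (timestamp_s, ok) pairs. Returns the longest
    stretch of time between two consecutive successful probes."""
    ok_times = [t for t, ok in samples if ok]
    return max((b - a for a, b in zip(ok_times, ok_times[1:])), default=0.0)

def measure(host: str, duration_s: float = 60.0, interval_s: float = 0.2):
    """Probe host for duration_s and report the worst gap seen."""
    samples, start = [], time.monotonic()
    while time.monotonic() - start < duration_s:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval_s)
    return longest_outage(samples)
```

Run it against a host behind the stack, pull the power on one member mid-run, and the return value is the failover window in seconds; `longest_outage` can also be fed a parsed ping log.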

 

What firmware level is the stack at? Having the firmware up to date can help with operability.

http://www.dell.com/support/drivers/us/en/19/DriverDetails/Product/powerconnect-6224?driverId=77XG3&osCode=NAA&fileId=3288111910&languageCode=en&categoryId=NI

The current firmware version is 3.3.7.3, so that's almost the latest version. I'll try the latest version next time we have maintenance.

 

Do test results differ if you bring down the master switch vs. the secondary switch in the stack?

I am not sure about this one. There is a difference depending on which switch goes down, but I am not sure whether this has something to do with the master switch or just with the active path... I'll test this in the near future as well.

1 Rookie • 35 Posts

November 21st, 2013 00:00

Yesterday I tested the failover times again, and I think I have some interesting information.

First, I updated the firmware from 3.3.7.3 to 3.3.8.2.

Second, I ran several tests with and without spanning tree, temporarily disabling it on all the switches between the end devices. The failover time was the same either way, so spanning tree is not the issue for me.

  • There is a difference in failover time between taking down the master switch and taking down the secondary switch. To understand the situation better, I would like to make clear what the master switch exactly is: I think the master switch is the one where the CLI is available and whose master LED is lit. If the master switch goes down, the slave becomes the master and keeps the master role, even when the previous master switch comes back online.
  • The 1/8 and 2/7 LEDs indicate which switch is the preferred master, but there is no guarantee that the 1/8 switch will become the master. If the 2/7 switch boots earlier (or if switch 1/8 is down), 2/7 will become the master and remain master until the stack reloads.

Are these two thoughts correct?

If so, we can continue looking at the failover times.

If the slave switch goes down, the failover time is 15-20 seconds, no matter whether spanning tree is enabled or disabled.

If the master switch goes down, the failover time is about 45 seconds to a client in our server farm, and about 15-20 seconds to a client on another access switch. To make things easier to understand, I've created a partial network drawing. I am talking about stack 1; I don't touch stack 2 in these tests:

 

 

All the switches have RSTP enabled. Stack 1 is the root bridge for STP. Here are the switch details:

Switch               Firmware   Model
Lanstack 1           3.3.8.2    6224
Lanstack 2           2.2.0.3    6224
Access switch left   2.0.0.48   3524P
Access switch right  2.0.0.48   3548P

Maybe I should upgrade the firmware of our second LAN stack, but first I want to share this: when unplugging the slave switch, I see many log messages, but I am not sure whether something is wrong or not:

Trying to attach more units.....

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 658 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 659 %% Msg Send failed for port 625

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 661 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 662 %% Msg Send failed for port 634

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 666 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 667 %% Msg Send failed for port 635

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 670 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 671 %% Msg Send failed for port 630

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 672 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 673 %% Msg Send failed for port 629

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 674 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 675 %% Msg Send failed for port 631

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 677 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 678 %% Msg Send failed for port 633

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 680 %% Checkpoint message transmission to unit 1 failed for IP(25).

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 681 %% Checkpoint message transmission to unit 1 failed for IP(25).

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 682 %% Checkpoint message transmission to unit 1 failed for IP(25).

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 683 %% Checkpoint message transmission to unit 1 failed for ARP(116).

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 684 %% Checkpoint message transmission to unit 1 failed for ARP(116).

<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 685 %% Checkpoint message transmission to unit 1 failed for ARP(116).

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 687 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 688 %% Msg Send failed for port 627

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 690 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 691 %% Msg Send failed for port 628

<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 693 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 694 %% Msg Send failed for port 632

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 701 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 702 %% Msg Send failed for port 635

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 703 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 704 %% Msg Send failed for port 634

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 705 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 706 %% Msg Send failed for port 630

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 707 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 708 %% Msg Send failed for port 629

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 709 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 710 %% Msg Send failed for port 625

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 716 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 728 %% msMsgSend: message transmission to unit 1, failed

<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 729 %% Msg Send failed for port 632

The last thing that got my attention: there is a huge difference between shutting down a whole switch and pulling one cable between the access switch and the lanstack. With a cable pull, the failover time was only one ping time-out. Of course I unplugged both cables, not at the same time but each after reconnecting the other, so it has nothing to do with coincidence or anything like that.

What are your thoughts about these results?

 

1 Rookie • 35 Posts

November 25th, 2013 03:00

NSF seems to be enabled already. nsf is not a recognized command in (config)# mode.

 

LANStack2#show nsf

Administrative Status.......................... Enable
Operational Status............................. Enable
Last Startup Reason............................ Cold Auto-Restart
Time Since Last Restart........................ 4 days 17 hrs 48 mins 11 secs
Restart In Progress............................ No
Warm Restart Ready............................. Yes

Copy of Running Configuration to Backup Unit:   
Status...................................... Current   
Time Since Last Copy........................ 0 days 2 hrs 35 mins 17 secs

Unit    NSF Support    
----    -----------       
1        Yes       
2        Yes

 

Stack configuration:

LANStack2#show switch

 

SW  Management  Standby    Preconfig  Plugged-in  Switch   Code
    Status      Status     Model ID   Model ID    Status   Version
--- ----------  ---------  ---------  ----------  -------  -------
1   Mgmt Sw                PCT6224    PCT6224     OK       3.3.8.2
2   Stack Mbr   Oper Stby  PCT6224    PCT6224     OK       3.3.8.2

 

LANStack2#show switch 1

 

Switch............................ 1

Management Status................. Management Switch

Hardware Management Preference.... Unassigned

Admin Management Preference....... 12

Switch Type....................... 0xd3140001

Preconfigured Model Identifier.... PCT6224

Plugged-in Model Identifier....... PCT6224

Switch Status..................... OK

Switch Description................ PowerConnect 6224

Expected Code Type................ 0x100b000

Detected Code Version............. 3.3.8.2

Detected Code in Flash............ 3.3.8.2

Serial Number..................... CN0RN8562829822Q0045A16

Up Time........................... 4 days 18 hrs 0 mins 25 secs

 

LANStack2#show switch 2

 

Switch............................ 2

Management Status................. Stack Member

Hardware Management Preference.... Unassigned

Admin Management Preference....... 10

Switch Type....................... 0xd3140001

Preconfigured Model Identifier.... PCT6224

Plugged-in Model Identifier....... PCT6224

Switch Status..................... OK

Switch Description................ PowerConnect 6224

Expected Code Type................ 0x100b000

Detected Code Version............. 3.3.8.2

Detected Code in Flash............ 3.3.8.2

Serial Number..................... CN0RN8562829823G0012A16

Up Time........................... 4 days 17 hrs 52 mins 30 secs

 
