November 12th, 2013 03:00
Optimizing failover time 6224 stack
Last week I installed a new stacking unit in our network. Before that, we had only one core switch, which is obviously a very large SPOF. This was the first time I have configured a stack in a network. It all works great, but I think our failover time could be faster when one switch goes down.
It takes about 30 seconds before a client is connected to all the network resources again. 30 seconds sounds like spanning tree may be causing this, but I am not sure of that. I tested the failover time by unplugging the power of one switch, which simulates a UPS or other power failure (of course, each switch is connected to a different UPS).
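For reference, a minimal sketch of how a failover time like this can be measured from ping results instead of being estimated by eye. It assumes a list of (timestamp, reply received) samples has already been collected, for example one ping per second. The function name and the sample data are made up for illustration.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    An outage is measured from the last successful reply before a loss
    to the first successful reply after it. Losses before the first
    reply or after the last reply are ignored, because their real
    duration is unknown.
    """
    longest = 0.0
    last_ok = None      # timestamp of the most recent successful reply
    in_outage = False   # True once at least one reply has been lost
    for timestamp, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


if __name__ == "__main__":
    # Hypothetical run: one ping per second, replies lost from t=10 to t=39.
    samples = [(t, not (10 <= t < 40)) for t in range(60)]
    print(f"longest outage: {longest_outage(samples)} s")
```

Measuring from the last good reply to the next good reply slightly overestimates the outage (by up to one ping interval), which is the safer direction when comparing test runs.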
Our configuration / network looks like this:
Two 6224's stacked
Most of our access switches are 35xxP series
The access switches are connected directly to the core stack over two links, one to each switch of the stack. Those two links are configured as a port channel to increase bandwidth and to fail over automatically.
For example, on access switch one, G3 is connected to 1/g6 of our stack and G4 is connected to 2/g6.
The partial configuration:
6224 stack:
stack
member 1 1
member 2 1
exit
switch 1 priority 12
switch 2 priority 10
~~~~~
interface ethernet 1/g6
channel-group 6 mode auto
exit
interface ethernet 2/g11
channel-group 6 mode auto
exit
~~~~~
interface port-channel 6
service-policy in
spanning-tree guard root
switchport mode general
switchport general allowed vlan add 12,20 tagged
switchport general allowed vlan add 1 tagged
exit
Spanning tree:
Spanning tree :Enabled - BPDU Flooding :Disabled - Portfast BPDU filtering :Disabled - mode :rstp
CST Regional Root: 10:00:D0:67:E5:75:6D:D9
Regional Root Path Cost: 0
ROOT ID
Address D0:67:E5:75:6D:D9
This Switch is the Root.
Hello Time 2 Sec Max Age 20 sec Forward Delay 15 sec TxHoldCount 6 sec
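As a rough check on why 30 seconds points towards spanning tree: in classic 802.1D spanning tree, a port that has to move to forwarding spends one forward delay in listening and one in learning, and for an indirect failure the stored BPDU information first has to age out (max age). RSTP normally avoids these timers with a proposal/agreement handshake on point-to-point links, so the numbers below are only the timer-based fallback case, not what RSTP is expected to take. The function name is made up for illustration.

```python
def timer_based_convergence(forward_delay: int, max_age: int) -> dict:
    """Timer-based convergence estimates (seconds) for classic 802.1D STP.

    direct:   listening + learning, each lasting one forward delay
    indirect: max age has to expire first, then listening + learning
    """
    listen_plus_learn = 2 * forward_delay
    return {
        "direct_failure_s": listen_plus_learn,
        "indirect_failure_s": max_age + listen_plus_learn,
    }


if __name__ == "__main__":
    # Values taken from the spanning-tree output above:
    # Forward Delay 15 sec, Max Age 20 sec.
    print(timer_based_convergence(forward_delay=15, max_age=20))
```

With the timers shown above this gives 30 seconds for a direct failure and 50 seconds for an indirect one, so an outage close to either value would suggest a port falling back to timer-based transitions.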
Access switches:
spanning-tree mode rstp
interface range ethernet e(1-24)
spanning-tree portfast
exit
interface range ethernet e(1-24)
spanning-tree bpduguard
exit
interface port-channel 1
switchport mode general
switchport general allowed vlan add 12,20
exit
~~~~
interface range ethernet g(3-4)
channel-group 1 mode auto
exit
One idea I had was to disable spanning tree, but spanning tree should never be disabled, because it is always possible to create a loop by accident. For testing purposes I disabled STP on the port channel on both sides, but that didn't speed up the failover.
Does anyone see optimizations to speed up the failover time? In my test environment I had a failover time of almost zero, or one ping transmit failure (I thought I had the same config there, but it seems not...). It would be great if I could achieve this time again.


vwijhe
November 12th, 2013 23:00
Agreed, that's what we use currently.
That's indeed the way I am testing.
The current firmware version is 3.3.7.3, so that's almost the latest version. I'll try the latest version the next time we have maintenance.
I am not sure about this one. There is a difference depending on which switch goes down, but I am not sure whether this has to do with the master switch or just with the active path... I'll test this in the near future as well.
vwijhe
November 21st, 2013 00:00
Yesterday I tested the failover times again. I think I have some interesting information.
First, I updated the firmware from 3.3.7.3 to 3.3.8.2.
Second, I ran several tests with and without spanning tree. The failover time is the same whether spanning tree is enabled or not. For the tests without it, I temporarily disabled spanning tree on all the switches between the end devices. So spanning tree does not seem to be the issue for me.
Are these two thoughts correct?
If so, I think we can continue looking at the failover times.
If the slave switch goes down, the failover time is 15-20 seconds, whether spanning tree is enabled or disabled.
If the master switch goes down, the failover time is about 45 seconds to a client in our server farm, and about 15-20 seconds to a client on another access switch. To make things easier to understand for all of you, I've created a partial network drawing. I am talking about stack 1. I don't touch stack 2 in these tests:
All the switches have RSTP enabled. Stack 1 is the root bridge for STP. Here are the switch details:
Maybe I should upgrade the firmware of our second LAN stack, but I want to share this information: when unplugging the slave switch, I see many log messages, and I am not sure whether something is wrong:
Trying to attach more units.....
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 658 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 659 %% Msg Send failed for port 625
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 661 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 662 %% Msg Send failed for port 634
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 666 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 667 %% Msg Send failed for port 635
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 670 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 671 %% Msg Send failed for port 630
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 672 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 673 %% Msg Send failed for port 629
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 674 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 675 %% Msg Send failed for port 631
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 677 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 678 %% Msg Send failed for port 633
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 680 %% Checkpoint message transmission to unit 1 failed for IP(25).
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 681 %% Checkpoint message transmission to unit 1 failed for IP(25).
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 682 %% Checkpoint message transmission to unit 1 failed for IP(25).
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 683 %% Checkpoint message transmission to unit 1 failed for ARP(116).
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 684 %% Checkpoint message transmission to unit 1 failed for ARP(116).
<188> MAY 25 11:13:26 192.168.20.1-2 CKPT[156926768]: ckpt_task.c(363) 685 %% Checkpoint message transmission to unit 1 failed for ARP(116).
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 687 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 688 %% Msg Send failed for port 627
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 690 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 691 %% Msg Send failed for port 628
<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 693 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 694 %% Msg Send failed for port 632
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 701 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 702 %% Msg Send failed for port 635
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 703 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 704 %% Msg Send failed for port 634
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 705 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 706 %% Msg Send failed for port 630
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 707 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 708 %% Msg Send failed for port 629
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 709 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 710 %% Msg Send failed for port 625
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 716 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 728 %% msMsgSend: message transmission to unit 1, failed
<188> MAY 25 11:13:28 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 729 %% Msg Send failed for port 632
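A burst like this is easier to judge when it is counted per second and per component. Below is a minimal sketch that does this, assuming each message has the layout shown above (priority, timestamp, host, component name followed by a task ID in brackets). The regular expression and the function name are my own and are only meant to fit these lines.

```python
import re
from collections import Counter

# <prio> MON DD HH:MM:SS host COMPONENT[task-id]: ...
LOG_RE = re.compile(
    r"^<\d+>\s+(\w{3} \d+ \d\d:\d\d:\d\d)\s+\S+\s+(\w+)\[\d+\]:"
)


def summarize_log(lines):
    """Count messages per (timestamp, component); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "<188> MAY 25 11:13:26 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 658 %% msMsgSend: message transmission to unit 1, failed",
        "<188> MAY 25 11:13:26 192.168.20.1-2 DOT1S[132019888]: dot1s_transport.c(254) 659 %% Msg Send failed for port 625",
        "<188> MAY 25 11:13:28 192.168.20.1-2 SIM[132019888]: ms_api.c(481) 701 %% msMsgSend: message transmission to unit 1, failed",
    ]
    for (timestamp, component), count in sorted(summarize_log(sample).items()):
        print(timestamp, component, count)
```

In the output above, all messages fall within about two seconds and all refer to unit 1, which is the unit that was unplugged.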
The last thing that got my attention was the huge difference between shutting down one switch and pulling one cable between the access switch and the LAN stack. With a pulled cable, the failover took only one ping time-out. Of course I tested both cables, not at the same time but the second one after reconnecting the first, so it is not a coincidence.
What are your thoughts about these results?
vwijhe
November 25th, 2013 03:00
NSF seems to be enabled already. "nsf" is not a recognized command in (config)# mode.
LANStack2#show nsf
Administrative Status.......................... Enable
Operational Status............................. Enable
Last Startup Reason............................ Cold Auto-Restart
Time Since Last Restart........................ 4 days 17 hrs 48 mins 11 secs
Restart In Progress............................ No
Warm Restart Ready............................. Yes
Copy of Running Configuration to Backup Unit:
Status...................................... Current
Time Since Last Copy........................ 0 days 2 hrs 35 mins 17 secs
Unit NSF Support
---- -----------
1 Yes
2 Yes
Stack configuration:
LANStack2#show switch
     Management  Standby    Preconfig      Plugged-in     Switch         Code
SW   Status      Status     Model ID       Model ID       Status         Version
---  ----------  ---------  -------------  -------------  -------------  -----------
1    Mgmt Sw                PCT6224        PCT6224        OK             3.3.8.2
2    Stack Mbr   Oper Stby  PCT6224        PCT6224        OK             3.3.8.2
LANStack2#show switch 1
Switch............................ 1
Management Status................. Management Switch
Hardware Management Preference.... Unassigned
Admin Management Preference....... 12
Switch Type....................... 0xd3140001
Preconfigured Model Identifier.... PCT6224
Plugged-in Model Identifier....... PCT6224
Switch Status..................... OK
Switch Description................ PowerConnect 6224
Expected Code Type................ 0x100b000
Detected Code Version............. 3.3.8.2
Detected Code in Flash............ 3.3.8.2
Serial Number..................... CN0RN8562829822Q0045A16
Up Time........................... 4 days 18 hrs 0 mins 25 secs
LANStack2#show switch 2
Switch............................ 2
Management Status................. Stack Member
Hardware Management Preference.... Unassigned
Admin Management Preference....... 10
Switch Type....................... 0xd3140001
Preconfigured Model Identifier.... PCT6224
Plugged-in Model Identifier....... PCT6224
Switch Status..................... OK
Switch Description................ PowerConnect 6224
Expected Code Type................ 0x100b000
Detected Code Version............. 3.3.8.2
Detected Code in Flash............ 3.3.8.2
Serial Number..................... CN0RN8562829823G0012A16
Up Time........................... 4 days 17 hrs 52 mins 30 secs