How To Identify And Troubleshoot Slow Drain Device In Cisco SAN Environment
In a SAN environment, performance issues are among the most common problems administrators encounter. When a performance issue occurs, device processing speed degrades and I/O warnings appear.
The cause could be hosts, storage, or switch ports: for some reason they fail to return buffer credits to the upstream device, which leads to delays, congestion, and even dropped frames.
Contributing factors can be at the physical layer, e.g. a fiber optic module (SFP), fiber optic cable, or patch panel/ODF, or at the end devices, such as insufficient host CPU/RAM, non-optimized SQL statements, or poor storage performance. The SAN itself can also be the cause, for example when actual traffic exceeds the maximum link capacity.
This article introduces how to identify and troubleshoot a slow drain device in a Cisco SAN environment.
Detailed Information
1. The cause of slow drain device
To understand the cause of a bottleneck, we first need to understand how switches implement flow control. The buffer credit plays the key role. Every switch port has a number of buffer credits, determined by negotiation between the port and the connected device. A port can only transmit a frame when it has an available buffer credit, and transmitting consumes one credit. When the remote device receives the frame, it sends back an acknowledgment and the available credit count is incremented by one. Since buffer credits are limited, a port that has no credits left must wait, which introduces delay. If a frame is held for more than 500 ms, it is dropped and its credit is released.
Because of this credit-based flow control mechanism, a bottleneck leads to congestion along the entire data path. If the path includes a cascading (ISL) link, all traffic crossing that link is affected, so a single bottleneck device can congest the whole fabric. Identifying the bottleneck device is therefore the key step when troubleshooting a performance issue. For end devices such as hosts or storage, the operating system will typically report the issue.
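The credit mechanism described above can be sketched in a few lines of Python. This is a toy model for illustration only (the class and method names are ours); the per-frame credit accounting and the 500 ms drop timer follow the description in the text:

```python
class BBPort:
    """Toy model of buffer-to-buffer (B2B) credit flow control."""

    def __init__(self, credits):
        self.available = credits   # negotiated B2B credits
        self.in_flight = []        # (frame, send_time) awaiting acknowledgment

    def send(self, frame, now):
        if self.available == 0:
            return False           # no credit: the port must wait (delay)
        self.available -= 1        # transmitting consumes one credit
        self.in_flight.append((frame, now))
        return True

    def receive_ack(self):
        """Remote device acknowledged the oldest frame: one credit returns."""
        if self.in_flight:
            self.in_flight.pop(0)
        self.available += 1

    def expire(self, now, timeout=0.5):
        """Frames held longer than 500 ms are dropped and their credit freed."""
        dropped = [f for f, t in self.in_flight if now - t > timeout]
        self.in_flight = [(f, t) for f, t in self.in_flight if now - t <= timeout]
        self.available += len(dropped)
        return dropped
```

With two credits, a third frame cannot be sent until an acknowledgment comes back; if acknowledgments stop arriving (a slow drain device), frames eventually hit the 500 ms timer and are dropped.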
2. Clear the counters on all Cisco MDS switches
To troubleshoot a performance issue, the first step is clearing the switch counters. We can use the following commands:
To clear interface counters:
MDS-9509# clear counters interface all
To clear interface counters if port-channels are configured:
MDS-9509# clear counters interface port-channel <1-256>
To clear ASIC counters, it's required to 'attach' to each line card:
MDS-9509# attach module 1
Attaching to module 1 ...
To exit type 'exit', to abort type '$.'
Bad terminal type: "ansi". Will assume vt100.
module-1# clear asic-cnt all
To clear IPS stats on a GigabitEthernet port:
MDS-9509# clear ips stats all
After clearing the counters, we recommend waiting 4-6 hours, or half an hour after a serious performance issue occurs, and then collecting logs with the following commands:
show tech-support details
show logging onboard
If we already know the start time of the performance issue, we can collect the logs directly:
show logging onboard starttime mm/dd/yy-HH:MM:SS
e.g.
show logging onboard starttime 09/29/15-06:23:50
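If you are scripting log collection, the starttime string in that mm/dd/yy-HH:MM:SS form can be built from a Python datetime (a small helper of our own, shown with the timestamp from the example above):

```python
from datetime import datetime

def obfl_starttime(dt):
    """Format a datetime into the mm/dd/yy-HH:MM:SS string
    accepted by 'show logging onboard starttime'."""
    return dt.strftime("%m/%d/%y-%H:%M:%S")

# The timestamp used in the example above:
print(obfl_starttime(datetime(2015, 9, 29, 6, 23, 50)))  # 09/29/15-06:23:50
```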
3. Identify SAN topology
For a single-switch network, all connected devices are hosts or storage. For a multi-switch network, there will also be ISL links and E-Ports. Identifying the network topology helps administrators understand the data transmission path. For example, the following “show topology” output shows the connectivity status between this switch and the remote switch, and the “show fcs ie” output lists all switch information in each VSAN.
Sh9222i-2# show topology
FC Topology for VSAN 1 :
--------------------------------------------------------------------------------
Interface PeerDomain Peer Interface Peer IP Address(Switch Name)
--------------------------------------------------------------------------------
port-channel100 0x01(1) port-channel 100 10.32.167.225(Sh9222i-1)
FC Topology for VSAN 30 :
--------------------------------------------------------------------------------
Interface PeerDomain Peer Interface Peer IP Address(Switch Name)
--------------------------------------------------------------------------------
port-channel100 0x0a(10) port-channel100 10.32.167.225(Sh9222i-1)
FC Topology for VSAN 40 :
--------------------------------------------------------------------------------
Interface PeerDomain Peer Interface Peer IP Address(Switch Name)
--------------------------------------------------------------------------------
port-channel100 0x91(145) port-channel100 10.32.167.225(Sh9222i-1)
Sh9222i-2# show fcs ie
IE List for VSAN: 1
-------------------------------------------------------------------------------
IE-WWN IE Mgmt-Id Mgmt-Addr (Switch-name)
-------------------------------------------------------------------------------
20:01:00:05:73:ad:26:01 S(Loc) 0xfffc03 10.32.167.226 (Sh9222i-2)
20:01:00:05:73:ad:2a:01 S(Adj) 0xfffc01 10.32.167.225 (Sh9222i-1)
[Total 2 IEs in Fabric]
IE List for VSAN: 30
-------------------------------------------------------------------------------
IE-WWN IE Mgmt-Id Mgmt-Addr (Switch-name)
-------------------------------------------------------------------------------
20:1e:00:05:73:ad:26:01 S(Loc) 0xfffc14 10.32.167.226 (Sh9222i-2)
20:1e:00:05:73:ad:2a:01 S(Adj) 0xfffc0a 10.32.167.225 (Sh9222i-1)
[Total 2 IEs in Fabric]
IE List for VSAN: 40
-------------------------------------------------------------------------------
IE-WWN IE Mgmt-Id Mgmt-Addr (Switch-name)
-------------------------------------------------------------------------------
20:28:00:05:73:ad:26:01 S(Loc) 0xfffc90 10.32.167.226 (Sh9222i-2)
20:28:00:05:73:ad:2a:01 S(Adj) 0xfffc91 10.32.167.225 (Sh9222i-1)
[Total 2 IEs in Fabric]
4. Analyze storage performance
A typical SAN topology looks like this: two switches connected by four ISL links bundled into one port-channel.
The bottlenecks could be:
Edge switches:
- Server performance issues: applications or the operating system
- HBA issues: driver or hardware problems
- Mismatched connection speeds
- An abnormal virtual machine in a virtualization environment
- Storage issues, such as an overloaded array
Interconnect links:
- Lack of buffer credits, especially over long-distance transmission
- End devices that are themselves bottlenecks
- Edge switch ports running faster than the cascaded (ISL) port
First, we should check the interface counters for discards or CRC errors. Then we can use "show logging onboard" to check for timeout, credit loss, or packet drop errors.
This table describes the counter names and what they stand for:
The most important ones to pay attention to are:
FCP_SW_CNTR_TX_AVG_B2B_ZERO
It counts the time spent at zero transmit buffer credits. Hitting zero buffer credits does not by itself mean there is a performance issue; however, if the value is very high, there may be congestion somewhere in the fabric. As a rule of thumb, if the count is less than 30% of the transmitted frames, it is normal.
c3_timeout
The c3_timeout counter is used to verify whether frames are being lost. When a port sends a frame, it consumes a buffer credit; if no response is received within 500 ms, the transmission has failed, the frame is dropped, and the counter is incremented by one. A non-zero value here means there is a real performance issue.
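The two rules of thumb above — zero-credit events under roughly 30% of transmitted frames are acceptable, while any c3 timeout means frames were actually dropped — can be written as a small triage helper (illustrative Python; the function name and return strings are ours, the thresholds come from the text):

```python
def assess_port(tx_frames, tx_b2b_zero, c3_timeouts):
    """Triage a port using the rules of thumb described above.

    tx_frames:    frames transmitted since the counters were cleared
    tx_b2b_zero:  FCP_SW_CNTR_TX_AVG_B2B_ZERO count
    c3_timeouts:  c3_timeout (timeout discard) count
    """
    if c3_timeouts > 0:
        # Frames held over 500 ms were dropped: a real performance issue.
        return "frame loss"
    if tx_frames and tx_b2b_zero / tx_frames > 0.30:
        # High time at zero credit suggests congestion somewhere.
        return "possible congestion"
    return "normal"

print(assess_port(1000, 100, 0))  # normal
print(assess_port(1000, 400, 0))  # possible congestion
print(assess_port(1000, 400, 5))  # frame loss
```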
The "show logging onboard" output has many counters. Pay particular attention to the following ones:
1) Transmit lack-of-credit errors
FCP_CNTR_QMM_CH0_LACK_OF_TRANSMIT_CREDIT
AK_FCP_CNTR_RCM_CH0_LACK_OF_CREDIT
2) All timeout or packet drop information
FCP_CNTR_LAF_C3_TIMEOUT_FRAMES_DISCARD
THB_TMM_TOLB_TIMEOUT_DROP_CNT
FCP_CNTR_LAF_TOTAL_TIMEOUT_FRAMES
3) All buffer average to zero information
FCP_SW_CNTR_TX_AVG_B2B_ZERO
If these counters are very high, we should pay attention to the connected devices. They are probably bottleneck devices, also called slow drain devices.
We can use the following command to check the remaining buffer credits:
Sh9222i-2# show interface fc1/7 | inc "fc|redit"
fc1/7 is trunking
Transmit B2B Credit is 250
Receive B2B Credit is 250
250 receive B2B credit remaining
250 transmit B2B credit remaining
250 low priority transmit B2B credit remaining
Example:
Port 57 on switch 1 shows FCP_CNTR_QMM_CH0_LACK_OF_TRANSMIT_CREDIT or c3 TX timeouts. Port 55 on switch 2 also shows timeout counters. If there is no congestion in the switch and no hardware issue on the links, there must be a slow drain device connected to switch 2. We can use the "show interface counters" command to find this device:
Sh9222i-2# show int fc1/1 counters
fc1/1
5148945 frames output, 429093396 bytes
100 discards, 0 errors
100 timeout discards, 14130 credit loss
2745464 Transmit B2B credit transitions to zero
11 Receive B2B credit transitions to zero
32 receive B2B credit remaining
1 transmit B2B credit remaining
1 low priority transmit B2B credit remaining
We can also check for credit-loss or request-timeout errors in the logging onboard logs.
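When many ports have to be checked, the key fields can be scraped from the "show interface counters" output with a short script. This is a sketch that assumes output shaped exactly like the sample above (interface name on its own line, then indented counter lines); a production version would need to handle the full command output:

```python
import re

IFACE = re.compile(r"^\s*(fc\d+/\d+)\s*$")
DROPS = re.compile(r"(\d+) timeout discards, (\d+) credit loss")

def find_slow_drain(output):
    """Return (interface, timeout_discards, credit_loss) tuples for
    ports whose counters point at a possible slow drain device."""
    suspects, current = [], None
    for line in output.splitlines():
        m = IFACE.match(line)
        if m:
            current = m.group(1)   # remember which port the counters belong to
            continue
        m = DROPS.search(line)
        if m and current:
            timeouts, credit_loss = int(m.group(1)), int(m.group(2))
            if timeouts or credit_loss:
                suspects.append((current, timeouts, credit_loss))
    return suspects

# The sample output from the example above:
sample = """fc1/1
    5148945 frames output, 429093396 bytes
    100 discards, 0 errors
    100 timeout discards, 14130 credit loss
    2745464 Transmit B2B credit transitions to zero
"""
print(find_slow_drain(sample))  # [('fc1/1', 100, 14130)]
```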
5. Summary
Next, we need to find the cause of the bottleneck device. Here are the typical steps:
1. Use "show topology" and "show fcs ie" to get a picture of the SAN topology
2. Use "show interface" to check for errors or discards
3. Use "show interface transceiver details" to check the power levels of the SFP modules
4. Use "show logging log" to check for link failures
5. Use "show logging onboard" to check for credit loss or packet drop errors
6. Check the connected devices if the above steps turn up nothing
Anonymous
March 7th, 2016 12:00
Nice post! My favorite for tough ones was always the transitions "TO zero" counters. I think DM still has these counters too.
Remember, for it to trip the "AT zero" counter, it must be at zero for 100 ms or more. Often, something's going on and yet these counters are not accumulating.
If you have the chance to get on the core switch while it's happening, you'll see it accumulating (usually in high numbers) on the transmit side. If there's another core switch, the transmit port will be an ISL. Follow the rabbit down that hole too, and repeat until you get to the offending end port that's too busy to talk to anyone. Worst case, you may need to look inside the frame if the cause is not apparent in the end device's logs, collects, etc.