ECN-APJ

How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

Introduction

In a SAN environment, performance degradation is a common issue. Devices respond slowly or the switches log many frame-drop warnings, and business applications are eventually affected. Usually one or a few devices cause the problem; we call them slow drain devices. A slow drain device can be a host, a storage array, or a connected switch. For some reason, it receives frames faster than it can process them, so it cannot return buffer credits to the upstream device quickly enough, which causes delay, congestion, or even frame loss. All of these lead to performance issues. The bottleneck can be at the physical layer, such as an SFP, a fiber cable, or the endpoint device itself, or it can be a SAN design defect, for example when the actual data volume exceeds the maximum processing capability.


In this article we discuss how to identify and troubleshoot slow drain devices in a Brocade SAN environment.


Detailed Information

The cause of slow drain devices

To understand the cause of the bottleneck, we need to understand how switches implement flow control. Buffer credits play the key role. Every switch port has a number of buffer credits, which is determined by negotiation between the port and the connected device. A port can send a frame only when it has a credit available, and each frame sent consumes one credit. Once the remote device has received the frame, it sends back an acknowledgement (R_RDY), and the sender's available credit count increases by one. Because buffer credits are limited, a port that has run out of credits must wait, and this waiting is the source of the delay. If a frame is held for more than 500 ms, it is dropped and its credit is released.
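The credit mechanism described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not how a switch ASIC works: the function name `simulate_link`, the tick-based timing, and all parameter values are hypothetical, and the 500 ms drop timer is left out.

```python
def simulate_link(credits, frames_to_send, drain_every, ticks):
    """Simulate a sender that is limited by buffer credits.

    credits        -- credits negotiated for the link (hypothetical value)
    frames_to_send -- number of frames queued at the sender
    drain_every    -- the receiver returns one credit every N ticks
    ticks          -- number of ticks to simulate
    Returns (frames_sent, ticks_spent_waiting_at_zero_credit).
    """
    available = credits
    outstanding = 0   # frames sent but not yet acknowledged
    sent = 0
    zero_ticks = 0
    for tick in range(1, ticks + 1):
        # The receiver frees a buffer and returns one credit (R_RDY).
        if outstanding > 0 and tick % drain_every == 0:
            outstanding -= 1
            available += 1
        # The sender may transmit only while it holds a credit.
        if sent < frames_to_send:
            if available > 0:
                available -= 1
                outstanding += 1
                sent += 1
            else:
                zero_ticks += 1   # the frame has to wait: this is the delay
    return sent, zero_ticks


fast = simulate_link(credits=8, frames_to_send=100, drain_every=1, ticks=100)
slow = simulate_link(credits=8, frames_to_send=100, drain_every=10, ticks=100)
print("fast receiver (sent, zero-credit ticks):", fast)
print("slow receiver (sent, zero-credit ticks):", slow)
```

With a receiver that returns credits slowly, the sender spends most of the run at zero credit and sends far fewer frames, which is the behaviour of a slow drain device seen from the upstream port.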

Because flow control is credit-based, a bottleneck leads to congestion along the entire data path. If the path includes an ISL between cascaded switches, all traffic through that link is affected. A single bottleneck device can therefore congest the entire network, so identifying it is essential when troubleshooting a performance issue. Endpoint devices such as hosts or storage report bottleneck issues in their own system logs. Brocade switches log messages such as the following:


2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 91.33 pct. of 300 secs. affected.

2015/01/15-19:00:37, [AN-1004], 335119, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 88.67 pct. of 300 secs. affected.

2015/01/15-19:05:40, [AN-1004], 335120, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 83.33 pct. of 300 secs. affected.
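When many such messages accumulate, it can help to extract the port and the affected percentage automatically. The following is a minimal sketch that assumes the message layout is exactly as in the three lines above; the regular expression and the function name `parse_bottleneck` are illustrative and not part of any Brocade tool.

```python
import re

# Pattern written against the sample AN-1004 lines above (an assumption:
# other FOS releases may format the message differently).
AN1004 = re.compile(
    r"^(?P<ts>\S+), \[AN-1004\], .*?"
    r"Congestion bottleneck on port (?P<port>\S+)\. "
    r"(?P<pct>[\d.]+) pct\. of (?P<secs>\d+) secs\. affected\."
)


def parse_bottleneck(line):
    """Return a dict describing one AN-1004 message, or None if no match."""
    m = AN1004.search(line)
    if m is None:
        return None
    pct = float(m.group("pct"))
    window = int(m.group("secs"))
    return {
        "timestamp": m.group("ts"),
        "port": m.group("port"),
        "percent": pct,
        "window_secs": window,
        # Seconds of the sampling window during which the port was congested.
        "affected_secs": pct / 100.0 * window,
    }


sample = ("2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, "
          "CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. "
          "91.33 pct. of 300 secs. affected.")
print(parse_bottleneck(sample))
```

For the first sample line, 91.33 percent of a 300-second window means port 10/32 was congested for roughly 274 seconds.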

Clear the counters in Brocade switch

To troubleshoot a performance issue, the first step is to clear the switch counters with the following commands:


#>statsclear

#>slotstatsclear


If you had already cleared the counters earlier, you can collect the supportshow or supportsave logs directly for analysis. If you have not, first collect a copy of the current supportshow or supportsave output, then clear the counters. The first copy can be used to quickly see which ports already show errors, so that these ports can be checked first. The sfpshow command can be used to check the TX and RX power levels on a particular port.

Identify SAN topology

In a single-switch network, all connected devices are hosts or storage. In a multi-switch network, there are also ISLs and E-Ports. Identifying the network topology helps administrators understand the data transmission path.


For example, the following islshow output shows the connectivity between the local Brocade switch and the remote switches. No. 1: local switch port 57 connects to port 55 on remote switch CHD_1C_TLI_SAN1. No. 2: local switch port 129 connects to port 135 on remote switch CHD_1D_NGN_SAN1.


islshow        :

  1: 57-> 55 10:00:00:05:1e:d2:c4:00   7 CHD_1C_TLI_SAN1 sp:  8.000G bw:  8.000G TRUNK QOS

  2:129->135 10:00:00:05:33:83:e3:00   5 CHD_1D_NGN_SAN1 sp:  8.000G bw:  8.000G TRUNK QOS

<truncated>
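To map the topology from a saved log, the ISL lines can be parsed into (local port, remote port, remote switch) records. This is a rough sketch that assumes the line layout shown above; the names `IslLink` and `parse_isl_line` are hypothetical, and the number after the WWN is assumed to be the remote domain ID.

```python
import re
from typing import NamedTuple, Optional


class IslLink(NamedTuple):
    index: int
    local_port: int
    remote_port: int
    remote_wwn: str
    remote_domain: int   # assumption: the field after the WWN is the domain ID
    remote_name: str
    speed: str
    bandwidth: str


ISL_LINE = re.compile(
    r"^\s*(\d+):\s*(\d+)->\s*(\d+)\s+"
    r"((?:[0-9a-f]{2}:){7}[0-9a-f]{2})\s+"
    r"(\d+)\s+(\S+)\s+sp:\s*(\S+)\s+bw:\s*(\S+)"
)


def parse_isl_line(line: str) -> Optional[IslLink]:
    """Parse one islshow line; return None for headers or truncated lines."""
    m = ISL_LINE.match(line)
    if m is None:
        return None
    idx, lport, rport, wwn, domain, name, sp, bw = m.groups()
    return IslLink(int(idx), int(lport), int(rport), wwn,
                   int(domain), name, sp, bw)


lines = [
    "1: 57-> 55 10:00:00:05:1e:d2:c4:00   7 CHD_1C_TLI_SAN1 sp:  8.000G bw:  8.000G TRUNK QOS",
    "2:129->135 10:00:00:05:33:83:e3:00  5 CHD_1D_NGN_SAN1 sp:  8.000G bw: 8.000G TRUNK QOS",
]
for link in filter(None, map(parse_isl_line, lines)):
    print(f"local port {link.local_port} -> {link.remote_name} port {link.remote_port}")
```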





Analyze port errors


As the following diagram shows, there are two Brocade switches in the SAN network.


     image003.jpg


Based on the islshow output above, we check the status of port 57 with the command portstatsshow 57:

portstatsshow 57

tim_txcrd_z             1381820     Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:  0           0           228512      231010

tim_txcrd_z_vc  4- 7:  521007      401291      0           0

tim_txcrd_z_vc  8-11:  0           0           0           0

tim_txcrd_z_vc 12-15:  0           0           0           0

er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout

er_tx_c3_timeout        23          Class 3 transmit frames discarded due to timeout



The Time TX Credit Zero counter (tim_txcrd_z) shows how long the port has spent with zero transmit buffer credits, counted in 2.5 microsecond ticks. A non-zero value does not by itself mean there is a performance issue. However, if the value is very high, there may be congestion somewhere in the network. As a rule of thumb, a value below 30% of the number of transmitted frames is considered normal.
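The arithmetic behind this counter can be sketched as follows. The tick length and the 30% rule of thumb come from the text above; the helper names and the transmitted-frame count are hypothetical, because the quoted output does not include the frame counter.

```python
TICK_SECONDS = 2.5e-6   # one tim_txcrd_z tick, per the portstatsshow legend
NORMAL_RATIO = 0.30     # rule of thumb quoted in this article


def zero_credit_seconds(ticks):
    """Total time the port has spent at zero TX credit, in seconds."""
    return ticks * TICK_SECONDS


def zero_credit_ratio(ticks, frames_tx):
    """Ratio of zero-credit ticks to transmitted frames."""
    if frames_tx <= 0:
        raise ValueError("frames_tx must be positive")
    return ticks / frames_tx


# Values for port 57 from the output above.
tim_txcrd_z = 1381820
per_vc = [0, 0, 228512, 231010, 521007, 401291, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0]

print("sum of per-VC counters:", sum(per_vc))
print("seconds at zero credit:", zero_credit_seconds(tim_txcrd_z))

# Hypothetical number of transmitted frames, for illustration only.
frames_tx = 2_000_000
ratio = zero_credit_ratio(tim_txcrd_z, frames_tx)
print("ratio:", ratio, "normal" if ratio < NORMAL_RATIO else "suspicious")
```

For port 57, the per-VC counters add up to the port total, and 1381820 ticks correspond to about 3.45 seconds spent at zero credit since the counters were last cleared.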



The c3_timeout counter is used to verify whether frames are being lost. Prior to FOS 6.3.1, the counter had no direction. From FOS 6.3.1 onward, it is replaced by the er_rx_c3_timeout and er_tx_c3_timeout counters. When a port sends or receives a frame, the frame occupies a buffer credit. If the port does not receive a response within 500 ms, the transmission fails, the frame is dropped, and the counter is incremented by one. A non-zero value therefore indicates a performance issue. In this case, er_tx_c3_timeout is not zero.



Let's take a look at the downstream port:

portstatsshow 55

tim_txcrd_z             1259255     Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:  0           0           239711      218720

tim_txcrd_z_vc  4- 7:  403321      397503      0           0

tim_txcrd_z_vc  8-11:  0           0           0           0

tim_txcrd_z_vc 12-15:  0           0           0           0

er_rx_c3_timeout        31          Class 3 receive frames discarded due to timeout

er_tx_c3_timeout        0           Class 3 transmit frames discarded due to timeout



The er_rx_c3_timeout counter is not zero, which means frames on this port also waited longer than 500 ms and were dropped. Note that the upstream er_tx_c3_timeout is not always equal to the downstream er_rx_c3_timeout; the values depend on when the counters were cleared and when the logs were collected.


We have checked the ISL between the two switches; now let's find the device that causes the congestion. Since we saw er_tx_c3_timeout on the upstream port and er_rx_c3_timeout on the downstream port, there should be an affected F-Port on the upstream switch and another on the downstream switch.


How do we find these abnormal ports? We check the porterrshow output of both switches, and find that port 21 and port 27 have problems:


portstatsshow 21

er_rx_c3_timeout 22 Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 0           Class 3 transmit frames discarded due to timeout

portstatsshow 27

er_rx_c3_timeout 0           Class 3 receive frames discarded due to timeout

er_tx_c3_timeout 31 Class 3 transmit frames discarded due to timeout
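The reasoning used so far can be summarised in a few lines of code: follow the timeout counters along the path until an F-Port with transmit timeouts is reached. This is a sketch of that reading, not an official Brocade procedure; the class `PortStats`, the function `classify`, and the port-type labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PortStats:
    port: int
    kind: str            # "E" for ISL ports, "F" for device-facing ports
    rx_c3_timeout: int   # er_rx_c3_timeout
    tx_c3_timeout: int   # er_tx_c3_timeout


def classify(p: PortStats) -> str:
    """Interpret the C3 timeout counters of one port."""
    if p.tx_c3_timeout > 0 and p.kind == "F":
        # Frames could not be sent to the attached device in time.
        return "slow-drain candidate"
    if p.tx_c3_timeout > 0:
        # Frames could not be sent across the ISL: look further downstream.
        return "congested ISL, follow the link"
    if p.rx_c3_timeout > 0:
        # Frames received here were dropped while waiting for an egress port.
        return "affected by congestion elsewhere"
    return "clean"


# Counter values quoted in this article; port types as described in the text.
ports = [
    PortStats(57, "E", rx_c3_timeout=0, tx_c3_timeout=23),
    PortStats(55, "E", rx_c3_timeout=31, tx_c3_timeout=0),
    PortStats(21, "F", rx_c3_timeout=22, tx_c3_timeout=0),
    PortStats(27, "F", rx_c3_timeout=0, tx_c3_timeout=31),
]
for p in ports:
    print(p.port, classify(p))

candidates = [p.port for p in ports if classify(p) == "slow-drain candidate"]
print("check the device attached to port(s):", candidates)
```

Under these assumptions, port 27 is the only F-Port with transmit timeouts, so the device attached to it is the first place to look.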



Let’s take a look at the diagram again:



image004.jpg



Are any other ports affected? Since there is only one ISL between the two switches, all ports on the data transmission path are affected as well. Note that port 26 on the downstream switch is not affected, since its data is congested on the upstream switch.


image005.jpg



In a multi-switch SAN environment, we can follow the same steps to find the abnormal device from the portstatsshow output. In a single-switch environment, we only need to check the F-Ports.

Troubleshoot bottleneck devices

Next we need to find the cause of the bottleneck. The usual steps are:

1. Use porterrshow or portstatsshow to check for errors at the physical layer.

2. Use sfpshow to check the power levels of the SFP modules.

3. Use switchshow to check the port status.

4. Use fabriclog --show to check whether any port has been reset.

5. If the steps above find nothing, check the connected device.


Back to this case: we find some errors at the physical layer, and the RX power level is below -7 dBm. So we need to check the fiber cable between the switch and the device.


portstatsshow 27

er_enc_out 34181       Encoding error outside of frames

er_bad_os 23541       Invalid ordered set

sfpshow 27

RX Power:    -23.0   dBm (0.5   uW)  10.0   uW  1258.9 uW   15.8   uW  1000.0 uW

TX Power:    -3.2    dBm (477.3 uW)  125.9  uW  631.0  uW   158.5  uW  562.3  uW
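The dBm and uW figures in sfpshow are two views of the same measurement, related by P(uW) = 1000 * 10^(dBm / 10). The sketch below converts between them and compares a reading with the thresholds. It assumes that the four trailing values on each line are the low alarm, high alarm, low warning, and high warning thresholds; the function names are illustrative.

```python
import math


def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts (0 dBm = 1 mW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def uw_to_dbm(uw):
    """Convert optical power from microwatts to dBm."""
    return 10.0 * math.log10(uw / 1000.0)


def power_status(dbm, low_alarm_uw, low_warn_uw):
    """Compare a power reading with the low thresholds (in uW)."""
    uw = dbm_to_uw(dbm)
    if uw < low_alarm_uw:
        return "alarm"
    if uw < low_warn_uw:
        return "warning"
    return "ok"


# Readings and low thresholds for port 27 from the output above.
print("TX -3.2 dBm  =", round(dbm_to_uw(-3.2), 1), "uW ->",
      power_status(-3.2, low_alarm_uw=125.9, low_warn_uw=158.5))
print("RX -23.0 dBm =", round(dbm_to_uw(-23.0), 1), "uW ->",
      power_status(-23.0, low_alarm_uw=10.0, low_warn_uw=15.8))
print("RX low alarm 10.0 uW =", uw_to_dbm(10.0), "dBm")
```

By this formula, -23.0 dBm corresponds to about 5 uW rather than the 0.5 uW shown in the pasted output, so one of the two figures may have been altered in transcription. Either value is below the 10.0 uW (-20 dBm) low threshold, which supports checking the fiber cable.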



image006.jpg




After the fiber cable was replaced, the problem was solved, which indicates that the bad cable caused it. Sometimes no problem can be found on the switches; in that case, check the connected device (e.g. the HBA card).

Summary


The key to troubleshooting Brocade SAN performance issues is to follow the congested data path to the bottleneck device. Understanding the difference between er_rx_c3_timeout and er_tx_c3_timeout is essential.


We suggest clearing the counters while the devices are working normally. When a performance issue occurs, only the logs collected during that period are meaningful.




2 Replies
RRR

Re: How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

I'm actually looking for a way to detect what my actual slow draining devices are. We're experiencing high latencies from time to time and we cannot find the slow device. Storage never shows any high utilization: cache is fine, disks show 15% util, cpus are in the 20%-30% regions, so nothing points to storage problems.

But how can we detect where the actual slowest part in the SAN environment is?

ECN-APJ

Re: How To Identify And Troubleshoot Slow Drain Device In Brocade SAN Environment

Hi RRR.

Are you using Cisco MDS switches? If yes, you may also refer to this article: http://www.cisco.com/c/en/us/products/collateral/storage-networking/mds-9700-series-multilayer-direc....
