iSCSI vs. TCP Delayed-ACK


     Have you ever been asked to disable TCP Delayed-ACK while troubleshooting slow read performance on a VNX or CLARiiON system? Support engineers often ask you to disable TCP Delayed-ACK on the host side as the first thing to try. The question is why? This article explains the reason.

Detailed Information

     Let’s first take a look at what TCP delayed-ACK is. TCP requires every byte that is sent to be acknowledged. Rather than sending an ACK for each TCP segment, the receiver’s TCP stack delays the ACK until a second segment arrives, which improves efficiency. On Windows, these two segments can be any size, but on Linux/Unix, both must be a full MSS (maximum segment size). If the receiver has data to send back to the sender, it ACKs the segment immediately along with that data. We call this behavior piggybacking. The last case that triggers an ACK is the expiry of the TCP delayed-ACK timer, which is commonly 200 ms to 500 ms, depending on the specific TCP implementation.
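     The three ACK triggers above can be sketched as a small toy model. This is only an illustration, not how any real TCP stack is written: the class name, method names, and the 200 ms default are assumptions made for the example, and the Windows vs. Linux/Unix segment-size difference is left out.

```python
from dataclasses import dataclass


@dataclass
class DelayedAckReceiver:
    """Toy model of the receiver-side ACK decision (hypothetical names)."""

    timeout_ms: int = 200      # assumed timer value; real stacks differ
    unacked_segments: int = 0  # segments received but not yet ACKed
    waiting_ms: int = 0        # time spent waiting since the first unACKed segment

    def on_segment(self, has_data_to_send: bool = False) -> str:
        """Called when a data segment arrives from the sender."""
        self.unacked_segments += 1
        if has_data_to_send:
            # Data is going back anyway, so the ACK rides along with it.
            return self._ack("piggyback")
        if self.unacked_segments >= 2:
            # A second segment arrived: ACK both at once.
            return self._ack("second segment")
        return "delay"

    def on_tick(self, elapsed_ms: int) -> str:
        """Called as time passes with no new segment."""
        if self.unacked_segments == 0:
            return "idle"
        self.waiting_ms += elapsed_ms
        if self.waiting_ms >= self.timeout_ms:
            return self._ack("timer expired")
        return "delay"

    def _ack(self, reason: str) -> str:
        # Sending an ACK clears the pending state.
        self.unacked_segments = 0
        self.waiting_ms = 0
        return f"ack ({reason})"
```

     In this model, a single segment with nothing following it is only acknowledged once `on_tick` has accumulated the full timeout, which is the case that matters for the iSCSI problem.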

     It’s time to correlate iSCSI and TCP Delayed-ACK. Why should we disable TCP Delayed-ACK on hosts attached to an iSCSI-enabled VNX or CLARiiON system? Actually, Delayed-ACK is not a big problem on a network that is operating normally. But on a network that is already congested for whatever reason, you will hit significant read performance degradation. To TCP, congestion means dropped packets, so TCP triggers slow start and congestion avoidance to recover from the loss and rebuild the send window. Most storage systems implement a relatively conservative TCP retransmission algorithm. They retransmit lost packets one at a time, which means the array will not retransmit the next segment until it gets an ACK from the host that is reading from the array. But almost every current TCP stack implementation has Delayed-ACK enabled by default for efficiency. So the host’s TCP stack will not ACK the array until it receives a second retransmitted segment. In the end, both sides wait for the 200 ms to 500 ms delayed-ACK timer to expire, which leads to a temporary deadlock. That’s why you will see slow read performance on a congested network: the repeated delays reduce total throughput.
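     To get a feel for the cost, here is a rough back-of-the-envelope calculation. It assumes the simplified behavior described above (one retransmission outstanding at a time, and every retransmitted segment waiting out the full delayed-ACK timer); the function name and the sample numbers are made up for illustration and are not measurements from a real array.

```python
def recovery_time_ms(
    lost_segments: int,
    rtt_ms: float,
    delayed_ack_ms: float,
    delayed_ack: bool,
) -> float:
    """Estimate the time to recover lost segments one at a time.

    Simplified model: each retransmission costs one round trip, plus the
    delayed-ACK timer when the host holds back its ACK.
    """
    per_segment_ms = rtt_ms + (delayed_ack_ms if delayed_ack else 0.0)
    return lost_segments * per_segment_ms


# Hypothetical numbers: 10 lost segments, 1 ms round trip, 200 ms timer.
with_delayed_ack = recovery_time_ms(10, 1.0, 200.0, delayed_ack=True)
without_delayed_ack = recovery_time_ms(10, 1.0, 200.0, delayed_ack=False)
print(f"Delayed-ACK enabled:  {with_delayed_ack} ms")
print(f"Delayed-ACK disabled: {without_delayed_ack} ms")
```

     With these assumed numbers the model gives 2010 ms with Delayed-ACK enabled versus 10 ms with it disabled. Real traffic will not match this exactly, but it shows why the timer dominates recovery time on a low-latency storage network.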

     The recovery process documented in the TCP RFCs is not that conservative. The TCP send window should grow more aggressively than described above.



Author: Steve Zhou


