Article Number: 541073


Data Domain Network Performance and configuration

Summary: Data Domain Network Performance and configuration

Primary Product: Data Domain

Product: Data Domain

Last Published: 20 Apr 2020

Article Type: How To

Published Status: Online

Version: 3

Data Domain Network Performance and configuration

Article Content

Instructions
DD Network Configuration for performance

Several steps can be taken to improve network performance on DD systems:

1 - Validate system interfaces are configured and available
2 - Measure network performance at the kernel level
3 - Increase the packet size, both internal and external
4 - Increase the number of interfaces used
5 - Change the network memory allocation and use
6 - Reduce the number of dropped packets and the recovery method


1    Validate interfaces are configured correctly

Before doing any performance work, make sure the systems can communicate with each other.  This is done with the following checks (a scripted version of checks B and C is sketched after the list):
        A   Make sure the interfaces of interest on the two systems are in the running state with the correct IP addresses
        B   Check that the systems can ping each other using the addresses given
        C   Validate that connections can be made between the two systems, which can be done using telnet in both directions
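
For illustration only, checks B and C can be scripted from any host with network access to both systems.  The host names below are placeholders, and port 5001 is used because it is the default iperf port referenced in the next step.

# Hedged sketch: scripted versions of check B (ping) and check C (TCP connect).
import socket
import subprocess

def can_ping(host):
    # One ICMP echo with a 2-second timeout (check B); Linux ping flags assumed.
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

def can_connect(host, port=5001, timeout=3.0):
    # Try a TCP connection, as telnet would (check C).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in ("dd-a.example.com", "dd-b.example.com"):  # placeholder names
    print(host, "ping:", can_ping(host), "tcp/5001:", can_connect(host))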

If these checks pass, proceed to the next step.


2    Measure network performance at the kernel level

There are several programs that can test network throughput, but the primary one used with DD systems is iperf.  A DD system primarily receives data, but it can also send data when doing restores or when acting as the source for replication.  Therefore, most of the time iperf will run on the DD system as a server; for testing restores and replication sources, iperf is run as a client.  There are DD CLI and bash shell commands to run iperf.

Using the DD commands, run the following to start iperf in listen mode on port 5001, the default port used for iperf:


net iperf server

The parameters that may be used with this command are ipversion, port, and window-size.  Most of the time these are not needed, although in an IPv6 environment, ipversion ipv6 should be used.

To check the throughput of a DD system doing restores, or acting as the source for replication, use the iperf client:

net iperf client <destination system>
     [interval <seconds>]       Default is none if not specified
     [duration <seconds>]       Default is 10 seconds
     [connections <number>]     Number of connections to use, default is 1
     [window-size <bytes>]      The TCP window size to be used, default is 256 KBytes
     [ipversion ipv6]           Default is ipv4 if not specified
     [data random]              Default is not random
     [transmit-size <bytes>]    Amount of data to send, default is no limit
     [port <number>]            Default is 5001
     [nodelay]                  Default is to wait for responses

The nodelay, port, transmit-size, and data random options are not normally specified, and ipversion is specified when IPv6 is used.  What is typically specified, besides the destination, is interval, duration, and connections, depending on the network environment and link speed.  Multiple connections are useful if there are long round-trip times (RTT) or interface speeds over 10 Gb/s.  Longer durations give better steady-state values, but they can also be disruptive to anything else using the interface under test.  The results show what can be expected at the network (TCP) level.
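
For example, following the synopsis above, the following would test throughput toward a replication destination for 60 seconds with 8 connections, reporting every 5 seconds (the host name is a placeholder):

net iperf client repl-dest.example.com interval 5 duration 60 connections 8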

The results of this test show whether the network is causing the throughput to be low or whether it is something else on the systems.  If the iperf measurements indicate that the network is causing the throughput to be slower than expected, look to the next step to see what can be done to improve the network flow.  As a rule of thumb, consider the following in determining what throughput to expect:

For a 1 GbE interface, anything above 920 Mb/s is considered acceptable, but the target would be 940 Mb/s.  For 10 GbE, anything above 9 Gb/s is considered acceptable, but expect no more than 9.4 Gb/s.  For 25 GbE, anything above 24 Gb/s should be considered acceptable.  For 100 GbE, anything above 99 Gb/s should be considered acceptable.  If the throughput is less than these, then some network configuration may be done.  For more details on the actual speed versus the theoretical/mathematical speed, look in Appendix A.



3    Increase the internal and external packet size

There is a difference between internal and external packet sizes.  Internal packet sizes can be as large as 64 KB.  Throughput is impacted by the efficiency of the network driver, the interface hardware, and the effectiveness of transferring packets between the TCP module and the network driver.  By default, the DD network stack uses TSO (TCP Segmentation Offload) and LRO (Large Receive Offload), along with fragmentation being disabled in the IP network layer, to improve network performance.  With TSO and LRO, the packets sent from and received by TCP can be bigger than the MTU size, up to 64 KB in length.  The network hardware divides the packet into MTU-size packets when sending data and combines the packets being received into larger packets.  This way there is less network overhead in handling the packets in the network stack.  As stated, all this is on by default no matter what the MTU is.  One study done within Data Domain found that on a 10 Gb interface, TSO and LRO with IPv6 addressing had a big impact on performance, even though TSO is implemented in software for IPv6.  Without TSO and LRO the performance was around 5 - 6 Gb/s; to get over 9 Gb/s throughput on a 10 GbE interface, TSO is used.

Of course, the TSO packet size depends on the size of the application buffers.  Large application buffers of greater than 100 KB, and preferably over 1 MB, will help use the maximum packet size.  It is recommended that the socket buffer size be a multiple of 64 KB.  Setting the application's socket buffer to 1,048,576 bytes or larger is a good idea when large transfers are being done.
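
As a generic POSIX illustration (not a DD-specific API), the following sketch requests a 1 MiB socket buffer, a multiple of 64 KB as recommended above:

# Hedged sketch: request large application socket buffers before connecting.
import socket

BUF_SIZE = 1_048_576  # 1 MiB = 16 * 64 KiB, a multiple of 64 KB

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
# The kernel may round or cap the request; check what was actually granted.
print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))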

The external packet sizes are determined by the MTU value of each interface and the MTU of the network path used.  In general, backup on the LAN should have the MTU set to 9000 and replication over a WAN should use 1500, but the actual value will depend on the network environment and the interface speed.  For a 1 Gb interface, an MTU of 1500 may be suitable, but for 10 Gb interface speeds, an MTU of 9000 is recommended on LANs if the LAN supports it.

Setting the MTU correctly for the network environment is important because the IP "don't fragment" flag is enabled, and it is expected that the MTU for the WAN will be limited to 1500 or less.  Because of this, and other factors such as the use of pause frames, it is strongly recommended not to combine WAN and LAN traffic on the same interface.  Having these combined on one interface can work, of course, but the performance of the LAN, the WAN, or both may be impacted.  Therefore, if there are network performance concerns, it is recommended to split LAN and WAN traffic onto different physical interfaces.



4    Increase the number of interfaces used

To get speeds greater than the line speed of one interface, interfaces can be bonded together for a higher combined throughput.  The combined throughput won't necessarily double for two interfaces, but it should be close, depending on the distribution of the traffic, which in turn depends on the hash method being used.  There are other documents and KB articles that cover bonding, but a summary follows:

There are three aggregation bonding modes supported: Roundrobin, Balanced, and LACP.  Roundrobin is not normally recommended, but it can work, especially in direct-attached cases.  The last two use a transmit hash to distribute the packets across the interfaces.  The available transmit hashes on the DD are:
xor-L2 (XOR of source and destination MAC addresses),
xor-L2L3 (XOR of source and destination MAC addresses and IP addresses),
xor-L3L4 (XOR of source and destination IP addresses and TCP port numbers).

xor-L3L4 is recommended, but xor-L2L3 can be an alternative.  If there is only one source and destination, as can happen with replication or backup with only one media server, then all traffic would go through only one interface unless xor-L3L4 is used with multiple connections.  Also note that this is a transmit hash: it applies only to transmitted data, not to received data.  For backup data, the switch/router hash (load balance) controls the flow across the interfaces.  Note, contrary to some switch documentation, the hash does not have to match on both sides, but they should be close.  Specifically, the DD hash and the switch hash values do not match, and in fact cannot match, because the hash on the DD is different from what is available on switches.  It does make sense to make them close.  For example, xor-L3L4 is usually recommended for DD systems; on the switch side, the hash should then be source and destination port.  If xor-L2L3 is used on the DD system, then source and destination IP should be used on the switch.
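
The following simplified sketch (not the exact kernel or DD hash formula) shows why an xor-L3L4 hash can spread several connections between one source and one destination across bonded interfaces, while a hash without port numbers cannot: only the TCP ports differ between the flows below.  The IP addresses and ports are made up.

# Hedged sketch of an L3L4-style transmit hash.
import ipaddress

def xor_l3l4_slave(src_ip, dst_ip, src_port, dst_port, n_slaves):
    # XOR the IP addresses and TCP ports, then pick a slave index.
    h = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= src_port ^ dst_port
    return h % n_slaves

# Four connections between the same pair of hosts land on different slaves:
for port in (50001, 50002, 50003, 50004):
    print("src port", port, "-> slave",
          xor_l3l4_slave("10.0.0.1", "10.0.0.2", port, 2051, 2))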

LACP is the recommended aggregation mode because of its better failover capability, but the alternative of using DDBoost with ifgroups should also be considered.  When trying to increase performance, especially with aggregation bonding and traffic over a WAN, use multiple connections.  DDBoost with ifgroups already does this automatically.  There should be at least 4 connections used, but in many cases 8 to 16 connections or more should be considered, depending on the throughput expected.  This makes the total throughput less sensitive to delays and dropped packets.  There is a limit, based on the network line speed, where adding more connections will actually slow down the overall throughput, but that would be when the number of active replication connections is over 100 for a 1 Gb interface.



5    Change the network memory allocation and use

If, after setting things up as described in step 3, the desired throughput is not achieved, there are TCP and driver configuration changes that can be made.  At the TCP level there are the network memory allocations.  These are set to the best values for most DD systems with the normal use of backup and replication.  There are some things that are important to keep in mind.  The first two are the round-trip time (RTT) and the number of dropped packets at the receiver.  On the LAN, the RTT is expected to be less than 0.3 milliseconds.  This can be tested with a ping; the congestion log will also give this information.  For dropped packets, the interface statistics for all the active interfaces can be seen using the command

net show stats interfaces 

For details, look in the ethtool log.  Normally, if there is a slowdown because of larger-than-expected RTT, network captures can show whether TCP zero windows are occurring.  Changing tcp_rmem would be recommended for replication, but this has no impact on backup traffic, because the backup program sets the tcp_rmem value internally using a socket option (setsockopt).  The value the backup program uses can be changed via the DDFS configuration.  On the other hand, if the statistics show a large number of dropped packets, then consider enabling pause frames and/or increasing the receive ring buffer size.  This is discussed later under driver configuration.

Most network performance issues come from replication, because traffic over a WAN is where the RTT is higher than 10 milliseconds and there is a greater possibility of dropped packets, either at the driver or on the network.  If there are no dropped packets, or only a limited number, then the RTT can be used as a guide for how big tcp_rmem should be.  There are three numbers.  The first one should never need to be changed; it is the default and keeps the memory use of internal connections small.  Use the following formula to determine the size of the middle number:

(Throughput in Bytes/sec) * (RTT in seconds) <= (tcp_rmem middle number)

For example,
        Throughput  = 100 Mb/s = 12.5 MB/s,
        RTT = 20 milliseconds = 0.02 sec
        The default for tcp_rmem is "4096 262144 6190560"

12,500,000 * 0.02 = 250,000 which is < 262,144
No change is needed.

Another example,
        Throughput  = 50 Mb/s = 6.25 MB/s,
        RTT = 150 milliseconds = 0.15 sec
        The default for tcp_rmem is "4096 262144 6190560"

6,250,000 * 0.15 = 937,500 which is > 262,144
In this case it should be increased.  The value should be a power of two.
The recommendation here would be to set it to 4 * 262144 = 1,048,576
The command to do this is:

net option set net.ipv4.tcp_rmem "4096 1048576 6190560"
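
The sizing rule and the rounding used in the two examples above can be expressed as a small helper.  This is illustrative only; it reproduces both worked examples:

# Hedged sketch: bandwidth-delay product vs. the tcp_rmem middle value.
import math

def recommended_rmem_middle(throughput_mbps, rtt_ms, default_middle=262_144):
    # BDP in bytes = throughput (Bytes/s) * RTT (s)
    bdp = (throughput_mbps * 1_000_000 / 8) * (rtt_ms / 1000.0)
    if bdp <= default_middle:
        return default_middle                  # no change needed
    factor = 2 ** math.ceil(math.log2(bdp / default_middle))
    return default_middle * factor             # next power-of-two multiple

print(recommended_rmem_middle(100, 20))   # 262144  (first example: no change)
print(recommended_rmem_middle(50, 150))   # 1048576 (second example: 4 * 262144)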

The third value should be at least two times the middle value.  In the second example, there is no reason to change it.
There may be a case where the settings get too large for the kernel memory.  In that case net.core.rmem_max may need to be changed; the default value is 8,388,608.  net.ipv4.tcp_mem may also need to be changed.  Its default value is "6187557 8250076 12375114", but where the other values are in bytes, these values are in memory pages.  Note also that this limit applies across all TCP connections.

This gets complicated if there is a significant number of dropped packets with high RTT.  Depending on the RTT and the throughput, there may be a need to change the congestion avoidance algorithm.  The default is cubic, which is normally fine, but in some cases with long RTT and "dirty networks" (lots of dropped packets on the network path) a different congestion avoidance algorithm may be needed.



6    Reduce the number of dropped packets and the recovery method

If there are too many dropped packets on the receive side, there are two driver configurations to consider:

a - Pause frames and rx ring buffer size on the receiving system, if there are dropped packets or overrun errors.

b - Changing the txqueuelen may be considered on the transmit side.

The network receive code can get overwhelmed if the application is slow in reading the data received, there is not enough memory to allocate for the new data received, there are not enough processors and interrupt vectors to handle the number of packets being received, or the driver is waiting on locks and can't handle the receive flow from the hardware.  In any of these cases, the received data may overflow and cause some network packets to be dropped.  This condition is recognized by the driver, which sends a pause frame (Xoff) to the switch to make the switch stop sending until the problem is at least partially resolved.  When the problem is resolved, the driver sends an unpause (Xon) frame.  This is the normal way the driver throttles the receive flow.  To enable receive pause frames for the eth4b interface, run the following command:

$ net option set ethtool.pause.rx.eth4b on

To turn it off use the following:

$ net option set ethtool.pause.rx.eth4b off

This is done per interface, but it can only be done for physical interfaces.  It cannot be used directly on bonded interfaces, VLAN interfaces, or alias interfaces.  Also, this only slows down the throughput to allow the application to catch up.  If the slowdown is not seen in the iperf testing, there may be an application problem that this change will not resolve.  In that case, look at what performance improvements can be made in the applications being used.  Also, there are cases in which the customer does not want to enable pause frames.  One reason is that a paused interface pushes the problem back to the switches and routers, which then need to slow down parts of their network or drop packets.  This should only be used if allowed by the customer and only for very short intervals.

Another option, which can be used in conjunction with pause control or instead of it, is to change the Ring Buffer size on specific interfaces to reduce the number of dropped packets.  As part of handling data, the driver uses a receive Ring Buffer and a transmit Ring Buffer to handle SKBs passed within the network stack.  These buffers contain the packet information to be sent to or received from the network.  The Ring Buffers are also known as the Driver Queue.  This is a circular FIFO buffer containing pointers to the SKBs.  The size specified determines how many SKBs can be queued to be sent (tx Ring Buffer) or received (rx Ring Buffer) while awaiting further processing, either by the driver for packets being sent to the hardware to be transmitted, or by the network stack for packets being received.  If the rx Ring Buffer fills up, any further packets are dropped.  Hence the need to look at the driver statistics.

The reason for the queues is to allow immediate processing or transmission when the driver hardware or network stack is ready; otherwise a request for more data would have to be made, which is slower.  Initially the desire is to set the receive ring buffer to the maximum, but with a lot of connections this can cause some connections to "starve" while others get maximum benefit, especially because of the TSO and LRO combination.

In general, to reduce dropped packets, which may cause lost connections, it is a good idea to set the receive Ring Buffer to the maximum on interfaces used for backup or as a replication destination.  The following shows the commands to do this for interface eth4b:

$ se ethtool -g eth4b
Ring parameters for eth4b:
Pre-set maximums:
RX:    4078
RX Mini:   0
RX Jumbo:  0
TX:    4078
Current hardware settings:
RX:    407
RX Mini:   0
RX Jumbo:  0
TX:    4078

Since the maximum is 4078, set the current value to 4078:

$ net option set ethtool.ring.rx.eth4b 4078

On the sending side, such as the replication source, if the TCP window size at the destination has been increased by changing tcp_rmem, the throughput is high (greater than 100 Mb/s), and the RTT is greater than 40 milliseconds, consider changing the txqueuelen on the replication source side:

$ ddsh -s net config eth1a
eth1a  Link encap:Ethernet  HWaddr 00:60:16:51:71:B8 
       inet6 addr: fe80::260:16ff:fe51:71b8/64 Scope:Link
       inet6 addr: fd1d:456b:ccc3:55ca:260:16ff:fe51:71b8/64 Scope:Global
       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
       RX packets:27637 errors:0 dropped:0 overruns:0 frame:0
       TX packets:29585 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:2536522 (2.4 MiB)  TX bytes:3761730 (3.5 MiB)

$ ddsh -s net config eth1a txqueuelen 2000


An important note about the Linux kernel in DD OS release 7.x and higher: several improvements have been made in the Linux kernel to automate the network configuration settings for best performance.  Therefore, setting the ring buffer length may not be needed; see the information in Appendix B.



Terms
Speed:
b/s = bits per second
B/s = Bytes per second
Kb/s = 1,000 bits per second
KB/s = 1,000 Bytes per second
Mb/s = 1,000,000 bits per second
MB/s = 1,000,000 Bytes per second
Gb/s = 1,000,000,000 bits per second
GB/s = 1,000,000,000 Bytes per second

MTU = Maximum Transmission Unit, which is the largest a network packet can be, not including the Ethernet header.
GRO = Generic Receive Offload
GSO = Generic Segmentation Offload, converts large packets to MTU-size packets at the driver.
LRO = Large Receive Offload
RTT = Round Trip Time, the time it takes to send a packet to the destination and get a response back.
SKB = Socket Kernel Buffers
TSO = TCP Segmentation Offload, also referred to as GSO within ethtool
UFO = UDP Fragmentation Offload
Xoff = Transmitter off, tells the sender to stop sending packets.
Xon = Transmitter on, tells the sender to start sending packets.




Appendix A
Basic Network Throughput
The maximum network performance is primarily dependent on the speed of the interface.  The throughput of an interface will always be less than the physical speed of the interface.  For example, if the interface speed is listed as 1 Gb, the throughput is expected to be no better than 99.3% of 1 Gb/s, which is 993 Mb/s or about 124 MB/s.  This is of course affected by network configuration and network use.  These numbers are with an MTU of 9000.  If the MTU is set to 1500, which is the default, the maximum throughput for a 1 Gb/s interface would be 95.7%, or 956.7 Mb/s, which is 119.6 MB/s.
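
The 99.3% and 95.7% figures follow directly from the per-packet overhead model in Appendix C (40-byte TCP/IP header without timestamps, 18 bytes of Ethernet header plus 8 bytes of preamble, no VLAN tag), as this short calculation shows:

# Link efficiency as a function of MTU, using the Appendix C packet model.
def efficiency(mtu):
    payload = mtu - 40      # TCP payload per packet (no timestamp option)
    wire = mtu + 18 + 8     # Ethernet header/FCS plus preamble per packet
    return payload / wire

for mtu in (1500, 9000):
    e = efficiency(mtu)
    print(f"MTU {mtu}: {e:.1%} -> {e * 1000:.1f} Mb/s on a 1 Gb/s link")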



Appendix B
Auto-performance adjustments by the Linux kernel
Byte Queue Limits (BQL) is a new feature in recent Linux kernels (> 3.3.0) that attempts to solve the problem of driver queue sizing automatically. This is accomplished by adding a layer that enables and disables queueing to the driver queue based on calculating the minimum queue size required to avoid starvation under the current system conditions.  The smaller the amount of queued data, the lower the maximum latency experienced by queued packets.

It is key to understand that the actual size of the driver queue is not changed by BQL. Rather, BQL calculates a limit of how much data (in bytes) can be queued at the current time. Any bytes over this limit must be held or dropped by the layers above the driver queue.

A real-world example may help provide a sense of how much BQL affects the amount of data that can be queued. On one test server, the driver queue size defaults to 256 descriptors. Since the Ethernet MTU is 1,500 bytes, this means up to 256 * 1,500 = 384,000 bytes can be queued to the driver queue (TSO, GSO and so forth are disabled, or this would be much higher). However, the limit value calculated by BQL is 3,012 bytes. As you can see, BQL greatly constrains the amount of data that can be queued.
BQL reduces network latency by limiting the amount of data in the driver queue to the minimum required to avoid starvation. It also has the important side effect of moving the point where most packets are queued from the driver queue, which is a simple FIFO, to the queueing discipline (QDisc) layer, which is capable of implementing much more complicated queueing strategies.



Appendix C
Packet sizing and data throughput calculations
1500 MTU
A packet w/timestamp = 1448 bytes of data, 52 bytes of TCP/IP header, 18 bytes of Ethernet header, and 8 bytes of preamble and start = 1526 bytes, or 1530 bytes with 4 bytes of VLAN tagging
A packet w/o timestamp = 1460 bytes of data, 40 bytes of TCP/IP header, 18 bytes of Ethernet header, and 8 bytes of preamble and start = 1526 bytes, or 1530 bytes with 4 bytes of VLAN tagging

Sending at 1 Gb/s with these sizes means roughly 81,913.5 packets per second without a VLAN tag, or 81,699.3 packets per second with a VLAN tag; the VLAN-tagged case without timestamps carries about 954,248,366 data b/s, as detailed below.

MTU 1500 w/1 Gb/s
A packet w/o timestamp and w/o VLAN tagging = 1526 total bytes (12,208 bits) and 1460 data bytes (11,680 bits)
        81,913.4993 packets per second = 956,749,672 data bits/second = 119,593,709 data Bytes/second
A packet w/o timestamp and w/VLAN tagging = 1530 total bytes (12,240 bits) and 1460 data bytes (11,680 bits)
        81,699.3464 packets per second = 954,248,366.01307 data bits/second = 119,281,045.7516 data Bytes/second
A packet w/timestamp and w/o VLAN tagging = 1526 total bytes (12,208 bits) and 1448 data bytes (11,584 bits)
        81,913.4993 packets per second = 948,885,976.4089 data bits/second = 118,610,747.05111 data Bytes/second
A packet w/timestamp and w/VLAN tagging = 1530 total bytes (12,240 bits) and 1448 data bytes (11,584 bits)
        81,699.3464 packets per second = 946,405,228.758 data bits/second = 118,300,653.59477 data Bytes/second

MTU 9000 w/1 Gb/s
A packet w/o timestamp and w/o VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        13,848.881 packets per second = 992,687,790.8265 data bits/second = 124,085,973.85 data Bytes/second

A packet w/o timestamp and w/VLAN tagging = 9030 total bytes (72,240 bits) and 8960 data bytes (71,680 bits)
        13,842.7464 packets per second = 992,248,062.0155 data bits/second = 124,031,007.7519 data Bytes/second

A packet w/timestamp and w/o VLAN tagging = 9026 total bytes (72,208 bits) and 8948 data bytes (71,584 bits)
        13,848.881 packets per second = 991,358,298.2495 data bits/second = 123,919,787.281 data Bytes/second

A packet w/timestamp and w/VLAN tagging = 9030 total bytes (72,240 bits) and 8948 data bytes (71,584 bits)
        13,842.7464 packets per second = 990,919,158.361 data bits/second = 123,864,894.7951 data Bytes/second

MTU 1500 w/10 Gb/s, speeds will change by an order of magnitude, e.g.:

A packet w/o timestamp and w/o VLAN tagging = 1526 total bytes (12,208 bits) and 1460 data bytes (11,680 bits)
        819,134.993 packets per second = 9,567,496,723.46 data bits/second = 1,195,937,090.4325 data Bytes/second

MTU 9000 w/10 Gb/s, speeds will change by an order of magnitude, e.g.:

A packet w/o timestamp and w/o VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        138,488.81 packets per second = 9,926,877,908.265 data bits/second = 1,240,859,738.5 data Bytes/second

MTU 9000 w/100 Gb/s, speeds will change by two orders of magnitude, e.g.:

A packet w/o timestamp and w/o VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        1,384,888.1 packets per second = 99,268,779,082.65 data bits/second = 12,408,597,385 data Bytes/second
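
All of the rows above can be reproduced with the same packet model.  The sketch below iterates over the timestamp and VLAN combinations at 1 Gb/s; like the text, it omits the 12-byte inter-frame gap, so the figures are slightly optimistic relative to the wire.

# Reproduce the Appendix C packet-rate and data-rate figures at 1 Gb/s.
LINE_RATE = 1_000_000_000  # bits per second

for mtu in (1500, 9000):
    for ts in (False, True):            # TCP timestamp option adds 12 bytes
        for vlan in (False, True):      # 802.1Q VLAN tag adds 4 bytes
            data = mtu - 40 - (12 if ts else 0)        # data bytes per packet
            total = mtu + 18 + 8 + (4 if vlan else 0)  # total bytes per packet
            pps = LINE_RATE / (total * 8)
            print(f"MTU {mtu} ts={ts} vlan={vlan}: {pps:,.4f} pkt/s, "
                  f"{pps * data * 8:,.0f} data b/s")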


 
Article Properties

First Published

Fri Feb 07 2020 23:02:35 GMT

