Data Domain Network Performance and Configuration

Summary: Data Domain Network Performance and Configuration

Instructions

Note: This KB article provides extra guidance so that a customer's networking team can optimize the network configuration instead of contracting with Dell Professional Services.
Note: This KB article does not imply that Dell Technical Support can address customer networking issues or optimizations.

DD Network Configuration for performance

There are several ways to improve network performance on the DD systems.

  • Validate system interfaces are configured and available.
  • Measure network performance at the kernel level.
  • Increase the packet size, both internal and external.
  • Increase the number of interfaces used.
  • Change the network memory allocation and use.
  • Reduce the number of dropped packets and the recovery method.

Validate interfaces are configured correctly.
Before doing any performance work, you must ensure that the systems can communicate with each other. This is done by doing the following:
  • Confirm the interfaces of interest on the two systems are in the running state with the correct IP addresses.
  • Check that the systems can ping each other using the addresses that are given.
  • Validate that connections can be made between the two systems by using telnet in both directions.
If these checks pass, go to the next item.
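As an illustration, these checks can be run from the remote host (a media server or the other DD system) with standard tools; the host name is an example, and the telnet test assumes a service such as the iperf server from the next item is already listening on the chosen port:
$ ping -c 4 dd-destination.example.com
$ telnet dd-destination.example.com 5001
Repeat the same checks in the opposite direction so that connectivity is verified both ways.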

Measure network performance at the kernel level.
There are several programs that test network throughput, but the primary one used with DD systems is iperf. The DD system primarily receives data, but it can also send data when doing restores or acting as the source for replication. Most of the time, the iperf program runs on the DD system as a server, but for testing restores and the system acting as a replication source, iperf is run as a client. There are DD CLI and bash shell commands to run iperf.

Using the DD CLI, the following command starts iperf in listen mode on port 5001, which is the default port for iperf:
net iperf server

The parameters that may be used with this command are ipversion, port, and window-size. Most of the time, these are not needed, although in an IPv6 environment, ipversion ipv6 would be used.
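For example, starting the server in an IPv6 environment or on a non-default port might look like the following; the port number is illustrative, and the exact option syntax can vary by DD OS release:
net iperf server ipversion ipv6
net iperf server port 5002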

When checking the throughput of a DD system doing restores or as the source doing replication, the iperf client would be used:       
net iperf client <destination system>
     [interval <seconds>]  Default is none if not specified
     [duration <seconds>] Default is 10 seconds
     [connections <number>]   Number of connections to use, default is 1
     [window-size <bytes>]      The TCP window size to be used, default is 256 KBytes
     [ipversion ipv6]          Default is ipv4 if not specified
     [data random]                   Default is not random
     [transmit-size <bytes>]      Amount of data to send, default is no limit
     [port <number>]                Default is 5001
     [nodelay]                   Default is to wait for responses

Typically, nodelay, port, transmit-size, and data random are not specified, and ipversion is specified only when IPv6 is used. Besides the destination, what is usually specified is interval, duration, and connections, depending on the network environment and link speed. Multiple connections are useful if there are long round-trip times (RTT) or interface speeds over 10 Gb/s. Longer durations give better steady-state values, but they can also be disruptive to anything else using the interface under test. The results show what can be expected at the network (TCP) level.
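For example, a test of a replication source across a WAN with a long RTT might combine several connections with a longer run time; the destination name is illustrative:
net iperf client dd-destination.example.com connections 8 duration 60 interval 10
This runs eight parallel TCP connections for 60 seconds and reports intermediate results every 10 seconds, making it easier to see whether the throughput reaches a steady state.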

The results of this test show whether the network is causing the throughput to be low or whether it is something else on the systems. If the iperf performance measurements indicate that the network is causing the throughput to be slower than expected, look to the next item to see what can be done to improve the network flow. As a general guideline, consider the following in determining what throughput is expected:      
  • For a 1 GbE interface, anything above 920 Mb/s is considered acceptable, but the target is 940 Mb/s.
  • For a 10 GbE interface, anything above 9 Gb/s is considered acceptable, but expect no more than 9.4 Gb/s.
  • For a 25 GbE interface, anything above 24 Gb/s is considered acceptable.
  • For a 100 GbE interface, anything above 99 Gb/s is considered acceptable.
If the throughput is less than these values, some network configuration changes may be needed. For more details on how the actual speed compares with the theoretical or mathematical speed, see Appendix A.

Increase the internal and external packet size.
There is a difference between internal and external packet sizes. Internal packet sizes can be as large as 64 KB. Throughput is affected by the network driver efficiency, the interface hardware, and how effectively packets are transferred between the TCP module and the network driver. By default, the DD network stack uses TCP Segmentation Offload (TSO) and Large Receive Offload (LRO), with fragmentation disabled in the IP layer, to improve network performance. With TSO and LRO, the packets sent from and received by TCP can be bigger than the MTU size, up to 64 KB in length. The network hardware divides the packet into MTU-sized packets when sending data and combines received packets into larger packets. This way there is less overhead handling the packets in the network stack. All of this is on by default, no matter what the MTU is. One study done within Data Domain found that on a 10 Gb interface with IPv6 addressing, TSO and LRO had a large impact on performance, even though TSO is implemented in software for IPv6. Without TSO and LRO, the performance was around 5 - 6 Gb/s. To get over 9 Gb/s throughput on a 10 GbE interface, TSO is required.
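To confirm that the offloads are active on a physical interface, the ethtool offload settings can be inspected from the SE shell; the interface name is an example, and the exact feature names vary by driver:
$ se ethtool -k eth4b | grep -Ei 'tcp-segmentation-offload|generic-segmentation-offload|large-receive-offload'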

The TSO size depends on the size of the application buffers. Large application buffers, greater than 100 KB and preferably over 1 MB, help use the maximum packet size. It is recommended that the socket buffer size be a multiple of 64 KB. Setting the application's socket buffer to 1,048,576 bytes or larger is a good idea when large transfers are being done.

The external packet sizes are determined by the MTU value of each interface and the MTU of the network path that is used. In general, backup on the LAN should have the MTU set to 9000, and replication over a WAN should use 1500, but the value depends on the network environment and the interface speed. For a 1 Gb interface, an MTU of 1500 may be acceptable, but for 10 Gb interface speeds, an MTU of 9000 is recommended on LANs if the LAN supports it.

Setting the MTU correctly for the network environment is important because the IP don't-fragment flag is enabled, and it is expected that the MTU for the WAN is limited to 1500 or less. Because of this and other factors, such as the use of pause frames, it is recommended not to combine WAN and LAN traffic on the same interface. Having them combined on one interface can work, but performance for the LAN, the WAN, or both may be affected. If there are network performance concerns, it is recommended to split LAN and WAN traffic onto different physical interfaces.
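As an illustration, the interface MTU can be set with the net config command (the same form as the txqueuelen example later in this article), and the effective path MTU can be verified with a non-fragmenting ping from a Linux host; the interface name and destination are examples:
$ net config eth4b mtu 9000
$ ping -M do -s 8972 -c 4 dd-destination.example.com
The payload size of 8972 bytes equals the 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header. If the ping fails with a message-too-long error, something in the path does not support the 9000-byte MTU.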


Increase the number of interfaces used.
To get speeds greater than the line speed of one interface, interfaces can be bonded together for a higher combined throughput. The combined throughput does not necessarily double for two interfaces, but it should be close, depending on how the traffic is distributed, which in turn depends on the hash method being used. There are other documents and KB articles that cover bonding, but a summary follows:

There are three aggregation bonding modes supported:
  • Round robin
  • Balanced
  • LACP
Round robin is not recommended, but it can work, especially in direct-attached cases. The last two use a transmit hash to distribute the packets across the interfaces. The available transmit hashes on the DD are xor-L2 (XOR of source and destination MAC addresses), xor-L2L3 (XOR of source and destination MAC addresses and IP addresses), and xor-L3L4 (XOR of source and destination IP addresses and TCP port numbers).

The xor-L3L4 hash is recommended, but xor-L2L3 can be an alternative. If there is only one source and destination, as can happen with replication or backup with only one media server, all traffic would go through only one interface unless xor-L3L4 is used with multiple connections. Also note that this is a transmit hash: it only applies to transmitted data, not received data. For backup data, the switch or router hash (load balance) controls the flow across the interfaces. Note that, contrary to some switch documentation, the hash does not have to match on both sides, but the settings should be similar. The DD hash and switch hash cannot match exactly because the hash on the DD is implemented differently than on switches, but it makes sense to keep them close. For example, xor-L3L4 is recommended for DD systems; on the switch side, the hash should then be source and destination port. If xor-L2L3 is used on the DD system, then source and destination IP should be used on the switch.

The LACP mode is the recommended aggregation because of its better failover capability, but the alternative of using DDBoost with ifgroups should also be considered. When trying to increase performance, especially with aggregation bonding and traffic running over a WAN, use multiple connections; DDBoost with ifgroups already does this automatically. At least four connections should be used, but often 8 to 16 connections or more should be considered, depending on the expected throughput. This makes the total throughput less sensitive to delays and dropped packets. There is a limit, based on the network line speed, where adding more connections actually slows down the overall throughput, but that would be when the number of active replication connections is over 100 for a 1 Gb interface.
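As a sketch only, since the exact CLI syntax can vary by DD OS release, creating an LACP aggregate over two physical ports with the xor-L3L4 hash might look like the following; the virtual and physical interface names are examples:
net aggregate add veth1 interfaces eth4a eth4b mode lacp hash xor-L3L4
The corresponding switch ports must also be configured as an LACP port channel, with a source and destination port load-balance hash so that the distribution matches the recommendation above.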

Change the network memory allocation and use.
If the wanted throughput is not achieved after the setup given in item 3, there are TCP and driver configuration changes that can be made. At the TCP level, there are the network memory allocations. These are set to the best values for most DD systems with the normal use of backup and replication. There are some things that are important to keep in mind, the first two being the Round-Trip Time (RTT) and the number of dropped packets at the receiver. On the LAN, it is expected that the RTT is less than 0.3 milliseconds. This can be tested with a ping, and the congestion log also provides this information. For dropped packets, the interface statistics for all active interfaces can be seen using the command:
net show stats interfaces 
For details, look in the ethtool log. If there is a slowdown because of a larger than expected RTT, network captures can show whether TCP zero windows are occurring. Changing tcp_rmem is recommended for replication, but it has no impact on backup traffic, because the backup program sets the TCP receive buffer itself by using a socket option (setsockopt). The value the backup program uses can be changed using the DDFS configuration. Alternatively, if the statistics show many dropped packets, consider enabling pause frames or increasing the receive ring buffer size. This is discussed later under driver configuration.
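For a closer look at where receive drops are occurring on a specific physical interface, the per-driver counters can also be read with ethtool from the SE shell; the interface name is an example, and counter names vary by driver:
$ se ethtool -S eth4b | grep -Ei 'drop|discard|no_buffer'
Comparing these counters before and after a backup or replication window shows whether the drops grow while the workload runs.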

Most network performance issues come from replication, because traffic over a WAN is where the RTT is higher than 10 milliseconds and there is a greater possibility of dropped packets, either at the driver or on the network. If there are no dropped packets, or only a limited number, then the RTT can be used to guide how big tcp_rmem should be.

The tcp_rmem setting consists of three numbers.
  • The first number must never be changed. It is the default and keeps the memory use of internal connections small.
  • Use the following formula to determine what size the middle number should be:  
(Throughput in Bytes/sec) * (RTT in seconds) <= (tcp_rmem middle number)
 
For example:      
Throughput  = 100 Mb/s = 12.5 MB/s
RTT = 20 milliseconds = 0.02 sec
The default for tcp_rmem is 4096  262144  6190560
12,500,000 * 0.02 = 250,000 which is < 262,144

Another example:      
Throughput  = 50 Mb/s = 6.25 MB/s
RTT = 150 milliseconds = 0.15 sec
The default for tcp_rmem is 4096  262144 6190560
6,250,000 * 0.15 = 937,500 which is > 262,144

In this case, it should be increased. The value should be a power of two. The recommendation here would be to set it to 4 * 262144 = 1,048,576.

The command to do this is:      
 
net option set net.ipv4.tcp_rmem "4096 1048576 6190560"
  • The third value should be at least two times the middle figure. In the second example, there is no reason to change it.
There can be a case where the settings get too large for the kernel memory. In that case, net.core.rmem_max must be changed; the default value is 8,388,608. Also, net.ipv4.tcp_mem must be changed; the default value is 6187557 8250076 12375114, but where the other values are in bytes, these values are in memory pages. Note that this setting applies across all TCP connections.
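The sizing rule above can also be scripted. The following is a minimal bash sketch (not a DD CLI command) that computes the bandwidth-delay product for an assumed throughput and RTT and rounds it up to the next power of two, reproducing the second example above:
#!/bin/bash
# Illustrative values only: expected throughput in Mb/s and measured RTT in milliseconds.
THROUGHPUT_MBPS=50
RTT_MS=150
# Bandwidth-delay product in bytes: (bits per second / 8) * (RTT in seconds)
BDP=$(( THROUGHPUT_MBPS * 1000000 / 8 * RTT_MS / 1000 ))
# Round up to a power of two, starting from the default middle value of 262144.
MIDDLE=262144
while [ "$MIDDLE" -lt "$BDP" ]; do
    MIDDLE=$(( MIDDLE * 2 ))
done
echo "BDP = $BDP bytes; suggested tcp_rmem middle value = $MIDDLE"
# Prints: BDP = 937500 bytes; suggested tcp_rmem middle value = 1048576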

This gets complicated if there is a significant number of dropped packets combined with high RTT. Depending on the RTT and the throughput, a change to the congestion avoidance algorithm may be needed. The default is cubic, which is fine in most cases, but with long RTT and dirty networks (many dropped packets on the network path), a different congestion avoidance algorithm may be needed.
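The congestion avoidance algorithm in use, and the alternatives available in the kernel, can be checked through the same sysctl values that net option manages; a sketch from the SE shell, where the availability of alternatives depends on the DD OS release:
$ se sysctl net.ipv4.tcp_congestion_control
$ se sysctl net.ipv4.tcp_available_congestion_control
Keep in mind that changing the congestion avoidance algorithm affects all TCP connections on the system.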

Reduce the number of dropped packets and the recovery method.
If there are too many dropped packets on the receive side, there are two driver configurations to consider:      
  • Pause frames and rx ring buffer size on the receiving system if there are dropped packets or overrun errors.
  • Changing the txqueuelen may be considered on the transmit side.
The network receive code can get overwhelmed if the application is slow in reading the data received, if there is not enough memory to allocate for the new data received, if there are not enough processors and interrupt vectors to handle the number of packets received, or if the driver is waiting on locks and cannot handle the receive flow from the hardware. In any of these cases, the received data may overflow and cause some network packets to be dropped. This condition is recognized by the driver, which sends a pause frame (Xoff) to the switch to stop it from sending until the issue is at least partially resolved. When the issue is resolved, the driver sends an unpause (Xon) frame. This is the normal way the driver throttles the receive data flow.

To enable receive pause frames for the eth4b interface, run the following command:
$ net option set ethtool.pause.rx.eth4b on

To turn it off, run the following:      
$ net option set ethtool.pause.rx.eth4b off
This is done per interface, but it can only be done for physical interfaces. It cannot be used directly on bonded interfaces, VLAN interfaces, or alias interfaces. Also, this only slows down the throughput to allow the application to catch up. If the slowdown is not seen in the iperf testing, there may be an application issue that this change does not resolve. In that case, look at what performance improvements can be made with the applications being used. Also, there are instances where enabling pause frames is not wanted. One reason is that paused interfaces push the issue back to the switches and routers, which must then slow down parts of their network or drop packets. Pause frames should only be used if the network allows them and only for short intervals.
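The current pause settings for a physical interface can be checked with ethtool from the SE shell; the interface name is an example:
$ se ethtool -a eth4b
This reports whether autonegotiation, RX pause, and TX pause are currently enabled, which is useful to confirm before and after changing the ethtool.pause.rx setting shown above.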

Another option that can be used with pause control, or instead of it, is to change the Ring Buffer size on specific interfaces to reduce the number of dropped packets. As part of the data handling, the driver uses a receive Ring Buffer and a transmit Ring Buffer to handle SKBs passed within the network stack. These buffers contain the packet information to be sent to or received from the network. The Ring Buffers are also known as the Driver Queue. This is a FIFO circular buffer that contains information pointing to the SKBs. The size that is specified determines how many SKBs can be queued to be sent (tx Ring Buffer) or received (rx Ring Buffer) while waiting for further processing, either by the driver for packets being sent to the hardware to be transmitted, or by the network stack for packets being received. If the rx Ring Buffer gets filled, any further packets are dropped. Hence, the driver statistics must be checked.

The reason for the queues is to allow immediate processing or transmission when the driver hardware or network stack is ready. Otherwise, a request for more data would have to be made, which is slower. Initially, the desire is to set the receive ring buffer to the maximum, but with a lot of connections, this can cause some connections to starve while others get maximum benefit, especially because of the TSO and LRO combination.

In general, to reduce dropped packets, which may cause lost connections, it is a good idea to set the receive Ring Buffer to the maximum on interfaces that are used for backup or as a replication destination. The following shows how to check the current and maximum ring buffer sizes for interface eth4b and then set the maximum:
$ se ethtool -g eth4b

Ring parameters for eth4b:
Pre-set maximums:
RX:    4078
RX Mini:   0
RX Jumbo:  0
TX:    4078
Current hardware settings:
RX:    407
RX Mini:   0
RX Jumbo:  0
TX:    4078
Since the maximum is 4078, set the current to 4078:      
 
$ net option set ethtool.ring.rx.eth4b 4078

On the sending side, such as the replication source, if the TCP window size at the destination has been increased by changing tcp_rmem, the throughput is high (greater than 100 Mb/s), and the RTT is greater than 40 milliseconds, consider changing the txqueuelen on the replication source side:
$ ddsh -s net config eth1a

eth1a  Link encap:Ethernet  HWaddr 00:60:16:51:71:B8 
       inet6 addr: fe80::260:16ff:fe51:71b8/64 Scope:Link
       inet6 addr: fd1d:456b:ccc3:55ca:260:16ff:fe51:71b8/64 Scope:Global
       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
       RX packets:27637 errors:0 dropped:0 overruns:0 frame:0
       TX packets:29585 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:2536522 (2.4 MiB)  TX bytes:3761730 (3.5 MiB)

$ ddsh -s net config eth1a txqueuelen 2000
An important note about the Linux kernel in DD OS release 7.x and later: several improvements were made in the Linux kernel to automate the network configuration settings for the best performance. Setting the ring buffer length may not be needed, based on the information given in Appendix B.

Terms
Speed:      

b/s = bits per second
B/s = Bytes per second
Kb/s = 1,000 bits per second
KB/s = 1,000 Bytes per second
Mb/s = 1,000,000 bits per second
MB/s = 1,000,000 Bytes per second
Gb/s = 1,000,000,000 bits per second
GB/s = 1,000,000,000 Bytes per second

MTU = Maximum Transmission Unit, the largest a network packet can be, not including the Ethernet header.
GRO = Generic Receive Offload
GSO = Generic Segmentation Offload, convert large packets to MTU size packets at the driver.
LRO = Large Receive Offload
RTT = Round-Trip Time, the time it takes to send a packet to the destination and get a response back.
SKB = Socket Kernel Buffers
TSO = TCP Segmentation Offload, also seen as GSO within ethtool
UFO = UDP Fragmentation Offload
Xoff = Transmitter off, Tells the sender to stop sending packets.
Xon = Transmitter on, Tells the sender to start sending packets.

Appendix A
Basic Network Throughput

The maximum network performance depends primarily on the speed of the interface, and the throughput of an interface is always less than its physical speed. For example, if the interface speed is listed as 1 Gb, the throughput is expected to be no better than 99.3% of 1 Gb/s, which is 993 Mb/s or 124 MB/s. This is affected by network configuration and network use; these numbers assume an MTU of 9000. If the MTU is set to 1500, which is the default, the maximum throughput for a 1 Gb/s interface would be 95.7%, or 956.7 Mb/s, which is 119.6 MB/s.
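The percentages above follow from the per-packet overhead detailed in Appendix C. A small awk sketch reproduces them, assuming a TCP header without timestamps and no VLAN tag:
$ awk 'BEGIN { printf "MTU 1500: %.1f%%\n", 100 * 11680 / 12208; printf "MTU 9000: %.1f%%\n", 100 * 71680 / 72208 }'
MTU 1500: 95.7%
MTU 9000: 99.3%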

Appendix B
Autoperformance adjustments by the Linux kernel

Byte Queue Limits (BQL) is a new feature in recent Linux kernels (> 3.3.0) that attempts to solve the issue of driver queue sizing automatically. This is accomplished by adding a layer that enables and disables queueing to the driver queue based on calculating the minimum queue size that is required to avoid starvation under the current system conditions. The smaller the amount of queued data, the lower the maximum latency experienced by queued packets.

It is key to understand that the actual size of the driver queue is not changed by BQL. Rather, BQL calculates a limit of how much data (in bytes) can be queued at the current time. Any bytes over this limit must be held or dropped by the layers above the driver queue.

A real-world example may help provide a sense of how BQL affects the amount of data that can be queued. On one test server, the driver queue size defaults to 256 descriptors. Since the Ethernet MTU is 1,500 bytes, this means up to 256 * 1,500 = 384,000 bytes can be queued to the driver queue (TSO, GSO, and so forth, are disabled, or this would be higher). However, the limit value that is calculated by BQL is 3,012 bytes. As you can see, BQL greatly constrains the amount of data that can be queued.
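On systems with BQL, the limit that the kernel has calculated for a transmit queue can be read from sysfs; the interface and queue names are examples, and the value shown simply echoes the test-server example above:
$ cat /sys/class/net/eth4b/queues/tx-0/byte_queue_limits/limit
3012
The inflight and limit_max files in the same directory show how much data is currently queued and the upper bound the limit may reach.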

BQL reduces network latency by limiting the amount of data in the driver queue to the minimum required to avoid starvation. It also has the important side effect of moving the point where most packets are queued from the driver queue, which is a simple FIFO, to the queueing discipline (QDisc) layer, which is capable of implementing more complicated queuing strategies.

Appendix C
Packet sizing and data throughput calculations.

1500 MTU
A packet with timestamp = 1448 bytes of data, 52 bytes of header, 18 bytes for Ethernet header, 4 bytes for VLAN tagging, and 8 bytes of preamble and start = 1526 and 1530 bytes with VLAN
A packet without timestamp = 1460 bytes of data, 40 bytes of header, 18 bytes for Ethernet header, 4 bytes for VLAN tagging, and 8 bytes of preamble and start

Sending at 1 Gb/s with MTU 1500, the line rate supports roughly 81,913 packets per second without VLAN tagging, or 81,699 packets per second with VLAN tagging, which works out to approximately 946 - 957 Mb/s of data depending on VLAN tagging and TCP timestamps, as the detailed calculations below show.

MTU 1500 w/1 Gb/s
A packet without timestamp and without VLAN tagging = 1526 total bytes (12,208 bits) and 1460 data bytes (11,680 bits)
        81,913.4993 packets per second = 956,749,672 data bits/second = 119,593,709 data Bytes/second
A packet without timestamp and w/VLAN tagging = 1530 total bytes (12,240 bits) and 1460 data bytes (11,680 bits)
        81,699.3464 packets per second = 954,248,366.01307 data bits/second = 119,281,045.7516 data Bytes/second
A packet with timestamp and without VLAN tagging = 1526 total bytes (12,208 bits) and 1448 data bytes (11,584 bits)
        81,913.4993 packets per second = 948,885,976.4089 data bits/second = 118,610,747.05111 data Bytes/second
A packet with timestamp and w/VLAN tagging = 1530 total bytes (12,240 bits) and 1448 data bytes (11,584 bits)
        81,699.3464 packets per second = 946,405,228.758 data bits/second = 118,300,653.59477 data Bytes/second

MTU 9000 w/1 Gb/s
A packet without timestamp and without VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        13,848.881 packets per second = 992,687,790.8265 data bits/second = 124,085,973.85 data Bytes/second

A packet without timestamp and with VLAN tagging = 9030 total bytes (72,240 bits) and 8960 data bytes (71,680 bits)
        13,842.7464 packets per second = 992,248,062.0155 data bits/second = 124,031,007.7519 data Bytes/second

A packet with timestamp and without VLAN tagging = 9026 total bytes (72,208 bits) and 8948 data bytes (71,584 bits)
        13,848.881 packets per second = 991,358,298.2495 data bits/second = 123,919,787.281 data Bytes/second

A packet with timestamp and with VLAN tagging = 9030 total bytes (72,240 bits) and 8948 data bytes (71,584 bits)
        13,842.7464 packets per second = 990,919,158.361 data bits/second = 123,864,894.7951 data Bytes/second

MTU 1500 with 10 Gb/s, speeds change by an order of magnitude, for example:

A packet without timestamp and without VLAN tagging = 1526 total bytes (12,208 bits) and 1460 data bytes (11,680 bits)
        819,134.993 packets per second = 9,567,496,723.46 data bits/second = 1,195,937,090.4325 data Bytes/second

MTU 9000 with 10 Gb/s, speeds change by an order of magnitude, for example:

A packet without timestamp and without VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        138,488.81 packets per second = 9,926,877,908.265 data bits/second = 1,240,859,738.5 data Bytes/second

MTU 9000 with 100 Gb/s, speeds change by two orders of magnitude, for example:

A packet without timestamp and without VLAN tagging = 9026 total bytes (72,208 bits) and 8960 data bytes (71,680 bits)
        1,384,888.1 packets per second = 99,268,779,082.65 data bits/second = 12,408,597,385 data Bytes/second

Additional Information

This content is translated into 17 languages:
https://downloads.dell.com/TranslatedPDF/CS_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/DA_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/DE_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/ES-XL_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/FI_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/FR_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/IT_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/JA_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/KO_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/NL_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/NO-NO_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/PL_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/PT-BR_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/RU_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/SV_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/TR_KB541073.pdf
https://downloads.dell.com/TranslatedPDF/ZH-CN_KB541073.pdf

Affected Products

Data Domain
