The virtual tape library backup performance issues

Introduction

Many users find that Data Domain (DD) Virtual Tape Library (VTL) backups are relatively slow, sometimes even slower than backups to a physical tape library, and they begin to doubt that DD is as good as it is supposed to be. I hope to remove that doubt with this post and help readers find the root cause of these performance problems. In fact, more than 80% of the performance issues have nothing to do with Data Domain itself. They usually come from a deployment that was not implemented properly, or from a bottleneck caused by excessive network load.

Next, I would like to share my personal experience in the hope that it is helpful. Performance tuning is always complex work because it involves many operating system platforms and communication protocols. I will only cover the points I know to be relevant to VTL, without going too deep into details.

Detailed Information

Let's look at the factors that can affect backup and restore performance.

1. Backup server hardware configurations, including CPU, memory, hard drives and network cards;

2. Backup server's operating system;

3. Backup server's daily workload;

4. Backup client's hardware configurations, including CPU, memory, hard drive and network card/optical port;

5. Backup client's operating system;

6. Backup client's daily workload;

7. Backup network hardware and configuration;

8. Backup network congestion;

9. Fiber storage hardware and network configuration;

10. Fiber network congestion;

11. Fiber transmission distance and switch interconnect bandwidth and hops;

12. Different communication protocols;

13. Communication protocol optimization;

14. The final backup device: tape library hardware and configuration.

Does that look a little confusing? You may not have expected an ordinary backup job to be so complicated. Let's look at the diagram below to see how data flows between the VTL and the storage node, which should make the performance factors above easier to understand.

We must combine the analysis of the factors above with the overall data flow, working from the source of the data to its destination to find where the bottleneck lies. Take coal transport as an example: a backup is like moving coal from the mine to the warehouse.

First, a digging machine mines the coal and loads it onto a truck. The truck follows the specified route to the destination, unloads the coal, and returns for the next load. It looks very simple, but the total transport window depends entirely on what the user needs. If the user is in no hurry, you can haul the coal slowly with a small truck to keep costs down. If the customer needs the coal urgently, you have to speed things up, and an expedite cost is unavoidable: more digging machines mine at the same time and load the coal onto larger trucks, which travel to the warehouse in parallel, greatly shortening the transport window. Backup works the same way. There is no fastest, only faster: any speed that meets your backup window is good enough, and better backup performance means investing more.

Mapping this back to data backup, the backup window is determined by the rate at which the source data is read (the read size and the number of readers are the digging machines), the transfer to the backup server (the truck), the transfer size and the number of data streams that can run in parallel (the size and number of trucks), the transmission distance, whether there are roadblocks in the network, and many other factors. The corresponding terms are TCP WINDOW SIZE, SEND/RECEIVE BUFFER SIZE, BUFFER SIZE, BLOCK SIZE, MULTIPLE STREAMS, MULTIPLEXING, ISL BANDWIDTH, and so on.
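To make the "slowest stage sets the window" idea concrete, here is a minimal Python sketch. The stage names and throughput figures are made-up illustration values, not measurements from any real environment; the only point is that the backup window follows from the data size divided by the rate of the slowest stage along the data flow.

```python
def backup_window_hours(data_gb, stage_mb_per_s):
    """Return (hours, bottleneck_stage) for a backup limited by its slowest stage.

    data_gb        -- amount of data to back up, in GB (1 GB = 1024 MB here)
    stage_mb_per_s -- dict mapping a stage name to its sustained rate in MB/s
    """
    # The end-to-end rate can never exceed the slowest stage in the chain.
    bottleneck = min(stage_mb_per_s, key=stage_mb_per_s.get)
    rate = stage_mb_per_s[bottleneck]
    seconds = data_gb * 1024 / rate
    return seconds / 3600, bottleneck


# Hypothetical figures, for illustration only.
stages = {
    "client disk read": 400.0,
    "LAN to media server": 110.0,
    "SAN to VTL": 700.0,
}
hours, slowest = backup_window_hours(2048, stages)
print(f"bottleneck: {slowest}, window: {hours:.1f} h")
```

In this made-up case, upgrading the SAN or the client disks would not shorten the window at all; only the LAN stage matters until it stops being the slowest.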

Let’s talk about each specific node:

First is the backup server. The backup server is the headquarters for data backup and recovery: it controls all the resources and coordinates the related operations, so it needs powerful hardware to support its busy daily workload. This article does not cover server kernel tuning; those details can be found in each backup software vendor's performance tuning guide. Do not overload the server, otherwise overall backup and restore performance will suffer. Commands such as 'vmstat', 'sar' and 'top' show whether the server is too busy. Network congestion also needs attention: check whether multiple network cards should be aggregated, and make sure that DNS name resolution is not delayed. All of these affect performance.
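One quick way to spot the DNS delay mentioned above is to time name lookups from the backup server. The following is only a sketch: the function names and the 100 ms threshold are my own assumptions, not values from any backup vendor's guide.

```python
import socket
import time


def resolve_time_ms(hostname, resolver=socket.getaddrinfo):
    """Time one name lookup in milliseconds (raises socket.gaierror on failure)."""
    start = time.perf_counter()
    resolver(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def slow_lookups(hostnames, threshold_ms=100.0, resolver=socket.getaddrinfo):
    """Return {hostname: elapsed_ms} for names slower than threshold_ms.

    threshold_ms is an arbitrary illustration value; pick one that suits
    your environment.
    """
    slow = {}
    for name in hostnames:
        elapsed = resolve_time_ms(name, resolver)
        if elapsed > threshold_ms:
            slow[name] = elapsed
    return slow


if __name__ == "__main__":
    # Replace with the backup clients and media servers you care about.
    print(slow_lookups(["localhost"]))
```

Running it against the client and media server names used in the backup configuration shows whether any of them take noticeably long to resolve.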

Second is the media server. A media server communicates directly with the VTL over the fiber network, so it can see the tape drive devices the VTL has assigned to it. It plays a pivotal role in the entire backup and recovery process because it receives backup data streams from network clients while writing data from memory to tape. We therefore need to pay attention to both its inbound and outbound interfaces. Inbound means the server's network cards, possibly with several ports aggregated. Have you increased the TCP window size and the transfer buffer size? If it uses Gigabit Ethernet, have you increased the MTU size? Outbound means the fiber ports leading to the VTL: how many there are, and whether they are load balanced. Is the default frame size of the fiber card big enough? For example, on 32-bit Windows 2003/2008 the default frame size is only 64K; you need to install the appropriate driver and adjust the registry to raise it to 1M or more.
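To see why the MTU question matters, here is a small Python calculation of how much of each Ethernet frame is actual payload. It assumes IPv4 and TCP headers without options (40 bytes) and standard Ethernet framing overhead (38 bytes on the wire); real traffic with TCP options will be slightly less efficient.

```python
ETHERNET_OVERHEAD = 38  # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
TCP_IP_HEADERS = 40     # IPv4 20 + TCP 20, assuming no TCP options


def payload_efficiency(mtu):
    """Fraction of on-the-wire bytes that carry TCP payload for a given MTU."""
    return (mtu - TCP_IP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload")
```

The gain in wire efficiency from jumbo frames is only a few percent; the larger benefit in practice is usually fewer packets for the hosts to process. Every device on the path has to support the larger MTU.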

Next is the backup client, the machine that actually holds the data to be backed up. Apart from the server load and network bandwidth mentioned above, you need to stop the antivirus program during the backup; otherwise the backup will run very slowly. The application server typically mounts storage volumes, so the RAID configuration and LVM volume management are also very important. Good volume management tends to improve overall I/O response time.

Finally, let's look at some aspects of the backup software and databases that deserve attention. We do not suggest enabling compression or encryption in the application or database, because this directly reduces the compression ratio and backup speed on Data Domain. DD itself provides data compression and encryption, so there is no need to turn on those features on the application side. As for multiple data streams, a backup set can be routed to multiple tape devices, so how many streams are reasonable? It usually depends on the number of physical hard disks: one physical disk can feed one data stream. In today's RAID environments, increasing the number of data streams appropriately can improve performance, but too many streams will reduce performance and consume too many system resources. For small files, we recommend snapshot image backup technology; an increased read buffer size can also greatly improve efficiency. Most important of all is the block size of the tape device. The default in many backup products is only about 64K, so be sure to increase the block size to at least 256K.
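As a rough illustration of the block size point, the sketch below counts how many tape write operations are needed to move the same amount of data at 64K and at 256K. It says nothing about the per-operation cost on any particular system; it only shows that the number of operations drops by a factor of four.

```python
def blocks_needed(data_bytes, block_bytes):
    """Number of fixed-size tape blocks needed to hold data_bytes (rounded up)."""
    return -(-data_bytes // block_bytes)  # integer ceiling division


KIB = 1024
TIB = 1024 ** 4

for block_kib in (64, 256):
    count = blocks_needed(1 * TIB, block_kib * KIB)
    print(f"{block_kib}K blocks for 1 TiB: {count:,}")
```

Each block is a separate write request through the HBA, the SAN and the VTL, so fewer and larger blocks mean less per-request overhead for the same data.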

After talking about the different nodes, let's look at which areas of the communication protocols need attention.

For TCP/IP networks, increasing the TCP window size and buffer sizes can improve throughput during data transmission.

· Oracle Solaris

· Set TCPIPWINDOW SIZE to 63k or higher

· Edit the file in_proto.c to adjust the following buffer size

· tcp_default_mss to 1500 MTU

· tcp_sendspace to 16KB or 32KB

· tcp_recvspace to 16KB or 32KB

· AIX: use the 'no' (network options) command to adjust the network parameters

· Use 'no -a' to view the current settings

· When using TCP window sizes ≥ 64 KB, set rfc1323 to 1

· Here are the recommended values for the parameters described in this section:

· lowclust = 200

· lowmbuf = 400

· thewall = 131072

· mb_cl_hiwat = 1200

· sb_max = 1310720

· rfc1323 = 1

· Windows Platform

· WIN2008: [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters] Tcp1323Opts, REG_DWORD, 3

· WINXP/2K3: [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]

· DefaultSendWindow, REG_DWORD, 1048576 (1 MB, decimal)

· DefaultReceiveWindow, REG_DWORD, 1048576 (1 MB, decimal)

· GlobalMaxTcpWindowSize, REG_DWORD, 1048576 (1 MB, decimal)

· TcpWindowSize, REG_DWORD, 1048576 (1 MB, decimal)

· Tcp1323Opts, REG_DWORD, 3

· Linux: check with "cat /proc/sys/net/ipv4/tcp_window_scaling". The value should be 1 (enabled), which allows TCP windows larger than 64K.
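The reason the window size matters so much is the bandwidth-delay product: a sender can have at most one window of data in flight per round trip. The following Python sketch works through that arithmetic. The link speed and round-trip time in the example are illustration values only.

```python
def max_throughput_mb_s(window_bytes, rtt_ms):
    """Upper bound on TCP throughput (MB/s) for a given window and round-trip time."""
    return window_bytes / (rtt_ms / 1000.0) / 1e6


def window_for_link(link_mbit_s, rtt_ms):
    """Bandwidth-delay product in bytes: the window needed to keep the link full."""
    return int(link_mbit_s * 1e6 / 8 * rtt_ms / 1000.0)


# Illustration: Gigabit Ethernet with a 1 ms round trip.
print(f"64K window limit: {max_throughput_mb_s(65535, 1.0):.1f} MB/s")
print(f"window to fill 1 Gbit/s: {window_for_link(1000, 1.0)} bytes")
```

With these example numbers a 64K window cannot fill a Gigabit link, and windows above 64K are only possible when window scaling (RFC 1323) is enabled, which is what the rfc1323, Tcp1323Opts and tcp_window_scaling settings above control.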

For the SAN Network:

1. First, make sure there are no physical port or fiber cable problems. For example, the switch command 'porterrshow' shows whether a port is logging errors such as CRC errors. If a port shows errors frequently, check whether the light level is sufficient with the 'sfpshow' command (Brocade). The recommended value is greater than -7 dBm.

2. Check whether the path between the backup server and the VTL crosses multiple switches; we recommend no more than three. It is also important that the ISL bandwidth is sufficient.

3. Long-distance links require more buffer-to-buffer (B2B) credits on the switch. B2B credits play the same role as the TCP window size: they allow more data to be in flight before an acknowledgment is needed, which reduces transfer overhead.

4. We recommend that the host-side fiber ports connected to the VTL are dedicated to it and not shared with other devices. This avoids unexpected communication failures.

5. Check for slow-drain devices, which we call drag-type devices. An example is a 2G node connected to an 8G SAN. Such a device becomes a bottleneck because it handles the data stream very slowly, and other devices see their performance decline while they wait for it to respond.

6. Zoning configuration is very important. Multiple initiators in one zone can sometimes cause performance problems, because the initiators try to establish connections with each other without success, which degrades performance slightly.
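For point 3, the number of B2B credits a long link needs can be estimated from how many full-size frames fit into one round trip. The sketch below is only a back-of-the-envelope estimate under stated assumptions (about 5 µs per km of fiber one way, full-size frames of 2148 bytes, and a nominal 100 MB/s per Gbit/s of Fibre Channel speed). Always confirm the actual value against your switch vendor's documentation.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0  # assumption: one-way propagation delay in fiber
FULL_FRAME_BYTES = 2148      # assumption: full-size FC frame including headers
MB_PER_S_PER_GBIT = 100.0    # assumption: nominal FC data rate per Gbit/s


def bb_credits_needed(distance_km, speed_gbit):
    """Estimate B2B credits needed to keep a link of the given length busy."""
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    # bytes / (MB/s) gives microseconds, since 1 MB/s is 1 byte per microsecond.
    frame_time_us = FULL_FRAME_BYTES / (speed_gbit * MB_PER_S_PER_GBIT)
    return math.ceil(round_trip_us / frame_time_us)


print(bb_credits_needed(10, 8))  # 10 km link at 8 Gbit/s
```

The estimate assumes every frame is full size; smaller frames occupy less of the link each, so a link carrying many small frames needs more credits than this figure.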

Lastly, let's talk about the circumstances in which Data Domain itself affects performance.

1. DD has hardware problems, such as hard disk or memory problems.

2. A hard drive has failed and the RAID group is rebuilding, which tends to consume a lot of system resources.

3. The garbage collection and copying processes are running at the same time. They take up a lot of resources and reduce backup speed. We recommend that the backup window does not overlap with them.

4. System space is more than 85% full. DD then takes more time to verify data consistency.

5. The VTL fibre ports are not load balanced.

6. VTL has not been fully utilized. The number of concurrent data streams can be increased to improve the overall throughput.

7. DD is too busy and does not have many resources for fast I/O processing. We can use the command 'iostat 2' to monitor this.

In conclusion, when troubleshooting DD virtual tape library performance problems, first make sure that DD itself has no problems with hardware, space usage, system resource load, or fiber port load balancing.

All other bottlenecks are outside DD. The simplest thing to check is that the tape device block size is at least 256K. Then confirm that the fiber network has no performance or configuration problems, that the backup host is not overloaded, and so on. In summary, follow the data flow in one direction when you investigate.

iEMC APJ


dynamox · Answer

What values in iostat would indicate potential issues?

ECN-APJ · Answer

Hi Dynamox,

iostat tells us whether the CPU is busy and whether anything such as cleaning or a disk rebuild is currently running. See the output below for a sample:

First, check the state of DD (CDBVMSFIR): C means cleaning is running and D means a disk rebuild is running. If either is running, expect performance to be somewhat slower than usual.

Then check the NFS operations to see how busy the system is. If it is running under high load, check whether CPU utilization is >= 80% or disk utilization is >= 60%. If CPU and disk utilization are very high, the system may be running out of disk or CPU capacity.

Hope that helps. Thanks.
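The checks described in this reply can be written down as a small Python helper. This is only a sketch: it does not parse real iostat output, and the function name and argument layout are my own. It takes the state letters and the two utilization figures as you read them off the screen and applies the thresholds given above.

```python
def assess_dd_load(state_flags, cpu_pct, disk_pct):
    """Apply the rules of thumb from the reply above.

    state_flags -- the active state letters as a string, e.g. "C" or "CD"
                   (hypothetical input format, typed in by hand)
    cpu_pct     -- CPU utilization in percent
    disk_pct    -- disk utilization in percent
    """
    findings = []
    if "C" in state_flags:
        findings.append("cleaning is running")
    if "D" in state_flags:
        findings.append("disk rebuild is running")
    if cpu_pct >= 80:
        findings.append("CPU utilization is high")
    if disk_pct >= 60:
        findings.append("disk utilization is high")
    return findings


print(assess_dd_load("C", 85, 40))
```

An empty list means none of the conditions from the reply applies, so the slowdown is more likely to be somewhere outside DD.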
