
Ask the Expert summary: NAS System Performance Optimization

Introduction

This article summarizes the 2012 Chinese Ask the Expert (ATE) activity "NAS System Performance Optimization". The original thread is https://community.emc.com/thread/146428.

Detailed information

Question 1:

What factors affect NAS performance?

Answer:

In general, there are five basic factors that can influence NAS performance.

  • Host: The host is a common bottleneck for NAS performance. The server simply cannot move data to or from the back-end storage quickly enough. A busy server may be so flooded with packets that it cannot receive all of them, or it cannot queue the incoming requests in a protocol-specific structure once the network interface receives a packet.

  • Network: An overly congested network slows down both client transmissions and server replies. If the delays caused by network congestion are severe, they lead to timeouts. Properly sizing the TCP windows and keeping the retransmission rate low are very important for NAS performance (a window-sizing sketch follows this list).

  • NAS devices: The NAS device itself is unlikely to become a performance bottleneck unless there is a software bug, because its memory and CPU can usually meet the basic demand. However, some features do affect NAS performance, such as file system snapshots (ckpt), virus protection (CAVA), deduplication, archiving, file system replication, and so on.

  • File system layout: The system offers flexible volume and file system management. There are a variety of volume types and configurations from which you can choose to optimize your file system’s storage potential. You can divide, combine, and group volumes to meet your specific configuration needs. The best deployment is to use all available disk volumes with user-defined storage pools. The documents “Volumes and File Systems Manually” and “Volumes and File Systems AVM” are good references.

  • Back-end storage: This is the major bottleneck when the disks are overloaded. Other causes include improper settings for RAID type, disk type, disk count, cache, high/low water marks, uneven LUN distribution across the SPs, FAST Cache, and FAST VP.
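
As a quick illustration of the network factor above, the TCP window needed to keep a link busy can be estimated from the bandwidth-delay product. The following is a minimal sketch; the link speeds and round-trip times are illustrative assumptions, not measurements from any particular NAS environment.

    # Minimal sketch: estimate the TCP window needed to saturate a link
    # using the bandwidth-delay product. All values are illustrative.
    def bdp_bytes(link_mbps: float, rtt_ms: float) -> float:
        """Bandwidth-delay product in bytes: (bits/s * seconds) / 8."""
        return (link_mbps * 1_000_000) * (rtt_ms / 1000.0) / 8

    if __name__ == "__main__":
        for mbps, rtt in [(1000, 0.5), (1000, 5.0), (10000, 1.0)]:
            window = bdp_bytes(mbps, rtt)
            print(f"{mbps} Mb/s link, {rtt} ms RTT -> "
                  f"window >= {window / 1024:.0f} KiB to keep the pipe full")

A receive window smaller than the bandwidth-delay product caps throughput below line rate, which is why window tuning matters most on high-latency links.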


Question 2:

Could you please introduce the steps for troubleshooting a NAS performance issue?

Answer:

NAS performance is a complicated topic. First, you should understand the issue before troubleshooting. Find out which file systems are affected: is it one or many? Are they all on the same Data Mover (DM), RAID group (RG), LUN, and so on? Use the performance checklist as a good source of questions to ask.

After that, collect the relevant logs, such as support materials, SPCOLLECTs, NAR files, network packet captures (tcpdumps), and so on.

Finally, analyze the logs to look for the bottlenecks. This requires strong fundamentals, such as an understanding of CIFS, NFS, TCP, and other protocols. Once the analysis is complete, you can identify the solution. For example, if the root cause is network congestion, the user may need to upgrade their network equipment, or force the sender to decrease its TCP send window.
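
To make the packet-analysis step concrete, the sketch below counts TCP retransmissions in a capture file. It assumes the third-party pyshark package (a wrapper around tshark) is installed, and "capture.pcap" is a placeholder file name; neither comes from the original thread.

    # Minimal sketch: compute the TCP retransmission rate of a capture.
    # Assumes pyshark (and tshark) are installed; file name is a placeholder.
    import pyshark

    def retransmission_rate(pcap_path: str) -> float:
        total = sum(1 for _ in pyshark.FileCapture(pcap_path, keep_packets=False))
        retrans = sum(1 for _ in pyshark.FileCapture(
            pcap_path,
            display_filter="tcp.analysis.retransmission",
            keep_packets=False))
        return retrans / total if total else 0.0

    if __name__ == "__main__":
        print(f"TCP retransmission rate: {retransmission_rate('capture.pcap'):.3%}")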


Question 3:

What kind of tools can monitor NAS performance?

Answer:

Selecting the proper management tool is crucial to NAS monitoring, since the right tool facilitates management tasks, reduces NAS device downtime, and makes the NAS device easier to use, maintain, and expand. Here are some NAS monitoring tools.

Back-end storage: collect NAR files and analyze them.


NAS devices:

  • For VNX, I recommend enabling “Statistics for File” in Unisphere.

  • For Celerra, I recommend opening “Celerra Monitor” in Unisphere. Meanwhile, the server_stats command displays sets of statistics that are running on the specified Data Mover.

Network packet captures: Wireshark.

Host: Iometer.
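
If Iometer is not at hand, a crude host-side throughput check can be scripted. The sketch below is a rough stand-in only; the mount path and transfer sizes are illustrative assumptions, and a real benchmark tool remains the better choice.

    # Crude host-side sequential-write check (a stand-in for Iometer).
    # The target path (an NFS/CIFS mount) and sizes are assumptions.
    import os
    import time

    def write_throughput_mbps(path: str, total_mb: int = 256, block_kb: int = 64) -> float:
        block = os.urandom(block_kb * 1024)
        start = time.monotonic()
        with open(path, "wb") as f:
            for _ in range((total_mb * 1024) // block_kb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force data to the NAS, not just the page cache
        return total_mb / (time.monotonic() - start)

    if __name__ == "__main__":
        print(f"Sequential write: {write_throughput_mbps('/mnt/nas/testfile'):.1f} MB/s")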


NAS performance case sharing (1)

Description:

This case comes from a network TV company. The NAS write performance was very good, but the read performance was poor.

Analysis:

I captured the network packets using Wireshark on the Windows client and found that the NAS itself served reads well and the network traffic was not congested. However, packets were frequently retransmitted because they arrived out of order as the client passed them from the network layer up to the TCP layer. Packet retransmission can have a large impact on NAS performance; even a 0.1% retransmission rate may seriously affect it.
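
The impact of that 0.1% figure can be sanity-checked with the well-known Mathis et al. approximation for loss-limited TCP throughput, roughly (MSS / RTT) × (1 / √p), where p is the loss (retransmission) probability. The segment size and RTT below are illustrative assumptions, and the constant factor (close to 1) is dropped for simplicity.

    # Worked example: Mathis approximation for TCP throughput under loss.
    # MSS and RTT are illustrative; the constant factor (~1) is omitted.
    import math

    def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
        return (mss_bytes * 8 / rtt_s) / math.sqrt(loss)

    if __name__ == "__main__":
        mss, rtt = 1460, 0.001           # 1460-byte segments, 1 ms LAN RTT
        for p in (0.0001, 0.001, 0.01):  # 0.01%, 0.1%, 1% retransmission
            print(f"loss {p:.2%}: ceiling ~{mathis_throughput_bps(mss, rtt, p) / 1e6:.0f} Mb/s")

On a gigabit LAN with a 1 ms RTT, a 0.1% retransmission rate caps a single TCP stream at roughly 370 Mb/s, consistent with the claim above.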

Solution:

In the test environment, the customer chose to reinstall the Windows client. The network retransmissions disappeared after reinstallation, and the NAS performance improved greatly.


NAS performance case sharing (2)

Description:

The NAS performance with a single client was good, but performance declined rapidly when multiple clients read from and wrote to the NAS device simultaneously.

Analysis:

I captured the network packets using Wireshark on the Windows client and noticed that the client status and network status were good, but the NAS read performance declined rapidly as the client count increased. After analyzing the logs, I found that the file system was built on a single LUN, which caused the poor performance.

Solution:

Build LUNs on different RAID groups to make a stripe volume, then build the file system on that stripe volume.
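
To see why the stripe volume helps, here is a toy sketch of how sequential extents map round-robin across member LUNs, so concurrent clients land on different RAID groups instead of queuing behind one. The stripe depth and LUN count are illustrative assumptions, not the customer's actual configuration.

    # Toy sketch: a stripe volume maps extents round-robin across LUNs.
    # Stripe depth and LUN count are illustrative assumptions.
    STRIPE_KB = 256
    LUNS = ["LUN0 (RG0)", "LUN1 (RG1)", "LUN2 (RG2)", "LUN3 (RG3)"]

    def lun_for_offset(offset_kb: int) -> str:
        return LUNS[(offset_kb // STRIPE_KB) % len(LUNS)]

    if __name__ == "__main__":
        for off in range(0, 8 * STRIPE_KB, STRIPE_KB):
            print(f"offset {off:5d} KB -> {lun_for_offset(off)}")
        # On a single-LUN volume every offset maps to the same RAID group,
        # so all clients queue behind one set of disks.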


NAS performance case sharing (3)

Description:

There were two file systems on the NAS, but the performance of file system LaserCT500test was better than that of file system LaserCT500.

Analysis:

File system LaserCT500:


LaserCT500 was built on four RAID 1/0 LUNs. The LUNs were placed in two different RAID groups (two LUNs in each), and all of the LUNs belonged to SPA.

[Image: a_1.jpg]

The user hand-picked four LUNs to build a file system instead of using the AVM tool. 

[Image: a_2.jpg]

File system LaserCT500test:

The LaserCT500test volume and file system were created with the AVM tool. AVM picked four LUNs from different RAID 5 (4+1) groups for LaserCT500test, so the file system spans 5 × 4 = 20 disks.

[Image: a_3.jpg]

[Image: a_4.jpg]

Solution:

The conclusion from the analysis is that the file system LaserCT500 was created on an irrational volume structure. The utilization of its four disks was 100% and single-disk I/O exceeded 320/s; generally speaking, the single-disk I/O of a 15K RPM FC disk should not exceed 180/s. I strongly recommend that the user create volumes and file systems on a rational volume structure.
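
The disk arithmetic behind this conclusion is easy to reproduce. In the sketch below, the aggregate workload figure is an illustrative assumption; the 320/s observation and the ~180 IOPS ceiling for a 15K RPM FC disk come from the analysis above.

    # Worked example: per-disk load for the two layouts in this case.
    # The aggregate workload (1,300 IOPS) is an illustrative assumption.
    DISK_CEILING_IOPS = 180  # rule of thumb for a 15K RPM FC disk

    def per_disk_iops(total_iops: float, disks: int) -> float:
        return total_iops / disks

    if __name__ == "__main__":
        workload = 1300  # assumed aggregate IOPS against the file system
        layouts = [("LaserCT500 (4 busy disks)", 4),
                   ("LaserCT500test (4 x RAID 5 (4+1) = 20 disks)", 20)]
        for name, disks in layouts:
            load = per_disk_iops(workload, disks)
            flag = "overloaded" if load > DISK_CEILING_IOPS else "ok"
            print(f"{name}: {load:.0f} IOPS/disk ({flag})")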


NAS performance case sharing (4)

Description:

A cloud computing company used a VNX NFS file system as a VMware datastore. Deploying a VM from a VM template previously took 10 minutes, but in recent deployments the time suddenly increased to 20 minutes.

Analysis:

File system layout performance:

The NFS file system was built on 8 LUNs from different RAID 1/0 groups, and each RAID group had two SAS disks. The file system layout looked good.

Data Mover Performance:

The DM CPU and memory were quite free, but some NFSv3 read/write response times were very long, sometimes even several seconds. During a VM deployment, NFSv3 write I/O was about 5,000-8,000 IOPS and NFSv3 read I/O was 9,000-15,000 IOPS. Where did all those read I/Os come from? I found that the user's VM template also resided on an NFS datastore. After I deleted a relatively large VMDK, read I/O dropped to 5,000 IOPS.

Network Performance:

There was no packet loss, and the retransmission rate was very low.

VNX back-end storage performance:

The NAR file data showed that the NFS file system was built on 8 LUNs and the response time was very fast. However, we found that the dirty cache on SPA had reached 100% while SPB's status was healthy. SPA had to handle roughly double the I/O of SPB. Also:

• The LUNs had been evenly distributed between SPA and SPB, but the LUNs owned by SPA received more I/O than those owned by SPB.

• The major operation on SPA was writes.

• Most of the heavily loaded LUNs sat on NL-SAS disks, which were overloaded.

 

In conclusion, the issue was caused by overloaded disks on the back-end storage. This slowed write cache flushing on SPA, which in turn slowed NFS write operations.
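
A quick way to confirm this kind of SP imbalance from NAR data is to sum per-LUN IOPS by owning SP. The per-LUN numbers below are made-up illustrative data, not the customer's actual NAR statistics.

    # Minimal sketch: aggregate per-LUN IOPS by owning SP to expose imbalance.
    # The per-LUN figures are made-up illustrative data.
    from collections import defaultdict

    lun_stats = [  # (LUN, owning SP, write IOPS, read IOPS) -- assumptions
        ("LUN0", "SPA", 2400, 300),
        ("LUN1", "SPA", 2100, 250),
        ("LUN2", "SPB", 900, 400),
        ("LUN3", "SPB", 800, 350),
    ]

    totals = defaultdict(int)
    for _lun, sp, writes, reads in lun_stats:
        totals[sp] += writes + reads

    for sp, iops in sorted(totals.items()):
        print(f"{sp}: {iops} total IOPS")
    # SPA carrying roughly double SPB's load matches the dirty-cache
    # saturation seen on SPA in this case.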

Solution:

Solution A: The LUNs used for the NFS file system on SPA should be transferred to SPB.

Solution B:

  (1) Analyze the workload of the LUNs, and then distribute the workload evenly between SPA and SPB.

  (2) Add more disks to the storage, and then migrate the busy LUNs to the new LUNs to improve NAS performance.

  (3) Add additional disks to the pool and distribute data evenly across all disks to improve performance.

  (4) Replace RAID 6 with RAID 5 (see the write-penalty sketch below).

  (5) Install the NFS VAAI plug-in.
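
The reasoning behind step (4) is the classic RAID small-write penalty: each small host write costs 2 back-end I/Os on RAID 1/0, 4 on RAID 5, and 6 on RAID 6. The host write rate below is an illustrative assumption in the range seen during the VM deployments above.

    # Worked example: back-end disk I/Os generated per host write for
    # common RAID levels (the small-write penalty). Workload is assumed.
    WRITE_PENALTY = {"RAID 1/0": 2, "RAID 5": 4, "RAID 6": 6}

    def backend_write_iops(host_write_iops: float, raid: str) -> float:
        return host_write_iops * WRITE_PENALTY[raid]

    if __name__ == "__main__":
        host_writes = 5000  # assumed NFSv3 write IOPS during deployment
        for raid in ("RAID 1/0", "RAID 5", "RAID 6"):
            print(f"{raid}: {backend_write_iops(host_writes, raid):,.0f} back-end write I/Os")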

Author: Jeffey Liu

iEMC APJ

Please click here for all content shared by us.