Figure 1 shows the reference architecture of the solution. The topmost server is a Dell EMC PowerEdge R640 management server, which runs the Integrated Manager for Lustre (IML).
Furthermore, 2 x PowerEdge R740 servers are used as Metadata Servers (MDS). This MDS pair is configured active-active (with DNE) or active-passive (without DNE) for high availability (HA). The MDS pair is attached via 12Gbps SAS links to a PowerVault ME4024, a 2U storage array, which hosts the metadata targets (MDTs).
We then have 2 x PowerEdge R740 servers used as Object Storage Servers (OSS). This OSS pair is configured active-active for HA. The OSS pair is attached via 12Gbps SAS links to 4 x PowerVault ME4084 storage arrays (5U each), which host the object storage targets (OSTs) for the Lustre file system.
At the time of release, the HPC interconnect used in the solution for the Lustre network (LNet) is InfiniBand EDR. The solution will also support Intel OmniPath for LNet, which will be validated and performance tested in the near future.
Table 1 briefly describes the hardware specifications and software versions validated for the solution. For more information, refer to the complete solution configuration guide which your Dell sales representative can make available to you.
The new ME4 arrays impose a restriction of a maximum of 16 drives in a linear RAID-10 disk group. On the ME4024, which we use as metadata storage in this solution, this restriction leaves us with the two options below for this release.
1) Fully populated array with 24 drives: An optimal way to use all 24 drives for metadata is to create 2 MDTs and use the DNE feature. With this option, we have two MDTs of equal capacity. Each MDT is a linear RAID-10 of 10 drives, and the remaining 4 drives in the array are configured as global hot spares across the 2 MDTs. This is a good option if higher metadata performance is a priority and populating all 24 drive slots in the array is not a concern.
2) Half populated array with 12 drives: By populating only 50% of the array and creating a single MDT (no DNE), we have 1 x linear RAID-10 of 10 drives with 2 drives as hot spares. This is an option to consider if DNE or the highest metadata performance is not a priority, or as a tradeoff between performance and the cost of fully populating the array. This option still achieves high metadata performance, although not as high as option 1.
Please refer to the Performance Evaluation section for more details on the metadata performance of the two options and a comparison between them. For this release, and for purposes of validation and performance testing, 960GB SAS SSDs are used in the ME4024.
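As a rough illustration (the arithmetic is ours, not from the blog), the usable metadata capacity under each option can be sketched from the 960GB drive size above, since a linear RAID-10 of 10 drives (5 mirrored pairs) yields 5 drives' worth of usable space:

```shell
# Sketch of usable metadata capacity per option, assuming the 960GB SAS SSDs
# quoted above; a linear RAID-10 of 10 drives (5 mirrored pairs) exposes
# 5 drives of usable space
DRIVE_GB=960
MDT_GB=$((5 * DRIVE_GB))                             # per-MDT usable capacity
echo "Option 1: 2 MDTs x ${MDT_GB} GB = $((2 * MDT_GB)) GB usable metadata"
echo "Option 2: 1 MDT  x ${MDT_GB} GB usable metadata"
```

Raw formatted capacities will differ slightly once file system and array overheads are accounted for.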
Linear RAID-6 disk groups of 10 drives (8+2) each are created on each array, resulting in 8 disk groups per array. Since each array has 84 drives, after creating the 8 x RAID-6 disk groups the remaining 4 drives are configured as global hot spares across the 8 disk groups in the array. From each disk group, a single volume using all of the space is created.
As a result, there will be a total of 32 x RAID-6 volumes across 4 x ME4084 in a base configuration shown in Figure 1. Each of these RAID-6 volumes will be configured as an OST for the Lustre file system, resulting in a total of 32 OSTs across the file system.
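The layout arithmetic above can be sanity-checked with a few lines of shell (an illustration of the counts already stated, not a configuration script):

```shell
# Sanity check of the OST layout arithmetic described above
DRIVES=84       # drives per ME4084
GROUP=10        # drives per RAID-6 (8+2) disk group
NGROUPS=8       # disk groups per array
ARRAYS=4        # ME4084 arrays in the base configuration
SPARES=$((DRIVES - NGROUPS * GROUP))   # drives left over per array
OSTS=$((ARRAYS * NGROUPS))             # one volume (hence one OST) per group
echo "global hot spares per ME4084: $SPARES"
echo "total OSTs across the file system: $OSTS"
```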
IML is an intuitive web GUI maintained by Whamcloud that reduces the complexity involved with installation, configuration, and maintenance. It aids maintenance and upgrades by allowing health monitoring of the different file system components, and it helps monitor file system performance by displaying real-time performance charts on its dashboard. IML also helps configure and maintain HA on the Lustre servers as well as the Lustre network (LNet).
This section presents an initial performance evaluation that helps characterize Dell EMC Ready Solution for HPC Lustre with ME4 and EDR. For further details and updates, please look for a white paper that will be published at a later time.
The goal is to study the system based on data performance as well as metadata performance. The solution is tested for sequential read and write throughput, random read and write IOPS, and metadata operations. Table 2 describes the client cluster configuration used as Lustre clients for performance studies presented in this blog.
Table 2: Client cluster configuration
| Number of physical nodes | 32 |
| Server | Dell PowerEdge C6320 |
| Processor | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz |
| Operating system | Red Hat Enterprise Linux (RHEL) 7.5 |
| Mellanox OFED version | |
| Lustre file system | |
7. Sequential Reads and Writes N-N
To evaluate sequential reads and writes, the IOzone benchmark (version 3.465) was used in sequential read and write mode. These tests were conducted at multiple thread counts, starting at 1 thread and increasing in powers of 2 up to 256 threads. At each thread count, a number of files equal to the thread count was generated, since this test works on one file per thread (the N-N case). The threads were distributed across 32 physical client nodes in a round-robin fashion.
Throughput results were converted to GB/s from the kB/s reported by the tool. An aggregate file size of 2TB was selected, divided equally among the threads within any given test. The aggregate file size was chosen large enough to minimize caching effects from the servers as well as from the Lustre clients. OS caches were also dropped on the client nodes between tests and iterations, as well as between writes and reads. For all of these tests, the Lustre stripe size was 1MB. The stripe count was 1 for thread counts >= 32; for thread counts < 32, the files were striped across all 32 OSTs with a stripe count of 32, so that all OSTs stay engaged even at low thread counts. In addition, the following Lustre client-side tunings were applied for this configuration and workload:
lctl set_param osc.*.checksums=0
lctl set_param timeout=600
lctl set_param at_min=250
lctl set_param at_max=600
lctl set_param ldlm.namespaces.*.lru_size=2000
lctl set_param osc.*OST*.max_rpcs_in_flight=256
lctl set_param osc.*OST*.max_dirty_mb=1024
lctl set_param osc.*.max_pages_per_rpc=1024
Commands used for the sequential N-N tests:
Sequential Writes : iozone -i 0 -c -e -w -r 1024K -s $Size -t $Thread -+n -+m /path/to/threadlist
Sequential Reads : iozone -i 1 -c -e -w -r 1024K -s $Size -t $Thread -+n -+m /path/to/threadlist
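The two iozone invocations above can be wrapped in a small driver loop that computes the equal per-thread share of the 2TB aggregate at each thread count. This is a sketch of our own, not the exact script used for the tests; the iozone lines are left commented since they require a live Lustre mount and a per-thread-count host file:

```shell
# Hypothetical driver loop for the sequential N-N tests: derives $Size for
# each thread count from the fixed 2TB aggregate used in this study
AGG_KB=$((2 * 1024 * 1024 * 1024))   # 2TB aggregate, expressed in kB
for T in 1 2 4 8 16 32 64 128 256; do
    SIZE_KB=$((AGG_KB / T))          # equal share per thread
    echo "threads=$T size-per-thread=${SIZE_KB}k"
    # iozone -i 0 -c -e -w -r 1024K -s ${SIZE_KB}k -t $T -+n -+m /path/to/threadlist
    # iozone -i 1 -c -e -w -r 1024K -s ${SIZE_KB}k -t $T -+n -+m /path/to/threadlist
done
```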
Figure 2: Sequential IOzone 2TB Aggregate Data Size
In Figure 2, we see that the peak throughput of the system is attained at 32 threads. The peak write is 21.27 GB/s and the peak read is 22.56 GB/s.
The single-thread write performance is observed to be 622 MB/s, and the read at 643 MB/s. Performance scales almost linearly with thread count up to 32 threads, where the system attains its peak. After this, writes saturate as we scale, while reads dip at 64 threads and saturate thereafter. This suggests that the overall sustained performance of this configuration, for reads as well as writes, is ≈ 20GB/s, with the peaks as mentioned above.
We also see that reads are very close to, or slightly lower than, writes as the thread count increases. We ran server-storage tests (results not included in this blog) using obdfilter-survey, a tool included in the Lustre distribution. These helped us understand that the read performance of this configuration could be better if there were more threads than objects (the N-M case, with N threads, M objects, and N > M), thereby presenting a higher queue depth at the ME4 controllers.
8. Random Reads and Writes N-N
To evaluate random IO performance, IOzone version 3.465 was used in random mode. Tests were conducted at thread counts from 16 up to 256 in powers of 2. The aggregate file size was 1TB across all thread counts, to minimize caching effects, and it was divided equally amongst the threads in any given test. The IOzone host file was arranged to distribute the workload evenly across the compute nodes. A Lustre stripe count of 1 and stripe size of 4MB were used. A 4KB request size was used because it aligns with Lustre's 4KB file system block size and is representative of small-block accesses in a random workload. Performance is measured in I/O operations per second (IOPS). The OS caches were dropped between runs on the Lustre servers as well as the Lustre clients.
Command used for random reads and writes: iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist
Figure 3: Random read and write performance
In Figure 3, we see that random writes peak at 256 threads with 28.53K IOPS, increasing slowly after 32 threads, while random reads show a steady climb as threads increase, peaking at 34.06K IOPS at 256 threads; read IOPS in particular rise rapidly from 32 to 256 threads. However, write IOPS are higher than read IOPS for thread counts < 256.
Through our interaction with the storage team, who helped analyze the storage array logs for the random IO cases, we learned that this behavior of the ME4 arrays is expected for this kind of workload. Based on their explanation, the ME4 arrays are optimized for a higher queue depth at the controller level for both reads and writes: the more IO requests outstanding at a time, the higher the performance. For writes, the write-back cache allows a larger number of commands to be queued at the controller, which results in more write operations in flight than reads. This could be one of the factors allowing writes to outperform reads at lower thread counts.
However, as the queues grow larger, the read-modify-write nature of the writes reaches a point where they stop scaling before the reads do. This is why reads improve by a larger factor than writes as we scale.
9. Metadata Performance Study
To evaluate the metadata performance of the system, the MDTest tool (version 1.9.3) was used with the Intel MPI distribution. As described in the metadata targets section, both options were tested and compared.
While using DNE with 2 MDTs, directory striping was used, and the subdirectories within the parent directory were distributed across the MDTs in a round-robin fashion. Once these tests were completed, we also tested the single-MDT case without DNE. A comparison of metadata performance results for a single MDT and for 2 MDTs with directory striping is presented below.
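A DNE setup of this kind is typically created with the `lfs setdirstripe` command from the Lustre client tools; the sketch below illustrates the idea, with the path and directory names being our own illustrative choices rather than those used in the tests. The `lfs` line is commented because it requires a live Lustre mount with 2 MDTs:

```shell
# Hypothetical sketch: stripe a parent directory across both MDTs, after
# which new subdirectories are distributed round-robin between them
# lfs setdirstripe -c 2 /mnt/lustre/mdtest_parent   # stripe count 2 = 2 MDTs
for i in 0 1 2 3; do
    # mkdir /mnt/lustre/mdtest_parent/subdir_$i
    echo "subdir_$i -> MDT$((i % 2))"               # expected round-robin placement
done
```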
We used the 32 Lustre clients described in Table 2 for the metadata tests as well. MDTest was used to analyze file metadata performance by recording file create, file stat, and file remove operations. To understand how well the system scales, and to compare the different thread counts on similar ground, we tested from a single thread up to 1024 threads with a consistent 2 million file count for each case. Table 3 below describes the number of files per directory and the number of directories per thread for every thread count. The same table and testing methodology were used for both the 2-MDT and single-MDT test cases. Three iterations of each test were run and the mean value recorded.
Table 3: MDTest files and directory distribution across threads
| # Files per directory | # Directories per thread | Total number of files |
Command used for the test:
mpirun -np $Threads -rr --hostfile /share/mdt_clients/mdtlist.$Threads /share/mdtest/mdtest.intel -v -d /mnt/lustre/perf_test24-1M -i $Reps -b $Dirs -z 1 -L -I $Files -y -u -t -F
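The fixed 2M-file workload means each thread handles a progressively smaller share as the thread count grows. As an illustration of this split (assuming 2M means 2^21 = 2,097,152 files; the exact directories-per-thread and files-per-directory breakdown follows Table 3):

```shell
# Sketch: per-thread share of the fixed 2M-file MDTest workload, assuming
# 2M = 2^21 = 2097152 files (an assumption on our part; Table 3 gives the
# authoritative directory/file split per thread count)
TOTAL=2097152
for T in 1 16 256 1024; do
    FPT=$((TOTAL / T))               # files handled per thread
    echo "threads=$T files-per-thread=$FPT"
done
```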
Figure 4: File MDTest 2M files, 2 MDTs in DNE
Figure 4 shows the file metadata stats for the 2-MDT case. The peak file create was found to be ≈ 90K Ops/sec, peak file stat ≈ 1237K Ops/sec, and peak file remove ≈ 596K Ops/sec. There is a near-linear increase in all three recorded file operations as we scale from 1 thread to 1024 threads. File stat operations scale better than remove and create, since stat is the lightest metadata operation; conversely, file create and file remove operations are slower than stat because the OSTs are also involved.
Figure 5: File MDTest 2M files, 1 MDT or no DNE
Figure 5 shows the results of metadata performance on a single MDT with no DNE. File create operations peak at ≈ 60K Ops/sec, file remove operations at ≈ 239K Ops/sec, and file stat operations at ≈ 668K Ops/sec.
Figure 6: File create operations: 1 MDT versus 2 MDTs
Figure 6 compares file create operations on a single MDT versus 2 MDTs in DNE. Both scale similarly; however, file create operations with 2 MDTs peak at around 90K Ops/sec, whereas with a single MDT they peak at around 60K Ops/sec. File create operations thus show a 50% improvement with 2 MDTs compared to a single MDT, thanks to the directory striping feature.
Figure 7: File stat operations: 1 MDT versus 2 MDTs
Figure 7 compares file stat operations on a single MDT versus 2 MDTs. The two cases track each other closely up to 64 threads; from 64 threads onward, a difference appears, and the delta grows as we scale further. The peak file stat with a single MDT is around 668K Ops/sec, whereas the peak with 2 MDTs is 1237K Ops/sec. The 2-MDT case with DNE shows an 85.2% improvement in file stat operations compared to the single-MDT case.
Figure 8: File remove operations: 1MDT versus 2 MDT
Figure 8 compares file remove operations on a single MDT versus 2 MDTs. A difference between the two is visible as we scale from 1 thread to 1024 threads, and becomes more evident starting at 16 threads. The single MDT peaks at around 239K Ops/sec, whereas 2 MDTs with directory striping peak around 596K Ops/sec at 1024 threads. That is an improvement of around 149.5% in file remove Ops/sec with 2 MDTs in DNE compared to a single MDT.
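The improvement percentages quoted in the three comparisons above can be re-derived from the peak figures (a quick check of our own; integer shell math truncates, so the blog's 149.5% figure presumably uses the unrounded peaks):

```shell
# Sketch: percent improvement of the 2-MDT (DNE) peaks over the single-MDT
# peaks, using the peak KOps/sec figures quoted above
for pair in "create 90 60" "stat 1237 668" "remove 596 239"; do
    set -- $pair                      # fields: operation, 2-MDT peak, 1-MDT peak
    IMP=$(( ($2 - $3) * 100 / $3 ))   # truncated percent improvement
    echo "$1: ~${IMP}% improvement with 2 MDTs"
done
```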
10. Conclusion and Future Work
The refresh of the Dell EMC Ready Solution for HPC Lustre Storage using PowerVault ME4 arrays has been introduced, and an initial performance characterization presented. The solution aims to meet requirements for higher performance and scalability, with a high sustained sequential throughput of around 20GB/s. Random writes peak at 28.53K IOPS and random reads at 34.06K IOPS. Metadata performance in this configuration takes advantage of the SSDs in the MDTs as well as the 2-MDT DNE configuration, showing 90K Ops/sec for file create, 1237K Ops/sec for file stat, and 596K Ops/sec for file remove operations.
As a next step, we will validate and evaluate the performance of this solution with Intel OmniPath serving as the Lustre network. A white paper describing the solution with both Mellanox InfiniBand EDR (presented in this blog) and Intel OmniPath is expected to be published after the validation and evaluation process is complete.
* Dell EMC Storage for HPC with Lustre: Solution Configuration Guide (please contact your Dell EMC sales representative for this document)
* IML: http://wiki.lustre.org/Integrated_Manager_for_Lustre
* ME4 Series Storage System Administrator's Guide: https://www.dell.com/support/home/product-support/product/powervault-me4024/manuals
* Mario Gallegos from the HPC Engineering team, for his guidance in efficiently adapting the ME4 into the Lustre solution
* The Whamcloud Lustre Engineering team, for their support on Lustre-related questions
* The Seagate Customer Integration Engineering team, for their support on storage-related questions