
Dell EMC Ready Solution for HPC PixStor Storage - Capacity Expansion



Authored by Mario Gallegos of HPC and AI Innovation Lab in April 2020

Table of Contents

  1. Introduction
    1. Solution Architecture
    2. Solution Components
  2. Performance Characterization
    1. Sequential IOzone Performance N clients to N files
    2. Sequential IOR Performance N clients to 1 file
    3. Random small blocks IOzone Performance N clients to N files
    4. Metadata performance with MDtest using empty files
    5. Metadata performance with MDtest using 4 KiB files
  3. Conclusions and Future Work


 


Introduction

Today’s HPC environments have growing demands for very high-speed storage that frequently also requires high capacity and distributed access via several standard protocols like NFS, SMB, and others. Those demanding HPC requirements are typically covered by parallel file systems, which provide concurrent access to a single file or a set of files from multiple nodes, distributing data very efficiently and securely to multiple LUNs across several servers.

 

Solution Architecture

This blog is a continuation of the Parallel File System (PFS) solution for HPC environments, the Dell EMC Ready Solution for HPC PixStor Storage, in which PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture, depicting the SAS capacity-expansion additions to the existing PowerVault ME4084 storage arrays.
The PixStor solution includes the widespread General Parallel File System, also known as Spectrum Scale, as the PFS component, in addition to many other Arcastream software components such as advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities, and many others.


Figure 1: Reference Architecture.

 

Solution Components

This solution is planned to be released with the latest 2nd Generation Intel Xeon Scalable CPUs (a.k.a. Cascade Lake), and some of the servers will use the fastest RAM available to them (2933 MT/s). However, due to the hardware available for the solution prototype, servers with 1st Generation Intel Xeon Scalable CPUs (a.k.a. Skylake) and, in some cases, slower RAM were used to characterize this system. Since the bottleneck of the solution is at the SAS controllers of the Dell EMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with the envisioned Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.1.4), which supports RHEL 7.7 and OFED 4.7, for characterizing the system.

Because of the situation described above, Table 1 lists the main components of the solution. Where discrepancies were introduced, the At Release column has the components available to customers at release time, and the Test Bed column has the components actually used for characterizing the performance of the solution. The drives listed for data (12TB NL SAS) and metadata (960GB SSD) are the ones used for performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations.

Finally, for completeness, the list of possible data HDDs and metadata SSDs is included, based on the drives enumerated in the Dell EMC PowerVault ME4 support matrix, available online.

Table 1: Components used at release time and those used in the test bed

Internal Connectivity (At Release and Test Bed):
    Dell Networking S3048-ON Gigabit Ethernet

Data Storage Subsystem (At Release and Test Bed):
    1x to 4x Dell EMC PowerVault ME4084
    1x to 4x Dell EMC PowerVault ME484 (one per ME4084)
    80x 12TB 3.5" NL SAS3 HDDs (options: 900GB @15K, 1.2TB @10K, 1.8TB @10K, 2.4TB @10K, 4TB NLS, 8TB NLS, 10TB NLS, 12TB NLS)
    8 LUNs, linear 8+2 RAID 6, chunk size 512KiB
    4x 1.92TB SAS3 SSDs for metadata, as 2x RAID 1 (or as 4 global HDD spares if the optional High Demand Metadata Module is used)

Optional High Demand Metadata Storage Subsystem (At Release and Test Bed):
    1x to 2x Dell EMC PowerVault ME4024 (4x ME4024 if needed, Large config only)
    24x 960GB 2.5" SAS3 SSDs (options: 480GB, 960GB, 1.92TB)
    12 LUNs, linear RAID 1

RAID Storage Controllers (At Release and Test Bed):
    12 Gbps SAS

Capacity as configured (At Release and Test Bed):
    Raw: 8064 TB (7334 TiB or 7.16 PiB)
    Formatted: ~6144 TB (5588 TiB or 5.46 PiB)

Processor:
    Gateway
        At Release: 2x Intel Xeon Gold 6230, 2.1 GHz, 20C/40T, 10.4 GT/s, 27.5M cache, Turbo, HT (125W), DDR4-2933
        Test Bed: N/A
    High Demand Metadata (At Release and Test Bed): 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores
    Storage Node (At Release and Test Bed): 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores
    Management Node
        At Release: 2x Intel Xeon Gold 5220, 2.2 GHz, 18C/36T, 10.4 GT/s, 24.75M cache, Turbo, HT (125W), DDR4-2666
        Test Bed: 2x Intel Xeon Gold 5118 @ 2.30 GHz, 12 cores

Memory:
    Gateway
        At Release: 12x 16GiB 2933 MT/s RDIMMs (192 GiB)
        Test Bed: N/A
    High Demand Metadata (At Release and Test Bed): 24x 16GiB 2666 MT/s RDIMMs (384 GiB)
    Storage Node (At Release and Test Bed): 24x 16GiB 2666 MT/s RDIMMs (384 GiB)
    Management Node
        At Release: 12x 16GiB 2666 MT/s DIMMs (192 GiB)
        Test Bed: 12x 8GiB 2666 MT/s RDIMMs (96 GiB)

Operating System:
    At Release: Red Hat Enterprise Linux 7.6
    Test Bed: Red Hat Enterprise Linux 7.7

Kernel version:
    At Release: 3.10.0-957.12.2.el7.x86_64
    Test Bed: 3.10.0-1062.9.1.el7.x86_64

PixStor Software:
    At Release: 5.1.0.0
    Test Bed: 5.1.1.4

Spectrum Scale (GPFS):
    At Release: 5.0.3
    Test Bed: 5.0.4-2

High Performance Network Connectivity:
    At Release: Mellanox ConnectX-5 dual-port InfiniBand EDR/100 GbE, and 10 GbE
    Test Bed: Mellanox ConnectX-5 InfiniBand EDR

High Performance Switch:
    At Release: 2x Mellanox SB7800 (HA, redundant)
    Test Bed: 1x Mellanox SB7700

OFED Version:
    At Release: Mellanox OFED-4.6-1.0.1.0
    Test Bed: Mellanox OFED-4.7-3.2.9

Local Disks (OS & Analysis/monitoring):
    At Release:
        All servers except Management node: 3x 480GB SAS3 SSDs (RAID 1 + HS) for OS; PERC H730P RAID controller
        Management node: 3x 480GB SAS3 SSDs (RAID 1 + HS) for OS; PERC H740P RAID controller
    Test Bed:
        All servers except Management node: 2x 300GB 15K SAS3 HDDs (RAID 1) for OS; PERC H330 RAID controller
        Management node: 5x 300GB 15K SAS3 HDDs (RAID 5) for OS & Analysis/monitoring; PERC H740P RAID controller

Systems Management (At Release and Test Bed):
    iDRAC 9 Enterprise + Dell EMC OpenManage
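The raw and formatted capacity figures in Table 1 follow from simple unit arithmetic between decimal terabytes (as drives are quoted) and binary TiB/PiB. A quick sketch, under the assumption that the 8064 TB raw figure corresponds to 672 x 12 TB HDDs (8 enclosures x 84 slots):

```python
# Convert the Table 1 capacity figures between decimal TB and binary TiB/PiB.
# Drive count is an assumption for illustration: 8 enclosures x 84 slots = 672 HDDs.
TB = 10**12   # bytes in a decimal terabyte
TiB = 2**40   # bytes in a binary tebibyte

raw_bytes = 672 * 12 * TB
raw_tib = raw_bytes / TiB
print(f"Raw: {672 * 12} TB = {raw_tib:.0f} TiB = {raw_tib / 1024:.2f} PiB")

# Formatted capacity after RAID 6 (8+2) and file system overhead, as reported.
fmt_tib = 6144 * TB / TiB
print(f"Formatted: ~6144 TB = {fmt_tib:.0f} TiB = {fmt_tib / 1024:.2f} PiB")
```

Running this reproduces the 7334 TiB / 7.16 PiB raw and 5588 TiB / 5.46 PiB formatted values in the table.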

 

Performance Characterization

To characterize this new Ready Solution, we used the hardware specified in the Test Bed column of Table 1, including the optional High Demand Metadata Module. To assess the solution's performance, the following benchmarks were used:
  • IOzone N to N sequential
  • IOR N to 1 sequential
  • IOzone random
  • MDtest
For all the benchmarks listed above, the test bed had the clients described in Table 2 below. Since only 16 compute nodes were available for testing, when a higher number of threads was required those threads were equally distributed over the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes. Since the benchmarks support a high number of threads, a maximum value of up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects from affecting the performance results.
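The round-robin thread placement described above can be sketched as follows; this is an illustrative reconstruction (the node names and list format are assumptions), not the exact host-list script used in the study:

```python
# Build a round-robin placement of N threads over 16 compute nodes, mirroring
# the distribution described above (e.g. 32 threads -> 2 threads per node).
from collections import Counter

NODES = [f"node{i:02d}" for i in range(1, 17)]  # hypothetical node names

def placement(threads):
    """Assign thread i to node (i mod 16), i.e. round robin over the nodes."""
    return [NODES[i % len(NODES)] for i in range(threads)]

for t in (32, 64, 128, 256, 512, 1024):
    per_node = Counter(placement(t))
    assert set(per_node.values()) == {t // len(NODES)}  # spread is even
    print(t, "threads ->", t // len(NODES), "per node")
```

Writing one node name per line from `placement(t)` yields the kind of thread list / host file consumed by IOzone's `-+m` option and mpirun's `--hostfile`.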

Table 2: Client test bed

Number of client nodes: 16
Client node: C6320
Processors per client node: 2x Intel Xeon E5-2697 v4 (18 cores @ 2.30 GHz)
Memory per client node: 12x 16GiB 2400 MT/s RDIMMs
BIOS: 2.8.0
OS kernel: 3.10.0-957.10.1
GPFS version: 5.0.3

 

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads, and the results of the capacity-expanded solution (4x ME4084s + 4x ME484s) are contrasted with the large-size solution (4x ME4084s). Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. It is important to note that for GPFS that tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free. Also important to note is that, while in previous Dell EMC HPC solutions the block size for large sequential transfers was 1 MiB, GPFS was formatted with 8 MiB blocks, and therefore that value is used in the benchmark for optimal performance. That may look too large and apparently waste too much space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 256 subblocks of 32 KiB each.
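The block/subblock arithmetic above can be checked directly. A minimal sketch of why an 8 MiB block size does not waste space on small files when allocation happens in 32 KiB subblocks:

```python
# GPFS subblock allocation: an 8 MiB block divided into 256 subblocks of 32 KiB,
# so a small file consumes whole 32 KiB subblocks rather than a full 8 MiB block.
import math

BLOCK = 8 * 1024 * 1024        # 8 MiB file system block size
SUBBLOCKS_PER_BLOCK = 256
SUBBLOCK = BLOCK // SUBBLOCKS_PER_BLOCK

print(SUBBLOCK // 1024, "KiB per subblock")  # 32 KiB, as stated above

def space_used(file_size):
    """Space consumed when allocation is rounded up to whole subblocks."""
    return math.ceil(file_size / SUBBLOCK) * SUBBLOCK

print(space_used(10 * 1024) // 1024, "KiB used by a 10 KiB file")  # 32, not 8192
```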

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist

Figure 2:  N to N Sequential Performance


From the results we can observe that performance rises very fast with the number of clients used and then reaches a plateau that is stable until the maximum number of threads that IOzone allows is reached; therefore, large-file sequential performance is stable even for 1024 concurrent clients. Notice that both read and write performance benefited from doubling the number of drives. The maximum read performance was limited by the bandwidth of the two IB EDR links used on the storage nodes starting at 8 threads, and the ME4 arrays may have some extra performance available. Similarly, notice that the maximum write performance increased from 16.7 to 20.4 GB/s at 64 and 128 threads, which is closer to the ME4 arrays' maximum specs (22 GB/s).

Here it is important to remember that GPFS's preferred mode of operation is scattered, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a lower initial maximum performance, that performance is maintained fairly constant regardless of how much space is used on the file system. That is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore have the highest possible performance the HDDs can provide, but as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction in performance.
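The outer-versus-inner track effect described above can be illustrated with a toy model. At constant rotational speed, sustained transfer rate scales with how much data passes under the head per revolution, which is roughly proportional to track radius; the radii, RPM, and recording density below are illustrative assumptions, not measurements of the drives used:

```python
# Toy model of zoned bit recording: sequential throughput scales roughly with
# track radius at constant RPM. All numbers are illustrative for a 3.5" HDD.
RPM = 7200
BYTES_PER_MM = 6_000   # assumed linear recording density, bytes per mm of track

def track_mb_s(radius_mm):
    """Sustained MB/s = track length * linear density * revolutions per second."""
    circumference_mm = 2 * 3.14159 * radius_mm
    return circumference_mm * BYTES_PER_MM * (RPM / 60) / 1e6

outer = track_mb_s(45)  # outermost track (assumed radius)
inner = track_mb_s(20)  # innermost track (assumed radius)
print(f"outer track {outer:.0f} MB/s vs inner track {inner:.0f} MB/s")
# Scattered allocation averages over the whole surface from day one, trading
# peak empty-filesystem speed for performance that stays flat as space fills.
```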

 

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted with the solution without the capacity expansion.
Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. These benchmark tests used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation of those matters.

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G 

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G


Figure 3: N to 1 Sequential Performance

From the results we can observe again that the extra drives benefit read and write performance. Performance rises again very fast with the number of clients used and then reaches a plateau that is fairly stable for reads and writes all the way to the maximum number of threads used in this test. Notice that the maximum read performance was 24.8 GB/s at 16 threads, where the bottleneck was the InfiniBand EDR interface, with the ME4 arrays still having some extra performance available. From that point, read performance decreased until reaching the plateau at around 23.8 GB/s. Similarly, notice that the maximum write performance of 19.3 GB/s was reached at 8 threads, followed by a plateau.
 

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with FIO version 3.7 instead of the traditional IOzone. The intention, as explained in the previous blog, was to take advantage of a larger queue depth to investigate the maximum possible performance that ME4084 arrays can deliver (previous tests for different ME4 solutions showed that the ME4084 arrays need more IO pressure than IOzone can deliver to reach their random IO limits).

Tests executed varied from a single thread up to 512 threads, since there were not enough client cores for 1024 threads. Each thread used a different file, and the threads were assigned round robin on the client nodes. These benchmark tests used 4 KiB blocks to emulate small-block traffic, with a queue depth of 16. Results from the large-size solution and the capacity expansion are compared.

Caching effects were again minimized by setting the GPFS page pool tunable to 16GiB and using files two times that size. The first performance test section has a more complete explanation of why this is effective on GPFS.
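A fio job file matching the workload described (4 KiB random IO, queue depth 16, one file per thread) might look like the sketch below. This is a reconstruction under stated assumptions; the file location, size, and job name are illustrative, since the original job file is not included in the article:

```ini
; Hypothetical fio job approximating the 4 KiB random test described above.
[global]
ioengine=libaio      ; asynchronous IO so the queue depth takes effect
direct=1             ; bypass the client page cache
bs=4k                ; 4 KiB blocks, as in the test
iodepth=16           ; queue depth 16, as in the test
rw=randwrite         ; use rw=randread for the read pass
size=32g             ; per-thread file size (assumed for illustration)
directory=/mmfs1/perftest
numjobs=64           ; one job (and one file) per thread; varied per data point

[randtest]
```

Because `numjobs` gives each job its own file by default, this matches the "each thread was using a different file" setup described above.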

Figure 4:  N to N Random Performance

From the results we can observe that write performance starts at a high value of 29.1K IOPS and rises steadily up to 64 threads, where it seems to reach a plateau at around 40K IOPS. Read performance, on the other hand, starts at 1.4K IOPS and increases almost linearly with the number of clients used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 25.6K IOPS at 64 threads, where it seems to be close to reaching a plateau. Using more threads would require more than the 16 compute nodes to avoid resource starvation and a lower apparent performance, where the arrays could in fact maintain the performance.

 

Metadata performance with MDtest using empty files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of creates, stats, reads, and removes the solution can handle, and the results were contrasted with the large-size solution.

To properly evaluate the solution in comparison to other Dell EMC HPC storage solutions and the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work was designated to have two ME4024s. This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to four before adding another metadata module. Additional ME4024 arrays are expected to increase metadata performance linearly with each additional array, except maybe for Stat operations (and Reads for empty files): since those numbers are very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F

Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, it was decided to keep the total number of files fixed at 2 Mi files (2^21 = 2,097,152), keep the number of files per directory fixed at 1024, and vary the number of directories as the number of threads changed, as shown in Table 3.

Table 3: MDtest distribution of files on directories

Number of threads    Directories per thread    Total number of files
   1                 2048                      2,097,152
   2                 1024                      2,097,152
   4                  512                      2,097,152
   8                  256                      2,097,152
  16                  128                      2,097,152
  32                   64                      2,097,152
  64                   32                      2,097,152
 128                   16                      2,097,152
 256                    8                      2,097,152
 512                    4                      2,097,152
1024                    2                      2,097,152
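The values in Table 3 follow directly from keeping the total file count fixed at 2^21 with 1024 files per directory; a quick sketch verifying the distribution:

```python
# Verify Table 3: total files fixed at 2**21, 1024 files per directory,
# so directories per thread halve each time the thread count doubles.
TOTAL_FILES = 2**21        # 2,097,152
FILES_PER_DIR = 1024

for threads in [2**i for i in range(11)]:  # 1 .. 1024
    dirs_per_thread = TOTAL_FILES // (FILES_PER_DIR * threads)
    # every combination still covers the full fixed file count
    assert dirs_per_thread * FILES_PER_DIR * threads == TOTAL_FILES
    print(f"{threads:5d} threads -> {dirs_per_thread:4d} directories per thread")
```

The `dirs_per_thread` value is what gets passed to MDtest as the `$Directories` variable (`-b`) in the command above.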




Figure 5: Metadata Performance - Empty Files

First, notice that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar, and people tend to handle and remember numbers based on powers of 10 better.
The system gets very good results, with Stat and Read operations reaching their peak values at 64 threads with almost 11M op/s and 4.7M op/s respectively. Removal operations attained a maximum of 170.6K op/s at 16 threads, and Create operations achieved their peak at 32 threads with 222.1K op/s. Stat and Read operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s for Stats and 2M op/s for Reads. Create and Removal are more stable once they reach a plateau, remaining above 140K op/s for Removal and 120K op/s for Create. Notice that, as expected, the extra drives do not much affect most of the metadata operations on empty files.
 

Metadata performance with MDtest using 4 KiB files

This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used. 
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K

Figure 6:  Metadata Performance - Small files (4K)

The system gets very good results for Stat and Removal operations, reaching their peak values at 256 threads with 8.2M op/s and 400K op/s respectively. Read operations attained a maximum of 44.8K op/s and Create operations achieved their peak with 68.1K op/s, both at 512 threads. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s for Stats and 280K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows. As can be observed, the extra drives of the capacity expansion provide only marginal changes in metadata performance.
Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase for each operation. Unless a whole file fits inside its inode, the data targets on the ME4084s will be used to store the 4K files, limiting performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside, and any file bigger than that will use data targets.
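The data-in-inode limit mentioned above can be sketched numerically. The exact metadata overhead inside a Spectrum Scale inode varies, so the overhead value below is an illustrative assumption consistent with the "around 3 KiB" figure in the text:

```python
# Whether a small file's data fits inside a 4 KiB inode (Spectrum Scale can
# store file data in the inode when it fits alongside the inode metadata).
INODE_SIZE = 4096
METADATA_OVERHEAD = 1024   # assumed bytes of inode metadata (illustrative)

def stored_in_inode(file_size):
    """True if the file data fits in the inode and never touches data targets."""
    return file_size <= INODE_SIZE - METADATA_OVERHEAD

print(stored_in_inode(3 * 1024))  # a ~3 KiB file fits in the inode
print(stored_in_inode(4 * 1024))  # a 4 KiB file spills to the ME4084 data targets
```

This is why the 4 KiB MDtest runs exercise the data targets as well as the metadata module, unlike the empty-file runs.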
 


Conclusions and Future Work

The solution with expanded capacity was able to improve performance, not only for random accesses but also for sequential performance. That was expected, since scattered mode behaves like randomized accesses and having more disks allows the improvement. That performance, summarized in Table 4, is expected to be stable from an empty file system until it is almost full. Furthermore, the solution scales in capacity and performance linearly as more storage node modules are added, and a similar performance increase can be expected from the optional High Demand Metadata Module. This solution provides HPC customers with a very reliable parallel file system used by many Top500 HPC clusters. In addition, it provides exceptional search capabilities, advanced monitoring and management, and adding optional gateways allows file sharing via ubiquitous standard protocols like NFS, SMB, and others to as many clients as needed.

Table 4: Peak and sustained performance

Large sequential N clients to N files:
    Peak: Write 20.4 GB/s, Read 24.2 GB/s; Sustained: Write 20.3 GB/s, Read 24 GB/s
Large sequential N clients to single shared file:
    Peak: Write 19.3 GB/s, Read 24.8 GB/s; Sustained: Write 19.3 GB/s, Read 23.8 GB/s
Random small blocks N clients to N files:
    Peak: Write 40K IOPS, Read 25.6K IOPS; Sustained: Write 40.0K IOPS, Read 19.3K IOPS
Metadata Create empty files: Peak 169.4K IOPS; Sustained 123.5K IOPS
Metadata Stat empty files: Peak 11M IOPS; Sustained 3.2M IOPS
Metadata Read empty files: Peak 4.7M IOPS; Sustained 2.4M IOPS
Metadata Remove empty files: Peak 170.6K IOPS; Sustained 156.5K IOPS
Metadata Create 4KiB files: Peak 68.1K IOPS; Sustained 68.1K IOPS
Metadata Stat 4KiB files: Peak 8.2M IOPS; Sustained 3M IOPS
Metadata Read 4KiB files: Peak 44.8K IOPS; Sustained 44.8K IOPS
Metadata Remove 4KiB files: Peak 400K IOPS; Sustained 280K IOPS



Since the solution is intended to be released with Cascade Lake CPUs and faster RAM, once the system has the final configuration some performance spot checks will be done. Testing the optional High Demand Metadata Module with at least 2x ME4024s and 4 KiB files is also needed to better document how metadata performance scales when data targets are involved. In addition, performance for the gateway nodes will be measured and reported, along with any relevant results from the spot checks, in a new blog or a white paper. Finally, more solution components are planned to be tested and released to provide even more capabilities.

 

Article Properties


Affected Product

Dell EMC Ready Solution Resources

Last Published Date

26 Sept 2023

Version

5

Article Type

Solution