
Article Number: 000130558


Dell EMC Ready Solution for HPC PixStor Storage - NVMe Tier

Summary: Blog for an HPC Storage Solution component, including its architecture and a performance evaluation.

Article Content


Symptoms

Authored by Mario Gallegos of the HPC and AI Innovation Lab in June 2020.
Blog for an HPC Storage Solution component, including its architecture and a performance evaluation.

Resolution

Dell EMC Ready Solution for HPC PixStor Storage

NVMe Tier

Table of Contents

Introduction
Solution Architecture
Solution Components
Performance Characterization
Sequential IOzone Performance N clients to N files
Sequential IOR Performance N clients to 1 file
Random small blocks IOzone Performance N clients to N files
Metadata performance with MDtest using 4 KiB files
Conclusions and Future Work

 

Introduction

Today’s HPC environments have increased demands for very high-speed storage, and with higher core-count CPUs, faster networks, and larger memory, storage has become the bottleneck in many workloads. Those high-demand HPC requirements are typically covered by Parallel File Systems (PFS), which provide concurrent access to a single file, or to a set of files, from multiple nodes, distributing data very efficiently and securely to multiple LUNs across several servers. Those file systems are normally based on spinning media to provide the highest capacity at the lowest cost. However, more and more often the speed and latency of spinning media cannot keep up with the demands of many modern HPC workloads, requiring the use of flash technology in the form of burst buffers, faster tiers, or even very fast scratch, whether local or distributed. The DellEMC Ready Solution for HPC PixStor Storage uses NVMe nodes to cover such new high-bandwidth demands while remaining flexible, scalable, efficient, and reliable.

Solution Architecture

This blog is part of a series on Parallel File System (PFS) solutions for HPC environments, in particular the DellEMC Ready Solution for HPC PixStor Storage, where DellEMC PowerEdge R640 servers with NVMe drives are used as a fast, flash-based tier.
The PixStor PFS solution includes the widespread General Parallel File System, also known as Spectrum Scale, as its file system component, in addition to many other ArcaStream software components that provide advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities, and more.

The NVMe nodes presented in this blog provide a very high-performance, flash-based tier for the PixStor solution. Performance and capacity for this NVMe tier can be scaled out by adding NVMe nodes. Increased capacity is provided by selecting the appropriate NVMe devices supported in the PowerEdge R640.

Figure 1 presents the reference architecture, depicting a solution with 4 NVMe nodes plus the high demand metadata module, which handles all metadata in the configuration tested, since these NVMe nodes were used as data-only storage targets. However, the NVMe nodes can also be used to store data and metadata, or even as a faster flash alternative to the high demand metadata module if extreme metadata demands call for it. Those configurations for the NVMe nodes were not tested as part of this work, but will be tested in the future.

 


Figure 1 Reference Architecture

Solution Components

This solution uses the latest 2nd generation Intel Xeon Scalable CPUs, a.k.a. Cascade Lake, and the fastest RAM available for them (2933 MT/s), except for the management nodes, which are kept cost effective. In addition, the solution was updated to the latest version of PixStor (5.1.3.1), which supports RHEL 7.7 and OFED 5.0; these will be the supported software versions at release time.

Each NVMe node has eight Dell P4610 devices; each pair of servers has its sixteen devices configured as eight RAID 10 devices using an NVMe over Fabrics solution, providing data redundancy not only at the device level but also at the server level. In addition, when any data goes into or out of one of those RAID 10 devices, all 16 drives in both servers are used, increasing the bandwidth of the access to that of all the drives. Therefore, the only restriction for these components is that they must be sold and used in pairs. All the NVMe drives supported by the PowerEdge R640 can be used in this solution; however, the P4610 has a sequential bandwidth of 3200 MB/s for both reads and writes, as well as high random IOPS specs, which are useful when trying to estimate the number of pairs needed to meet the requirements of this flash tier.

Each R640 server has two Mellanox ConnectX-6 Single Port VPI HDR100 HCAs that are used as EDR 100 Gb IB connections. However, the NVMe nodes are ready to support HDR100 speeds when used with HDR cables and switches; testing HDR100 on these nodes is deferred as part of the HDR100 update for the whole PixStor solution. Both CX6 interfaces are used to sync data for the RAID 10 (NVMe over Fabrics) and as the connectivity for the file system. In addition, they provide hardware redundancy at the adapter, port, and cable level. For redundancy at the switch level, dual port CX6 VPI adapters are required, but they need to be procured as S&P components.
To characterize the performance of the NVMe nodes, only the high demand metadata module and the NVMe nodes from the system depicted in Figure 1 were used.

Table 1 lists the main components for the solution. From the list of drives supported in the ME4024, 960 GB SSDs were used for metadata and for the performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations. All the NVMe devices supported on the PowerEdge R640 will be supported for the NVMe nodes.

Table 1 Components to be used at release time and those used in the test bed

Internal Connectivity:
  Dell Networking S3048-ON Gigabit Ethernet

Data Storage Subsystem:
  1x to 4x Dell EMC PowerVault ME4084
  1x to 4x Dell EMC PowerVault ME484 (one per ME4084)
  80 – 12TB 3.5" NL SAS3 HDD drives (options: 900GB @15K, 1.2TB @10K, 1.8TB @10K, 2.4TB @10K, 4TB NLS, 8TB NLS, 10TB NLS, 12TB NLS)
  8 LUNs, linear 8+2 RAID 6, chunk size 512KiB
  4x 1.92TB SAS3 SSDs for Metadata – 2x RAID 1 (or 4 global HDD spares, if the Optional High Demand Metadata Module is used)

Optional High Demand Metadata Storage Subsystem:
  1x to 2x Dell EMC PowerVault ME4024 (4x ME4024 if needed, Large config only)
  24x 960GB 2.5" SSD SAS3 drives (options: 480GB, 960GB, 1.92TB, 3.84TB)
  12 LUNs, linear RAID 1

RAID Storage Controllers:
  12 Gbps SAS

Processor:
  NVMe Nodes: 2x Intel Xeon Gold 6230 2.1G, 20C/40T, 10.4GT/s, 27.5M Cache, Turbo, HT (125W), DDR4-2933
  High Demand Metadata / Storage / Management Nodes: 2x Intel Xeon Gold 5220 2.2G, 18C/36T, 10.4GT/s, 24.75M Cache, Turbo, HT (125W), DDR4-2666

Memory:
  NVMe Nodes: 12x 16GiB 2933 MT/s RDIMMs (192 GiB)
  High Demand Metadata / Storage / Management Nodes: 12x 16GB DIMMs, 2666 MT/s (192 GiB)

Operating System:
  CentOS 7.7

Kernel version:
  3.10.0-1062.12.1.el7.x86_64

PixStor Software:
  5.1.3.1

File system Software:
  Spectrum Scale (GPFS) 5.0.4-3 with NVMesh 2.0.1

High Performance Network Connectivity:
  NVMe nodes: 2x ConnectX-6 InfiniBand using EDR/100 GbE
  Other servers: Mellanox ConnectX-5 InfiniBand EDR/100 GbE, and 10 GbE

High Performance Switch:
  2x Mellanox SB7800

OFED Version:
  Mellanox OFED 5.0-2.1.8.0

Local Disks (OS & Analysis/monitoring):
  All servers except NVMe and Management Nodes: 3x 480GB SSD SAS3 (RAID1 + HS) for OS, PERC H730P RAID controller
  NVMe Nodes: 3x 480GB SSD SAS3 (RAID1 + HS) for OS, PERC H740P RAID controller
  Management Node: 3x 480GB SSD SAS3 (RAID1 + HS) for OS, PERC H740P RAID controller

Systems Management:
  iDRAC 9 Enterprise + DellEMC OpenManage

 

Performance Characterization

To characterize this new Ready Solution component, the following benchmarks were used:

· IOzone N to N sequential
· IOR N to 1 sequential
· IOzone random
· MDtest

For all benchmarks listed above, the test bed had the clients described in Table 2 below. Since only 16 compute nodes were available for testing, when a higher number of threads was required those threads were distributed equally over the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes available. Since some benchmarks support a high number of threads, a maximum value of up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects from affecting the performance results.
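As an illustration of that distribution, the following is a minimal sketch (not the script used in the original work; host names and paths are assumptions) of how a round-robin client list for IOzone's -+m option can be generated. IOzone expects one line per thread with the client name, the working directory, and the path to the iozone executable; the my_hosts.$Threads files used for the MPI-based benchmarks are analogous but follow OpenMPI's hostfile format.

#!/bin/bash
# Hypothetical node names client01..client16 and paths; adjust to the actual test bed.
Threads=64
Nodes=16
: > ./threadlist
for i in $(seq 0 $((Threads - 1))); do
    node=$(printf "client%02d" $(( (i % Nodes) + 1 )))
    echo "${node} /mmfs1/perftest /usr/bin/iozone" >> ./threadlist
done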

 

Table 2 Client test bed

Number of Client nodes: 16
Client node: C6320
Processors per client node: 2x Intel(R) Xeon(R) E5-2697 v4, 18 Cores @ 2.30GHz
Memory per client node: 8x 16GiB 2400 MT/s RDIMMs (128 GiB)
BIOS: 2.8.0
OS Kernel: 3.10.0-957.10.1
File system Software: Spectrum Scale (GPFS) 5.0.4-3 with NVMesh 2.0.1

 

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads in increments of powers of two.

Caching effects on the servers were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. It is important to note that for GPFS that tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free. Also important to note is that, while in previous DellEMC HPC solutions the block size for large sequential transfers is 1 MiB, GPFS was formatted with 8 MiB blocks, and therefore that value is used in the benchmark for optimal performance. That may look too large and seem to waste too much space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 256 subblocks of 32 KiB each.
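For reference, the following is a minimal sketch (assumed commands, not captured from the test bed; "gpfs01" is a hypothetical file system device name) of how the page pool can be set and how the block and subblock sizes can be verified on a Spectrum Scale cluster:

mmchconfig pagepool=16G -i     # cap the GPFS data cache at 16 GiB, effective immediately
mmlsfs gpfs01 -B               # report the file system block size (8 MiB here)
mmlsfs gpfs01 -f               # report the subblock (fragment) size (32 KiB here)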

The following commands were used to execute the benchmark for writes and reads, where $Threads was the variable with the number of threads used (1 to 1024, incremented in powers of two), and threadlist was the file that assigned each thread to a different node, using round robin to spread them homogeneously across the 16 compute nodes.

To avoid any possible data caching effects from the clients, the total data size of the files was twice the total amount of RAM of the clients used. That is, since each client has 128 GiB of RAM, for thread counts equal to or above 16 the aggregate file size was 4096 GiB, divided evenly by the number of threads (the variable $Size below was used to manage the per-thread value). For cases with fewer than 16 threads (which implies each thread was running on a different client), the file size was fixed at twice the amount of memory per client, or 256 GiB.
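The following is a minimal sketch (an assumption, not the original script) of how $Size, in GiB, can be derived from those rules before invoking IOzone:

if [ "$Threads" -ge 16 ]; then
    Size=$(( 4096 / Threads ))   # 2x the aggregate 2048 GiB of client RAM, split across threads
else
    Size=256                     # 2x the 128 GiB of RAM of each client
fi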

iozone -i0 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist
iozone -i1 -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./threadlist


Figure 2  N to N Sequential Performance

From the results we can observe that write performance rises with the number of threads used and then reaches a plateau at around 64 threads; read performance also rises fast and plateaus at around 128 threads, staying stable until the maximum number of threads that IOzone allows is reached, so large-file sequential performance is stable even for 1024 concurrent clients. Write performance drops about 10% at 1024 threads. However, since the client cluster has fewer cores than that number of threads, it is uncertain whether the drop in performance is due to swapping and similar overhead not observed with spinning media (since NVMe latency is very low compared to spinning media), or whether the RAID 10 data synchronization is becoming a bottleneck; more clients are needed to clarify that point. An anomaly was observed for reads at 64 threads, where performance did not scale at the rate observed for previous data points, and then at the next data point it moves to a value very close to the sustained performance. More testing is needed to find the reason for that anomaly, but it is out of the scope of this blog.

The maximum read performance was below the theoretical performance of the NVMe devices (~102 GB/s) and below the performance of the EDR links, even assuming that one link was mostly used for NVMe over Fabrics traffic (4x EDR BW ~96 GB/s).
However, this is not a surprise, since the hardware configuration is not balanced with respect to the NVMe devices and IB HCAs under each CPU socket. One CX6 adapter is under CPU1, while CPU2 has all the NVMe devices and the second CX6 adapter. Any storage traffic using the first HCA must cross the UPI links to access the NVMe devices. In addition, any core used in CPU1 must access devices or memory assigned to CPU2, so data locality suffers and the UPI links are used. That can explain the reduction of the maximum performance, compared to the maximum performance of the NVMe devices or the line speed of the CX6 HCAs. The alternative to fix that limitation is a balanced hardware configuration, which implies reducing density to half by using an R740 with four x16 slots and two x16 PCIe expanders, distributing the NVMe devices equally across the two CPUs and placing one CX6 HCA under each CPU.

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads, since there were not enough cores for 1024 or more threads. These tests used 8 MiB blocks for optimal performance; the previous section has a more complete explanation of why that matters.

Data caching effects were minimized by setting the GPFS page pool tunable to 16GiB and making the total file size twice the total amount of RAM of the clients used. That is, since each client has 128 GiB of RAM, for thread counts equal to or above 16 the file size was 4096 GiB, divided evenly among the number of threads (the variable $Size below was used to manage the per-thread value). For cases with fewer than 16 threads (which implies each thread was running on a different client), the total file size was twice the amount of memory per client times the number of threads; in other words, each thread was asked to use 256 GiB.

The following commands were used to execute the benchmark for writes and reads, where $Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that assigned each thread to a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b ${Size}G

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b ${Size}G
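As an illustration, the runs can be driven with a simple wrapper like the following minimal sketch (an assumption, not the original harness), which reuses the $Size rule described above and the paths from the commands shown:

for Threads in 1 2 4 8 16 32 64 128 256 512; do
    # Per-thread segment size in GiB, per the rules above
    if [ "$Threads" -ge 16 ]; then Size=$(( 4096 / Threads )); else Size=256; fi
    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b ${Size}G
done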

 


Figure 3  N to 1 Sequential Performance

From the results we can observe that read and write performance are high regardless of the locking overhead implied by all threads accessing the same file. Performance again rises very fast with the number of threads used and then reaches a plateau that is relatively stable for reads and writes all the way to the maximum number of threads used in this test. Notice that the maximum read performance was 51.6 GB/s at 512 threads, but the plateau in performance is reached at about 64 threads. Similarly, notice that the maximum write performance of 34.5 GB/s was achieved at 16 threads, and the plateau is maintained up to the maximum number of threads used.

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads in increments of powers of two.

Each thread used a different file, and the threads were assigned round robin on the client nodes. This benchmark used 4 KiB blocks to emulate small-block traffic, with a queue depth of 16.

Caching effects were again minimized by setting the GPFS page pool tunable to 16GiB, and to avoid any possible data caching effects from the clients, the total data size of the files was twice the total amount of RAM of the clients used. That is, since each client has 128 GiB of RAM, for thread counts equal to or above 16 the file size was 4096 GiB divided by the number of threads (the variable $Size below was used to manage that value). For cases with fewer than 16 threads (which implies each thread was running on a different client), the file size was fixed at twice the amount of memory per client, or 256 GiB.

iozone -i0 -I -c -e -w -r 8M -s ${Size}G -t $Threads -+n -+m ./nvme_threadlist      <= Create the files sequentially
iozone -i2 -I -c -O -w -r 4k -s ${Size}G -t $Threads -+n -+m ./nvme_threadlist      <= Perform the random reads and writes

 


Figure 4  N to N Random Performance

From the results we can observe that write performance starts at a high value of 6K IOPS and rises steadily up to 1024 threads, where it seems it would reach a plateau at over 5M IOPS if more threads could be used. Read performance, on the other hand, starts at 5K IOPS and increases steadily with the number of threads used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 7.3M IOPS at 1024 threads without signs of reaching a plateau. Using more threads would require more than the 16 compute nodes, to avoid the resource starvation and excessive swapping that can lower the apparent performance, where the NVMe nodes could in fact maintain the performance.

Metadata performance with MDtest using 4 KiB files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of creates, stats, reads, and removes the solution can handle, and results were contrasted with those of the Large size solution.

The optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work is designated to have two ME4024s. The reason for using that metadata module is that these NVMe nodes are currently used as storage targets for data only. However, the nodes could be used to store data and metadata, or even as a flash alternative to the high demand metadata module, if extreme metadata demands call for it. Those configurations were not tested as part of this work.

Since the same High Demand Metadata module has been used for previous benchmarking of the DellEMC Ready Solution for HPC PixStor Storage, metadata results are expected to be very similar to previous blog results. For that reason, the study with empty files was not done; instead, 4 KiB files were used. Since 4 KiB files cannot fit into an inode along with the metadata information, the NVMe nodes are used to store the data of each file. Therefore, MDtest can give a rough idea of small-file performance for reads and for the rest of the metadata operations.
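As a quick, hedged check (assuming a hypothetical file system device name "gpfs01"), the inode size that determines how much file data can be stored in-inode can be confirmed with:

mmlsfs gpfs01 -i     # reports the inode size in bytes (4096 is the typical default)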

The following command was used to execute the benchmark, where $Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that assigned each thread to a different node, using round robin to spread them homogeneously across the 16 compute nodes. As with the sequential IOR benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K

Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, it was decided to keep the total number of files fixed at 2 Mi files (2^21 = 2,097,152), the number of files per directory fixed at 1024, and to vary the number of directories as the number of threads changed, as shown in Table 3.
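A minimal sketch (an assumption, not the original script) of how the $Directories value used in the command above can be computed so that the total is always 2,097,152 files:

TotalFiles=$(( 2**21 ))                                  # 2,097,152 files in total
FilesPerDir=1024
Directories=$(( TotalFiles / FilesPerDir / Threads ))    # e.g. 2048 for 1 thread, 4 for 512 threads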

Table 3 MDtest distribution of files on directories

Number of Threads    Number of directories per thread    Total number of files
1                    2048                                2,097,152
2                    1024                                2,097,152
4                    512                                 2,097,152
8                    256                                 2,097,152
16                   128                                 2,097,152
32                   64                                  2,097,152
64                   32                                  2,097,152
128                  16                                  2,097,152
256                  8                                   2,097,152
512                  4                                   2,097,152

 


Figure 5  Metadata Performance – 4 KiB Files

First, notice that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a linear scale. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar and people tend to handle and remember numbers based on powers of 10 better.

The system gets very good results, as previously reported, with Stat operations reaching their peak value at 64 threads with almost 6.9M op/s and then dropping for higher thread counts before reaching a plateau. Create operations reach their maximum of 113K op/s at 512 threads, so performance is expected to continue increasing if more client nodes (and cores) are used. Read and Remove operations attained their maximums at 128 threads, peaking at almost 705K op/s for Reads and 370K op/s for Removes, and then they reach plateaus. Stat operations have more variability, but once they reach their peak value, performance does not drop below 3.2M op/s. Create and Remove are more stable once they reach a plateau and remain above 265K op/s for Remove and 113K op/s for Create. Finally, reads reach a plateau with performance above 265K op/s.

 

Conclusions and Future Work

The NVMe nodes are an important addition to the HPC storage solution, providing a very high-performance tier with good density, very high random-access performance, and very high sequential performance. Furthermore, the solution scales out in capacity and performance linearly as more NVMe nodes are added. The performance of the NVMe nodes is summarized in Table 4; it is expected to be stable, and those values can be used to estimate performance for a different number of NVMe nodes.
However, keep in mind that each pair of NVMe nodes provides half of any number shown in Table 4 (a simple scaling example is shown after the table).
This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities, advanced monitoring and management, and optional gateways allow file sharing via ubiquitous standard protocols like NFS, SMB, and others to as many clients as needed.

Table 4  Peak & Sustained Performance for 2 Pairs of NVMe nodes

 

Large Sequential N clients to N files:
  Peak: 40.9 GB/s write, 84.5 GB/s read
  Sustained: 40 GB/s write, 81 GB/s read

Large Sequential N clients to single shared file:
  Peak: 34.5 GB/s write, 51.6 GB/s read
  Sustained: 31.5 GB/s write, 50 GB/s read

Random Small blocks N clients to N files:
  Peak: 5.06M IOPS write, 7.31M IOPS read
  Sustained: 5M IOPS write, 7.3M IOPS read

Metadata Create 4KiB files:
  Peak: 113K IOPS; Sustained: 113K IOPS

Metadata Stat 4KiB files:
  Peak: 6.88M IOPS; Sustained: 3.2M IOPS

Metadata Read 4KiB files:
  Peak: 705K IOPS; Sustained: 500K IOPS

Metadata Remove 4KiB files:
  Peak: 370K IOPS; Sustained: 265K IOPS
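As an illustration of that scaling rule, the following minimal sketch (a hypothetical example, assuming the linear scaling described above) estimates peak sequential performance for a configuration with three pairs (six NVMe nodes):

Pairs=3
echo "Estimated peak sequential write: $(echo "$Pairs * 40.9 / 2" | bc -l) GB/s"   # 61.35 GB/s
echo "Estimated peak sequential read:  $(echo "$Pairs * 84.5 / 2" | bc -l) GB/s"   # 126.75 GB/s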

 

Since the NVMe nodes were only used for data, possible future work includes using them for data and metadata, giving a self-contained flash-based tier with better metadata performance due to the higher bandwidth and lower latency of NVMe devices compared to SAS3 SSDs behind RAID controllers. Alternatively, if a customer has extremely high metadata demands and requires a denser solution than the high demand metadata module can provide, some or all of the distributed RAID 10 devices can be used for metadata in the same way that the RAID 1 devices on the ME4024s are used now.
Another blog to be released soon will characterize the PixStor Gateway nodes, which allow connecting the PixStor solution to other networks using NFS or SMB protocols and can scale out performance. Also, the solution will be updated to HDR100 very soon, and another blog is expected to cover that work.

 

Article Properties


Affected Product

High Performance Computing Solution Resources

Date of Last Publication

21 Feb 2021

Version

3

Article Type

Solution