
Dell EMC Ready Solution for HPC PixStor Storage - Capacity Expansion



Authored by Mario Gallegos of HPC and AI Innovation Lab in April 2020

Table of Contents

  1. Introduction
    1. Solution Architecture
    2. Solution Components
  2. Performance Characterization
    1. Sequential IOzone Performance N clients to N files
    2. Sequential IOR Performance N clients to 1 file
    3. Random small blocks IOzone Performance N clients to N files
    4. Metadata performance with MDtest using empty files
    5. Metadata performance with MDtest using 4 KiB files
  3. Conclusions and Future Work


 


Introduction

Today’s HPC environments have growing demands for very high-speed storage that frequently also requires high capacity and distributed access via several standard protocols like NFS, SMB, and others. Those demanding HPC requirements are typically covered by parallel file systems, which provide concurrent access to a single file or a set of files from multiple nodes, distributing data very efficiently and securely to multiple LUNs across several servers.

 

Solution Architecture

This blog is a continuation of the Parallel File System (PFS) solution for HPC environments, the Dell EMC Ready Solution for HPC PixStor Storage, in which PowerVault ME484 EBOD arrays are used to increase the capacity of the solution. Figure 1 presents the reference architecture, depicting the SAS capacity-expansion additions to the existing PowerVault ME4084 storage arrays.
The PixStor solution includes the widespread General Parallel File System, also known as Spectrum Scale, as the PFS component, in addition to many other Arcastream software components such as advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities, and many others.


Figure 1: Reference Architecture.

 

Solution Components

This solution is planned to be released with the latest 2nd Generation Intel Xeon Scalable CPUs (a.k.a. Cascade Lake), and some of the servers will use the fastest RAM available to them (2933 MT/s). However, due to the hardware available for the solution prototype, servers with 1st Generation Intel Xeon Scalable CPUs (a.k.a. Skylake) and, in some cases, slower RAM were used to characterize this system. Since the bottleneck of the solution is at the SAS controllers of the Dell EMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with the envisioned Cascade Lake CPUs and faster RAM. In addition, the solution was updated to the latest version of PixStor (5.1.1.4), which supports RHEL 7.7 and OFED 4.7, for characterizing the system.

Because of the situation described above, Table 1 lists the main components of the solution. Where discrepancies were introduced, the At Release column has the components available to customers at release time, and the Test Bed column has the components actually used for characterizing the performance of the solution. The drives listed for data (12TB NL SAS) and metadata (960GB SSD) are the ones used for performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations.

Finally, for completeness, the list of possible data HDDs and metadata SSDs is included, based on the drives enumerated in the Dell EMC PowerVault ME4 support matrix, available online.

Table 1: Components used at release time and those used in the test bed

Internal Connectivity (At Release and Test Bed):
    Dell Networking S3048-ON Gigabit Ethernet

Data Storage Subsystem (At Release and Test Bed):
    1x to 4x Dell EMC PowerVault ME4084
    1x to 4x Dell EMC PowerVault ME484 (one per ME4084)
    80x 12TB 3.5" NL SAS3 HDDs (options: 900GB @15K, 1.2TB @10K, 1.8TB @10K, 2.4TB @10K, 4TB NLS, 8TB NLS, 10TB NLS, 12TB NLS)
    8 LUNs, linear 8+2 RAID 6, chunk size 512KiB
    4x 1.92TB SAS3 SSDs for metadata, as 2x RAID 1 (or as 4 global HDD spares if the optional High Demand Metadata Module is used)

Optional High Demand Metadata Storage Subsystem (At Release and Test Bed):
    1x to 2x Dell EMC PowerVault ME4024 (4x ME4024 if needed, Large config only)
    24x 960GB 2.5" SAS3 SSDs (options: 480GB, 960GB, 1.92TB)
    12 LUNs, linear RAID 1

RAID Storage Controllers (At Release and Test Bed):
    12 Gbps SAS

Capacity as configured (At Release and Test Bed):
    Raw: 8064 TB (7334 TiB or 7.16 PiB)
    Formatted: ~6144 TB (5588 TiB or 5.46 PiB)

Processor:
    Gateway
        At Release: 2x Intel Xeon Gold 6230, 2.1 GHz, 20C/40T, 10.4 GT/s, 27.5M cache, Turbo, HT (125W), DDR4-2933
        Test Bed: N/A
    High Demand Metadata (At Release and Test Bed): 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores
    Storage Node (At Release and Test Bed): 2x Intel Xeon Gold 6136 @ 3.0 GHz, 12 cores
    Management Node
        At Release: 2x Intel Xeon Gold 5220, 2.2 GHz, 18C/36T, 10.4 GT/s, 24.75M cache, Turbo, HT (125W), DDR4-2666
        Test Bed: 2x Intel Xeon Gold 5118 @ 2.30 GHz, 12 cores

Memory:
    Gateway
        At Release: 12x 16GiB 2933 MT/s RDIMMs (192 GiB)
        Test Bed: N/A
    High Demand Metadata (At Release and Test Bed): 24x 16GiB 2666 MT/s RDIMMs (384 GiB)
    Storage Node (At Release and Test Bed): 24x 16GiB 2666 MT/s RDIMMs (384 GiB)
    Management Node
        At Release: 12x 16GiB 2666 MT/s DIMMs (192 GiB)
        Test Bed: 12x 8GiB 2666 MT/s RDIMMs (96 GiB)

Operating System:
    At Release: Red Hat Enterprise Linux 7.6
    Test Bed: Red Hat Enterprise Linux 7.7

Kernel version:
    At Release: 3.10.0-957.12.2.el7.x86_64
    Test Bed: 3.10.0-1062.9.1.el7.x86_64

PixStor Software:
    At Release: 5.1.0.0
    Test Bed: 5.1.1.4

Spectrum Scale (GPFS):
    At Release: 5.0.3
    Test Bed: 5.0.4-2

High Performance Network Connectivity:
    At Release: Mellanox ConnectX-5 dual-port InfiniBand EDR/100 GbE, and 10 GbE
    Test Bed: Mellanox ConnectX-5 InfiniBand EDR

High Performance Switch:
    At Release: 2x Mellanox SB7800 (HA, redundant)
    Test Bed: 1x Mellanox SB7700

OFED Version:
    At Release: Mellanox OFED-4.6-1.0.1.0
    Test Bed: Mellanox OFED-4.7-3.2.9

Local Disks (OS & Analysis/monitoring):
    At Release:
        All servers except Management node: 3x 480GB SAS3 SSDs (RAID 1 + HS) for OS; PERC H730P RAID controller
        Management node: 3x 480GB SAS3 SSDs (RAID 1 + HS) for OS; PERC H740P RAID controller
    Test Bed:
        All servers except Management node: 2x 300GB 15K SAS3 HDDs (RAID 1) for OS; PERC H330 RAID controller
        Management node: 5x 300GB 15K SAS3 HDDs (RAID 5) for OS & Analysis/monitoring; PERC H740P RAID controller

Systems Management (At Release and Test Bed):
    iDRAC 9 Enterprise + Dell EMC OpenManage
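The raw and formatted capacity figures in Table 1 follow from simple unit arithmetic between decimal terabytes (as drives are quoted) and binary TiB/PiB. A quick sketch, under the assumption that the 8064 TB raw figure corresponds to 672 x 12 TB HDDs (8 enclosures x 84 slots):

```python
# Convert the Table 1 capacity figures between decimal TB and binary TiB/PiB.
# Drive count is an assumption for illustration: 8 enclosures x 84 slots = 672 HDDs.
TB = 10**12   # bytes in a decimal terabyte
TiB = 2**40   # bytes in a binary tebibyte

raw_bytes = 672 * 12 * TB
raw_tib = raw_bytes / TiB
print(f"Raw: {672 * 12} TB = {raw_tib:.0f} TiB = {raw_tib / 1024:.2f} PiB")

# Formatted capacity after RAID 6 (8+2) and file system overhead, as reported.
fmt_tib = 6144 * TB / TiB
print(f"Formatted: ~6144 TB = {fmt_tib:.0f} TiB = {fmt_tib / 1024:.2f} PiB")
```

Running this reproduces the 7334 TiB / 7.16 PiB raw and 5588 TiB / 5.46 PiB formatted values in the table.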

 

Performance Characterization

To characterize this new Ready Solution, we used the hardware specified in the Test Bed column of Table 1, including the optional High Demand Metadata Module. To assess the solution's performance, the following benchmarks were used:
  • IOzone N to N sequential
  • IOR N to 1 sequential
  • IOzone random
  • MDtest
For all the benchmarks listed above, the test bed had the clients described in Table 2 below. Since only 16 compute nodes were available for testing, when a higher number of threads was required those threads were equally distributed over the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes. Since the benchmarks support a high number of threads, a maximum value of up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects from affecting the performance results.
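The round-robin thread placement described above can be sketched as follows; this is an illustrative reconstruction (the node names and list format are assumptions), not the exact host-list script used in the study:

```python
# Build a round-robin placement of N threads over 16 compute nodes, mirroring
# the distribution described above (e.g. 32 threads -> 2 threads per node).
from collections import Counter

NODES = [f"node{i:02d}" for i in range(1, 17)]  # hypothetical node names

def placement(threads):
    """Assign thread i to node (i mod 16), i.e. round robin over the nodes."""
    return [NODES[i % len(NODES)] for i in range(threads)]

for t in (32, 64, 128, 256, 512, 1024):
    per_node = Counter(placement(t))
    assert set(per_node.values()) == {t // len(NODES)}  # spread is even
    print(t, "threads ->", t // len(NODES), "per node")
```

Writing one node name per line from `placement(t)` yields the kind of thread list / host file consumed by IOzone's `-+m` option and mpirun's `--hostfile`.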

Table 2: Client test bed

Number of client nodes: 16
Client node: C6320
Processors per client node: 2x Intel Xeon E5-2697 v4 (18 cores @ 2.30 GHz)
Memory per client node: 12x 16GiB 2400 MT/s RDIMMs
BIOS: 2.8.0
OS kernel: 3.10.0-957.10.1
GPFS version: 5.0.3

 

Sequential IOzone Performance N clients to N files

Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads, and the results of the capacity-expanded solution (4x ME4084s + 4x ME484s) are contrasted with the large-size solution (4x ME4084s). Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. It is important to note that for GPFS that tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free. Also important to note is that, while in previous Dell EMC HPC solutions the block size for large sequential transfers was 1 MiB, GPFS was formatted with 8 MiB blocks, and therefore that value is used in the benchmark for optimal performance. That may look too large and apparently waste too much space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 256 subblocks of 32 KiB each.
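The block/subblock arithmetic above can be checked directly. A minimal sketch of why an 8 MiB block size does not waste space on small files when allocation happens in 32 KiB subblocks:

```python
# GPFS subblock allocation: an 8 MiB block divided into 256 subblocks of 32 KiB,
# so a small file consumes whole 32 KiB subblocks rather than a full 8 MiB block.
import math

BLOCK = 8 * 1024 * 1024        # 8 MiB file system block size
SUBBLOCKS_PER_BLOCK = 256
SUBBLOCK = BLOCK // SUBBLOCKS_PER_BLOCK

print(SUBBLOCK // 1024, "KiB per subblock")  # 32 KiB, as stated above

def space_used(file_size):
    """Space consumed when allocation is rounded up to whole subblocks."""
    return math.ceil(file_size / SUBBLOCK) * SUBBLOCK

print(space_used(10 * 1024) // 1024, "KiB used by a 10 KiB file")  # 32, not 8192
```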

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist

Figure 2:  N to N Sequential Performance


From the results we can observe that performance rises very fast with the number of clients used and then reaches a plateau that is stable until the maximum number of threads that IOzone allows is reached; therefore, large-file sequential performance is stable even for 1024 concurrent clients. Notice that both read and write performance benefited from doubling the number of drives. The maximum read performance was limited by the bandwidth of the two IB EDR links used on the storage nodes starting at 8 threads, and the ME4 arrays may have some extra performance available. Similarly, notice that the maximum write performance increased from 16.7 to 20.4 GB/s at 64 and 128 threads, which is closer to the ME4 arrays' maximum specs (22 GB/s).

Here it is important to remember that GPFS's preferred mode of operation is scattered, and the solution was formatted to use that mode. In this mode, blocks are allocated from the very beginning of operation in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a lower initial maximum performance, that performance is maintained fairly constant regardless of how much space is used on the file system. That is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution and therefore have the highest possible performance the HDDs can provide, but as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction in performance.
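The outer-versus-inner track effect described above can be illustrated with a toy model. At constant rotational speed, sustained transfer rate scales with how much data passes under the head per revolution, which is roughly proportional to track radius; the radii, RPM, and recording density below are illustrative assumptions, not measurements of the drives used:

```python
# Toy model of zoned bit recording: sequential throughput scales roughly with
# track radius at constant RPM. All numbers are illustrative for a 3.5" HDD.
RPM = 7200
BYTES_PER_MM = 6_000   # assumed linear recording density, bytes per mm of track

def track_mb_s(radius_mm):
    """Sustained MB/s = track length * linear density * revolutions per second."""
    circumference_mm = 2 * 3.14159 * radius_mm
    return circumference_mm * BYTES_PER_MM * (RPM / 60) / 1e6

outer = track_mb_s(45)  # outermost track (assumed radius)
inner = track_mb_s(20)  # innermost track (assumed radius)
print(f"outer track {outer:.0f} MB/s vs inner track {inner:.0f} MB/s")
# Scattered allocation averages over the whole surface from day one, trading
# peak empty-filesystem speed for performance that stays flat as space fills.
```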

 

Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads (since there were not enough cores for 1024 threads), and results are contrasted with the solution without the capacity expansion.
Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. These benchmark tests used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation of those matters.

The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 512, incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G 

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G


Figure 3: N to 1 Sequential Performance

From the results we can observe again that the extra drives benefit read and write performance. Performance rises again very fast with the number of clients used and then reaches a plateau that is fairly stable for reads and writes all the way to the maximum number of threads used in this test. Notice that the maximum read performance was 24.8 GB/s at 16 threads, where the bottleneck was the InfiniBand EDR interface, with the ME4 arrays still having some extra performance available. From that point, read performance decreased until reaching the plateau at around 23.8 GB/s. Similarly, notice that the maximum write performance of 19.3 GB/s was reached at 8 threads, followed by a plateau.
 

Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with FIO version 3.7 instead of the traditional IOzone. The intention, as explained in the previous blog, was to take advantage of a larger queue depth to investigate the maximum possible performance that ME4084 arrays can deliver (previous tests for different ME4 solutions showed that the ME4084 arrays need more IO pressure than IOzone can deliver to reach their random IO limits).

Tests executed varied from a single thread up to 512 threads, since there were not enough client cores for 1024 threads. Each thread used a different file, and the threads were assigned round robin on the client nodes. These benchmark tests used 4 KiB blocks to emulate small-block traffic, with a queue depth of 16. Results from the large-size solution and the capacity expansion are compared.

Caching effects were again minimized by setting the GPFS page pool tunable to 16GiB and using files two times that size. The first performance test section has a more complete explanation of why this is effective on GPFS.
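A fio job file matching the workload described (4 KiB random IO, queue depth 16, one file per thread) might look like the sketch below. This is a reconstruction under stated assumptions; the file location, size, and job name are illustrative, since the original job file is not included in the article:

```ini
; Hypothetical fio job approximating the 4 KiB random test described above.
[global]
ioengine=libaio      ; asynchronous IO so the queue depth takes effect
direct=1             ; bypass the client page cache
bs=4k                ; 4 KiB blocks, as in the test
iodepth=16           ; queue depth 16, as in the test
rw=randwrite         ; use rw=randread for the read pass
size=32g             ; per-thread file size (assumed for illustration)
directory=/mmfs1/perftest
numjobs=64           ; one job (and one file) per thread; varied per data point

[randtest]
```

Because `numjobs` gives each job its own file by default, this matches the "each thread was using a different file" setup described above.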

Figure 4:  N to N Random Performance

From the results we can observe that write performance starts at a high value of 29.1K IOPS and rises steadily up to 64 threads, where it seems to reach a plateau at around 40K IOPS. Read performance, on the other hand, starts at 1.4K IOPS and increases almost linearly with the number of clients used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 25.6K IOPS at 64 threads, where it seems to be close to reaching a plateau. Using more threads would require more than the 16 compute nodes to avoid resource starvation and a lower apparent performance, where the arrays could in fact maintain the performance.

 

Metadata performance with MDtest using empty files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), measuring the number of creates, stats, reads, and removes the solution can handle, and the results were contrasted with the large-size solution.

To properly evaluate the solution in comparison to other Dell EMC HPC storage solutions and the previous blog results, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work was designated to have two ME4024s. This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to four before adding another metadata module. Additional ME4024 arrays are expected to increase metadata performance linearly with each additional array, except maybe for Stat operations (and Reads for empty files): since those numbers are very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.

The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F

Since performance results can be affected by the total number of IOPS, the number of files per directory, and the number of threads, it was decided to keep the total number of files fixed at 2 Mi files (2^21 = 2,097,152), keep the number of files per directory fixed at 1024, and vary the number of directories as the number of threads changed, as shown in Table 3.

Table 3: MDtest distribution of files on directories

Number of threads    Directories per thread    Total number of files
   1                 2048                      2,097,152
   2                 1024                      2,097,152
   4                  512                      2,097,152
   8                  256                      2,097,152
  16                  128                      2,097,152
  32                   64                      2,097,152
  64                   32                      2,097,152
 128                   16                      2,097,152
 256                    8                      2,097,152
 512                    4                      2,097,152
1024                    2                      2,097,152
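The values in Table 3 follow directly from keeping the total file count fixed at 2^21 with 1024 files per directory; a quick sketch verifying the distribution:

```python
# Verify Table 3: total files fixed at 2**21, 1024 files per directory,
# so directories per thread halve each time the thread count doubles.
TOTAL_FILES = 2**21        # 2,097,152
FILES_PER_DIR = 1024

for threads in [2**i for i in range(11)]:  # 1 .. 1024
    dirs_per_thread = TOTAL_FILES // (FILES_PER_DIR * threads)
    # every combination still covers the full fixed file count
    assert dirs_per_thread * FILES_PER_DIR * threads == TOTAL_FILES
    print(f"{threads:5d} threads -> {dirs_per_thread:4d} directories per thread")
```

The `dirs_per_thread` value is what gets passed to MDtest as the `$Directories` variable (`-b`) in the command above.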




Figure 5: Metadata Performance - Empty Files

First, notice that the scale chosen was logarithmic with base 10, to allow comparing operations that differ by several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look very similar, and people tend to handle and remember numbers based on powers of 10 better.
The system gets very good results, with Stat and Read operations reaching their peak values at 64 threads with almost 11M op/s and 4.7M op/s respectively. Removal operations attained a maximum of 170.6K op/s at 16 threads, and Create operations achieved their peak at 32 threads with 222.1K op/s. Stat and Read operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s for Stats and 2M op/s for Reads. Create and Removal are more stable once they reach a plateau, remaining above 140K op/s for Removal and 120K op/s for Create. Notice that, as expected, the extra drives do not much affect most of the metadata operations on empty files.
 

Metadata performance with MDtest using 4 KiB files

This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used. 
The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K

Figure 6:  Metadata Performance - Small files (4K)

The system gets very good results for Stat and Removal operations, reaching their peak values at 256 threads with 8.2M op/s and 400K op/s respectively. Read operations attained a maximum of 44.8K op/s and Create operations achieved their peak with 68.1K op/s, both at 512 threads. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s for Stats and 280K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows. As can be observed, the extra drives of the capacity expansion provide only marginal changes in metadata performance.
Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase for each operation. Unless a whole file fits inside its inode, the data targets on the ME4084s will be used to store the 4K files, limiting performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside, and any file bigger than that will use data targets.
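The data-in-inode limit mentioned above can be sketched numerically. The exact metadata overhead inside a Spectrum Scale inode varies, so the overhead value below is an illustrative assumption consistent with the "around 3 KiB" figure in the text:

```python
# Whether a small file's data fits inside a 4 KiB inode (Spectrum Scale can
# store file data in the inode when it fits alongside the inode metadata).
INODE_SIZE = 4096
METADATA_OVERHEAD = 1024   # assumed bytes of inode metadata (illustrative)

def stored_in_inode(file_size):
    """True if the file data fits in the inode and never touches data targets."""
    return file_size <= INODE_SIZE - METADATA_OVERHEAD

print(stored_in_inode(3 * 1024))  # a ~3 KiB file fits in the inode
print(stored_in_inode(4 * 1024))  # a 4 KiB file spills to the ME4084 data targets
```

This is why the 4 KiB MDtest runs exercise the data targets as well as the metadata module, unlike the empty-file runs.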
 


Conclusions and Future Work

The solution with expanded capacity was able to improve performance, not only for random accesses but also for sequential performance. That was expected, since scattered mode behaves like randomized accesses and having more disks allows the improvement. That performance, summarized in Table 4, is expected to be stable from an empty file system until it is almost full. Furthermore, the solution scales in capacity and performance linearly as more storage node modules are added, and a similar performance increase can be expected from the optional High Demand Metadata Module. This solution provides HPC customers with a very reliable parallel file system used by many Top500 HPC clusters. In addition, it provides exceptional search capabilities, advanced monitoring and management, and adding optional gateways allows file sharing via ubiquitous standard protocols like NFS, SMB, and others to as many clients as needed.

Table 4: Peak and sustained performance

Large sequential N clients to N files:
    Peak: Write 20.4 GB/s, Read 24.2 GB/s; Sustained: Write 20.3 GB/s, Read 24 GB/s
Large sequential N clients to single shared file:
    Peak: Write 19.3 GB/s, Read 24.8 GB/s; Sustained: Write 19.3 GB/s, Read 23.8 GB/s
Random small blocks N clients to N files:
    Peak: Write 40K IOPS, Read 25.6K IOPS; Sustained: Write 40.0K IOPS, Read 19.3K IOPS
Metadata Create empty files: Peak 169.4K IOPS; Sustained 123.5K IOPS
Metadata Stat empty files: Peak 11M IOPS; Sustained 3.2M IOPS
Metadata Read empty files: Peak 4.7M IOPS; Sustained 2.4M IOPS
Metadata Remove empty files: Peak 170.6K IOPS; Sustained 156.5K IOPS
Metadata Create 4KiB files: Peak 68.1K IOPS; Sustained 68.1K IOPS
Metadata Stat 4KiB files: Peak 8.2M IOPS; Sustained 3M IOPS
Metadata Read 4KiB files: Peak 44.8K IOPS; Sustained 44.8K IOPS
Metadata Remove 4KiB files: Peak 400K IOPS; Sustained 280K IOPS



Since the solution is intended to be released with Cascade Lake CPUs and faster RAM, once the system has the final configuration some performance spot checks will be done. Testing the optional High Demand Metadata Module with at least 2x ME4024s and 4 KiB files is also needed to better document how metadata performance scales when data targets are involved. In addition, performance for the gateway nodes will be measured and reported, along with any relevant results from the spot checks, in a new blog or a white paper. Finally, more solution components are planned to be tested and released to provide even more capabilities.

 

Article Properties


Affected Product

Dell EMC Ready Solution Resources

Last Published Date

26 Sept 2023

Version

5

Article Type

Solution