Dell EMC Ready Solution for HPC PixStor Storage


Article written by Mario Gallegos of HPC and AI Innovation Lab in October 2019



Table of Contents

  1. Introduction
    1. Solution Architecture
    2. Solution Components
  2. Performance Characterization
    1. Sequential IOzone Performance N clients to N files
    2. Sequential IOR Performance N clients to 1 file
    3. Random small blocks IOzone Performance N clients to N files
    4. Metadata performance with MDtest using empty files
    5. Metadata performance with MDtest using 4 KiB files
    6. Metadata Performance using MDtest with 3K files
  3. Advanced Analytics
  4. Conclusions and Future Work



Introduction

Today’s HPC environments have increased demands for very high-speed storage that also frequently requires high capacity and distributed access via several standard protocols like NFS, SMB and others. Those high demand HPC requirements are typically covered by Parallel File Systems that provide concurrent access to a single file or a set of files from multiple nodes, very efficiently and securely distributing data to multiple LUNs across several servers.


Solution Architecture

In this blog we introduce Dell EMC's newest addition to the Parallel File System (PFS) solutions for HPC environments, the Dell EMC Ready Solution for HPC PixStor Storage. Figure 1 presents the reference architecture, which leverages Dell EMC PowerEdge R740 servers and the PowerVault ME4084 and ME4024 storage arrays, with the PixStor software from our partner company Arcastream.
PixStor includes the widely used General Parallel File System, also known as Spectrum Scale, as the PFS component, in addition to Arcastream software components such as advanced analytics, simplified administration and monitoring, efficient file search, advanced gateway capabilities and many others.


Figure 1: Reference Architecture.


Solution Components

This solution is planned to be released with the latest 2nd generation Intel Xeon Scalable CPUs, a.k.a. Cascade Lake, and some of the servers will use the fastest RAM available to them (2933 MT/s). However, due to the hardware available to prototype the solution and characterize its performance, servers with 1st generation Intel Xeon Scalable CPUs, a.k.a. Skylake, and slower RAM were used. Since the bottleneck of the solution is at the SAS controllers of the Dell EMC PowerVault ME40x4 arrays, no significant performance disparity is expected once the Skylake CPUs and RAM are replaced with the envisioned Cascade Lake CPUs and faster RAM. In addition, even though the latest version of PixStor that supported RHEL 7.6 was available at the time of configuring the system, it was decided to continue the QA process and use RHEL 7.5 and the previous minor version of PixStor for characterizing the system. Once the system is updated to Cascade Lake CPUs, the PixStor software will also be updated to the latest version, and some performance spot checks will be done to verify that performance remained close to the numbers reported in this document.

Because of the previously described situation, Table 1 lists the main components of the solution. The middle column has the components planned to be used at release time and therefore available to customers, and the last column is the component list actually used for characterizing the performance of the solution. The drives listed for data (12TB NLS) and metadata (960GB SSD) are the ones used for performance characterization; faster drives can provide better random IOPS and may improve create/removal metadata operations.

Finally, for completeness, the list of possible data HDDs and metadata SSDs was included, which is based on the drives supported as specified on the Dell EMC PowerVault ME4 support matrix, available online.

Table 1 Components to be used at release time and those used in the test bed



Performance Characterization

To characterize this new Ready Solution, we used the hardware specified in the last column of Table 1, including the optional High Demand Metadata Module. In order to assess the solution performance, the following benchmarks were used:

  • IOzone N to N sequential
  • IOR N to 1 sequential
  • IOzone random
  • MDtest

For all benchmarks listed above, the test bed had the clients described in Table 2 below. Since the number of compute nodes available for testing was 16, when a higher number of threads was required, those threads were equally distributed on the compute nodes (i.e. 32 threads = 2 threads per node, 64 threads = 4 threads per node, 128 threads = 8 threads per node, 256 threads = 16 threads per node, 512 threads = 32 threads per node, 1024 threads = 64 threads per node). The intention was to simulate a higher number of concurrent clients with the limited number of compute nodes. Since the benchmarks support a high number of threads, a maximum value up to 1024 was used (specified for each test), while avoiding excessive context switching and other related side effects from affecting performance results.
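The round-robin distribution described above can be generated with a small script. Below is a minimal sketch (not part of the original test harness): node names (node01..node16) and paths are illustrative assumptions, and the output follows the hostname/working-directory/executable-path line format that IOzone's `-+m` option expects.

```shell
#!/bin/bash
# Sketch: build a threadlist that spreads THREADS entries round-robin
# across 16 compute nodes. Node names and paths are illustrative only.
THREADS=64
NODES=16
: > threadlist
for (( i = 0; i < THREADS; i++ )); do
  node=$(( i % NODES + 1 ))
  # IOzone -+m line format: <hostname> <working directory> <path to iozone>
  printf 'node%02d /mmfs1/perftest /usr/bin/iozone\n' "$node" >> threadlist
done
```

With THREADS=64, each of the 16 nodes receives 4 entries, matching the 64-thread case in the distribution above.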

Table 2 Client test bed

Number of Client nodes         16
Client node                    C6320
Processors per client node     2 x Intel Xeon E5-2697 v4, 18 cores @ 2.30 GHz
Memory per client node         12 x 16GiB 2400 MT/s RDIMMs
BIOS                           2.8.0
OS Kernel                      3.10.0-957.10.1
GPFS version                   5.0.3


    Sequential IOzone Performance N clients to N files

    Sequential N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads.
Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. It is important to note that for GPFS that tunable sets the maximum amount of memory used for caching data, regardless of the amount of RAM installed and free. Also important to notice is that, while in previous Dell EMC HPC solutions the block size for large sequential transfers is 1 MiB, GPFS was formatted with 8 MiB blocks and therefore that value is used in the benchmark for optimal performance. That may look too large and apparently waste too much space, but GPFS uses subblock allocation to prevent that situation. In the current configuration, each block was subdivided into 256 subblocks of 32 KiB each.
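For reference, the tunables discussed above can be inspected or set with the standard Spectrum Scale administration CLI. This is a hedged sketch, not the exact commands used for this work; in particular, the file system device name `gpfs0` is an assumption.

```shell
# Sketch of the Spectrum Scale CLI for the tunables discussed above.
# The file system device name "gpfs0" is an assumed placeholder.
mmchconfig pagepool=16G     # cap memory used for data caching at 16 GiB
mmlsconfig pagepool         # verify the page pool setting
mmlsfs gpfs0 -B             # block size: 8388608 bytes (8 MiB)
mmlsfs gpfs0 -f             # minimum fragment (subblock) size: 32768 (32 KiB)
```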
    The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

    ./iozone -i0 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist
    ./iozone -i1 -c -e -w -r 8M -s 128G -t $Threads -+n -+m ./threadlist


    Figure 2: N to N Sequential Performance


From the results we can observe that performance rises very fast with the number of clients used and then reaches a plateau that is stable until the maximum number of threads that IOzone allows is reached, and therefore large file sequential performance is stable even for 1024 concurrent clients. Notice that the maximum read performance was 23 GB/s at 32 threads and very likely the bottleneck was the InfiniBand EDR interface, while the ME4 arrays still had some extra performance available. Similarly, notice that the maximum write performance of 16.7 GB/s was reached a bit early at 16 threads, and it is apparently low compared to the ME4 arrays' specs.
Here it is important to remember that GPFS's preferred mode of operation is scattered, and the solution was formatted to use it. In this mode, blocks are allocated from the very beginning in a pseudo-random fashion, spreading data across the whole surface of each HDD. While the obvious disadvantage is a smaller initial maximum performance, that performance is maintained fairly constant regardless of how much space is used on the file system. That is in contrast to other parallel file systems that initially use the outer tracks, which can hold more data (sectors) per disk revolution, and therefore have the highest possible performance the HDDs can provide; but as the system uses more space, inner tracks with less data per revolution are used, with the consequent reduction in performance.


    Sequential IOR Performance N clients to 1 file

Sequential N clients to a single shared file performance was measured with IOR version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 1024 threads.
Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files bigger than two times that size. These benchmark tests used 8 MiB blocks for optimal performance. The previous performance test section has a more complete explanation of those matters.
    The following commands were used to execute the benchmark for writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -w -s 1 -t 8m -b 128G

    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --mca btl_openib_allow_ib 1 --mca pml ^ucx --oversubscribe --prefix /mmfs1/perftest/ompi /mmfs1/perftest/lanl_ior/bin/ior -a POSIX -v -i 1 -d 3 -e -k -o /mmfs1/perftest/tst.file -r -s 1 -t 8m -b 128G



    Figure 3: N to 1 Sequential Performance

From the results we can observe that performance again rises very fast with the number of clients used and then reaches a plateau that is semi-stable for reads and very stable for writes, all the way to the maximum number of threads used in this test. Therefore, large single shared file sequential performance is stable even for 1024 concurrent clients. Notice that the maximum read performance was 23.7 GB/s at 16 threads and very likely the bottleneck was the InfiniBand EDR interface, while the ME4 arrays still had some extra performance available. Furthermore, read performance decreased from that value until reaching the plateau at around 20.5 GB/s, with a momentary decrease to 18.5 GB/s at 128 threads. Similarly, notice that the maximum write performance of 16.5 GB/s was reached at 16 threads, and it is apparently low compared to the ME4 arrays' specs.

    Random small blocks IOzone Performance N clients to N files

Random N clients to N files performance was measured with IOzone version 3.487. Tests executed varied from a single thread up to 1024 threads. These tests used 4 KiB blocks to emulate small-block traffic.
    Caching effects were minimized by setting the GPFS page pool tunable to 16GiB and using files two times that size. The first performance test section has a more complete explanation about why this is effective on GPFS.
    The following command was used to execute the benchmark in random IO mode for both writes and reads, where Threads was the variable with the number of threads used (1 to 1024 incremented in powers of two), and threadlist was the file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

    ./iozone -i2 -c -O -w -r 4K -s 32G -t $Threads -+n -+m ./threadlist


    Figure 4: N to N Random Performance

From the results we can observe that write performance starts at a high value of almost 8.2K IOPS, rises steadily up to 128 threads, where it reaches a plateau, and remains close to the maximum value of 16.2K IOPS. Read performance, on the other hand, starts very small at over 200 IOPS and increases almost linearly with the number of clients used (keep in mind that the number of threads is doubled for each data point), reaching the maximum performance of 20.4K IOPS at 512 threads without signs of having peaked. However, using more threads on the current 16 compute nodes, with two CPUs each where each CPU has 18 cores, runs into the limitation that there are not enough cores to run the maximum number of IOzone threads (1024) without incurring context switching (16 x 2 x 18 = 576 cores), which limits performance considerably. A future test with more compute nodes could check what random read performance can be achieved with 1024 threads with IOzone, or IOR could be used to investigate the behavior with more than 1024 threads.


    Metadata performance with MDtest using empty files

Metadata performance was measured with MDtest version 3.3.0, assisted by OpenMPI v4.0.1 to run the benchmark over the 16 compute nodes. Tests executed varied from a single thread up to 512 threads. The benchmark was used for files only (no directory metadata), getting the number of creates, stats, reads and removes the solution can handle.
To properly evaluate the solution in comparison to other Dell EMC HPC storage solutions, the optional High Demand Metadata Module was used, but with a single ME4024 array, even though the large configuration tested in this work was designated to have two ME4024s.
This High Demand Metadata Module can support up to four ME4024 arrays, and it is suggested to increase the number of ME4024 arrays to 4 before adding another metadata module. Additional ME4024 arrays are expected to increase metadata performance linearly with each additional array, except maybe for Stat operations (and Reads for empty files): since those numbers are very high, at some point the CPUs will become a bottleneck and performance will not continue to increase linearly.
    The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes. Similar to the Random IO benchmark, the maximum number of threads was limited to 512, since there are not enough cores for 1024 threads and context switching would affect the results, reporting a number lower than the real performance of the solution.

    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F


Since performance results can be affected by the total number of IOPS, the number of files per directory and the number of threads, it was decided to keep the total number of files fixed at 2M files (2^21 = 2,097,152), the number of files per directory fixed at 1024, and to vary the number of directories as the number of threads changed, as shown in Table 3.

    Table 3: MDtest distribution of files on directories

Number of Threads    Number of directories per thread    Total number of files
1                    2048                                2,097,152
2                    1024                                2,097,152
4                    512                                 2,097,152
8                    256                                 2,097,152
16                   128                                 2,097,152
32                   64                                  2,097,152
64                   32                                  2,097,152
128                  16                                  2,097,152
256                  8                                   2,097,152
512                  4                                   2,097,152
1024                 2                                   2,097,152
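The distribution in Table 3 follows a simple invariant: directories per thread times 1024 files per directory times the thread count always equals 2^21 files. It can be checked with plain shell arithmetic:

```shell
#!/bin/bash
# Verify the Table 3 invariant:
#   dirs/thread x 1024 files/dir x threads == 2^21 = 2,097,152 files
TOTAL=$(( 2**21 ))
PER_DIR=1024
for t in 1 2 4 8 16 32 64 128 256 512 1024; do
  dirs=$(( TOTAL / PER_DIR / t ))
  printf '%4d threads: %4d directories per thread (%d files)\n' \
         "$t" "$dirs" "$(( dirs * PER_DIR * t ))"
done
```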





    Figure 5: Metadata Performance - Empty Files

First, notice that the scale chosen was logarithmic with base 10, to allow comparing operations that have differences of several orders of magnitude; otherwise some of the operations would look like a flat line close to 0 on a linear graph. A log graph with base 2 could be more appropriate, since the number of threads is increased in powers of 2, but the graph would look quite similar, and people tend to handle and remember numbers based on powers of 10 better.


The system gets very good results, with Stat and Read operations reaching their peak values at 64 threads with 11.2M op/s and 4.8M op/s respectively. Removal operations attained their maximum of 169.4K op/s at 16 threads, and Create operations achieved their peak at 512 threads with 194.2K op/s. Stat and Read operations have more variability, but once they reach their peak value, performance does not drop below 3M op/s for Stats and 2M op/s for Reads. Create and Removal are more stable once they reach a plateau, and remain above 140K op/s for Removal and 120K op/s for Create.

    Metadata performance with MDtest using 4 KiB files

    This test is almost identical to the previous one, except that instead of empty files, small files of 4KiB were used.
    The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 4K -e 4K


    Figure 6: Metadata Performance - Small files (4K)

The system gets very good results for Stat and Removal operations, reaching their peak values at 128 threads with 7.7M op/s and 1M op/s respectively. Read operations attained their maximum of 37.3K op/s and Create operations achieved their peak with 55.5K op/s, both at 512 threads. Stat and Removal operations have more variability, but once they reach their peak value, performance does not drop below 4M op/s for Stats and 200K op/s for Removal. Create and Read have less variability and keep increasing as the number of threads grows.
Since these numbers are for a metadata module with a single ME4024, performance will increase with each additional ME4024 array; however, we cannot simply assume a linear increase for each operation. Unless a whole file fits inside its inode, the data targets on the ME4084s will be used to store the 4K files, limiting performance to some degree. Since the inode size is 4 KiB and it still needs to store metadata, only files of around 3 KiB will fit inside, and any file bigger than that will use data targets.


    Metadata Performance using MDtest with 3K files

This test is almost identical to the previous ones, except that small files of 3 KiB were used. The main difference is that these files fit completely inside the inode. Therefore, the storage nodes and their ME4084s are not used, improving the overall speed by using only SSD media for storage and fewer network accesses.
    The following command was used to execute the benchmark, where Threads was the variable with the number of threads used (1 to 512 incremented in powers of two), and my_hosts.$Threads is the corresponding file that allocated each thread on a different node, using round robin to spread them homogeneously across the 16 compute nodes.

    mpirun --allow-run-as-root -np $Threads --hostfile my_hosts.$Threads --prefix /mmfs1/perftest/ompi --mca btl_openib_allow_ib 1 /mmfs1/perftest/lanl_ior/bin/mdtest -v -d /mmfs1/perftest/ -i 1 -b $Directories -z 1 -L -I 1024 -y -u -t -F -w 3K -e 3K


    Figure 7: Metadata Performance - Small files (3K)

The system gets very good results for Stat and Read operations, reaching their peak values at 256 threads with 8.29M op/s and 5.06M op/s respectively. Removal operations attained their maximum of 609K op/s at 128 threads, and Create operations achieved their peak with 78K op/s at 512 threads. Stat and Read operations have more variability than Create and Removal. Removal has a small drop in performance for the two highest thread counts, suggesting that the sustained performance after 128 threads will be slightly over 400K op/s. Creates kept increasing up to 512 threads, but look like they are reaching a plateau, so the maximum performance may still be under 100K op/s.
Since small files like these are stored completely on the SSD-based metadata module, applications requiring superior small-file performance can use one or more optional High Demand Metadata Modules to increase small-file performance. However, files that fit in the inode are tiny by current standards. Also, since the metadata targets use RAID 1 with relatively small SSDs (the maximum size is 19.2TB), capacity will be limited compared to the storage nodes. Therefore, care must be exercised to avoid filling up the metadata targets, which can cause unnecessary failures and other problems.


    Advanced Analytics

    Among PixStor capabilities, monitoring the file system via advanced analytics can be essential to greatly simplify administration, helping to proactively or reactively find problems or potential issues. Next, we will briefly review some of these capabilities.
Figure 8 shows useful information based on file system capacity. On the left side are the file system's total space used and the top ten users by capacity used. On the right side are a historical view with capacity used across many years, then the top ten file types and top ten filesets, both based on capacity used, in a format similar to Pareto charts (without the lines for cumulative totals). With this information, it can be easy to find users getting more than their fair share of the file system, trends of capacity usage to assist decisions on future capacity growth, which files are using most of the space, or which projects are taking most of the capacity.

    Figure 8: PixStor Analytics - Capacity view

Figure 9 provides a file count view with two very useful ways to find problems. The first half of the screen has the top ten users in a pie chart, and the top ten file types and top ten filesets (think projects) in a format similar to Pareto charts (without the lines for cumulative totals), all based on number of files. This information can be used to answer some important questions, for example which users are monopolizing the file system by creating too many files, which type of file is creating a metadata nightmare, or which projects are using most of the resources.
The bottom half has a histogram with the number of files (frequency) for file sizes, using 5 categories for different file sizes. This can be used to get an idea of the file sizes used across the file system, which, coordinated with file types, can be used to decide whether compression would be beneficial.

    Figure 9: PixStor Analytics - File count view




    Conclusions and Future Work

The current solution was able to deliver fairly good performance, which is expected to be stable regardless of the used space (since the system was formatted in scattered mode), as can be seen in Table 4. Furthermore, the solution scales linearly in capacity and performance as more storage node modules are added, and a similar performance increase can be expected from the optional High Demand Metadata Module. This solution provides HPC customers with a very reliable parallel file system used by many Top 500 HPC clusters. In addition, it provides exceptional search capabilities and advanced monitoring and management, and adding optional gateways allows file sharing via ubiquitous standard protocols like NFS, SMB and others to as many clients as needed.

Table 4 Peak & Sustained Performance

                                                   Peak Performance          Sustained Performance
                                                   Write        Read         Write        Read
Large Sequential N clients to N files              16.7 GB/s    23 GB/s      16.5 GB/s    20.5 GB/s
Large Sequential N clients to single shared file   16.5 GB/s    23.8 GB/s    16.2 GB/s    20.5 GB/s
Random Small blocks N clients to N files           15.8K IOps   20.4K IOps   15.7K IOps   20.4K IOps
Metadata Create empty files                        169.4K IOps               127.2K IOps
Metadata Stat empty files                          11.2M IOps                3.3M IOps
Metadata Read empty files                          4.8M IOps                 2.4M IOps
Metadata Remove empty files                        194.2K IOps               144.8K IOps
Metadata Create 4KiB files                         55.4K IOps                55.4K IOps
Metadata Stat 4KiB files                           6.4M IOps                 4M IOps
Metadata Read 4KiB files                           37.3K IOps                37.3K IOps
Metadata Remove 4KiB files                         1M IOps                   219.5K IOps



Since the solution is intended to be released with Cascade Lake CPUs and faster RAM, once the system has its final configuration some performance spot checks will be done. Also, the optional High Demand Metadata Module needs to be tested with at least 2x ME4024s and 4 KiB files to better document how metadata performance scales when data targets are involved. In addition, performance for the gateway nodes will be measured and reported, along with any relevant results from the spot checks, in a new blog or a white paper. Finally, more solution components are planned to be tested and released to provide even more capabilities.





  • Article ID: SLN318841

    Last Date Modified: 11/18/2019 01:09 PM

