Article written by Nirmala Sundararajan of the Dell EMC HPC and AI Innovation Lab in November 2019
Table of Contents
- Solution Reference Architecture
- Hardware and Software configuration
- Solution Configuration Details
- R740xd, 24x NVMe Drives, Details on CPU Mapping
- Performance Characterization
- Conclusion and Future Work
The Dell EMC HPC team proudly announces the release of the "Dell EMC Ready Solutions for HPC BeeGFS Storage", which is the latest addition to the HPC storage portfolio. This solution uses R740xd servers, each with 24x Intel P4600 1.6TB NVMe Mixed Use Express Flash drives and two Mellanox ConnectX-5 InfiniBand EDR adapters. In this 24 NVMe drive configuration, 12x NVMe SSDs connect to each PCIe switch, and each switch is connected to one CPU via an x16 PCIe extender card. Moreover, each IB interface is connected to one CPU. Such a balanced configuration, with each CPU connected to one InfiniBand adapter and handling 12 NVMe SSDs, provides maximum performance by ensuring that the processors are equally occupied in handling I/O requests to and from the NVMe drives.
The focus of the solution is high performance I/O and it has been designed as a high-speed scratch solution. At the core of the solution is the use of high speed NVMe SSDs that offer very high bandwidth and low latency by removing the scheduler and queueing bottlenecks from the block layer. The BeeGFS file system also supports high aggregate I/O throughput.
Figure 1 shows the reference architecture of the solution. The management server is only connected via Ethernet to the metadata and storage servers. Each metadata and storage server has two InfiniBand links and is connected to the private network via Ethernet. The clients have one InfiniBand link and are connected to the private network via Ethernet.
Figure 1: Dell EMC Ready Solutions for HPC BeeGFS Storage - Reference Architecture
Tables 1 and 2 describe the hardware specifications of the management server and the metadata/storage servers, respectively. Table 3 describes the software versions used for the solution.
|Table 1 PowerEdge R640 Configuration (Management Server)
|Server
||Dell EMC PowerEdge R640
|Processor
||2x Intel Xeon Gold 5218 2.3 GHz, 16 cores
|Memory
||12 x 8GB DDR4 2666MT/s DIMMs - 96GB
|Local Disks
||6 x 300GB 15K RPM SAS 2.5in HDDs
|RAID Controller
||PERC H740P Integrated RAID Controller
|Out of Band Management
||iDRAC9 Enterprise with Lifecycle Controller
|Power Supplies
||Dual 1100W Power Supply Units
|Table 2 PowerEdge R740xd Configuration (Metadata and Storage Servers)
|Server
||Dell EMC PowerEdge R740xd
|Processor
||2x Intel Xeon Platinum 8268 CPU @ 2.90GHz, 24 cores
|Memory
||12 x 32GB DDR4 2933MT/s DIMMs - 384GB
|OS Drives
||2x 240GB M.2 SATA SSDs in RAID 1 for OS
|NVMe Drives
||24x Dell Express Flash NVMe P4600 1.6TB 2.5" U.2
|Mellanox EDR card
||2x Mellanox ConnectX-5 EDR card (Slots 1 & 8)
|Out of Band Management
||iDRAC9 Enterprise with Lifecycle Controller
|Power Supplies
||Dual 2000W Power Supply Units
|Table 3 Software Configuration (Metadata and Storage Servers)
|Systems Management Tool
||OpenManage Server Administrator 9.3.0-3407_A00
|*Intel® Data Center Tool
*For management and firmware update of Intel P4600 NVMe SSDs
The BeeGFS architecture consists of four main services:
- Management service
- Metadata service
- Storage service
- Client service
Except for the client service which is a kernel module, the management, metadata and storage services are user space processes. Figure 2 illustrates how the reference architecture of the Dell EMC Ready Solutions for HPC BeeGFS Storage maps to the general architecture of the BeeGFS file system.
Figure 2: BeeGFS file system on PowerEdge R740xd with NVMe SSDs
Each BeeGFS file system or namespace has only one management service. The management service is the first service that needs to be set up, because all other services must register with it when they are configured. A PowerEdge R640 is used as the management server. In addition to hosting the management service (beegfs-mgmtd.service), it also hosts the monitoring service (beegfs-mon.service), which collects statistics from the system and provides them to the user using the time series database InfluxDB. For visualization of data, beegfs-mon provides predefined Grafana panes that can be used out of the box. The management server has 6x 300GB HDDs configured in RAID 10 for the operating system and InfluxDB.
The metadata service is a scale-out service, which means that there can be many metadata services in a BeeGFS file system. However, each metadata service has exactly one metadata target to store metadata. On the metadata target, BeeGFS creates one metadata file per user created file. BeeGFS metadata is distributed on a per-directory basis. The metadata service provides the data striping information to the clients and is not involved in the data access between file open/close.
A PowerEdge R740xd with 24x Intel P4600 1.6TB NVMe drives is used for metadata storage. As the storage capacity requirements for BeeGFS metadata are very small, instead of using a dedicated metadata server, only the 12 drives on NUMA zone 0 were utilized to host the Metadata Targets (MDTs), while the remaining 12 drives on NUMA zone 1 host Storage Targets (STs).
Figure 3 shows the metadata server. The 12 drives enclosed in the yellow rectangle are the MDTs in the NUMA zone 0 whereas the 12 drives enclosed in the green rectangle are the STs in the NUMA zone 1. This configuration not only avoids NUMA issues but also provides enough metadata storage to facilitate scaling the capacity and performance as needed.
Figure 3: Metadata Server
Figure 4 shows the RAID configuration of the metadata server. It highlights how, in the metadata server, the drives in NUMA zone 0 host the MDTs and those in NUMA zone 1 host the STs, while the storage servers host STs in both NUMA zones.
Figure 4: Configuration of drives in the Metadata Server
The 12 drives used for metadata are configured as 6x RAID 1 disk groups of 2 drives each, with each group serving as an MDT. There are 6 metadata services running, each of which handles one MDT. The remaining 12 storage drives are configured in 3x RAID 0 disk groups of 4 drives each. There are three storage services running on NUMA zone 1, one service for each ST. So, the server which co-hosts the metadata and storage targets has 6 MDTs and 3 STs, and runs 6 metadata services and 3 storage services. Each MDT is an ext4 file system on a RAID 1 group. Each ST is an XFS file system on a RAID 0 group.
Like the metadata service, the storage service is also a scale-out service. There can be many instances of the storage service in a BeeGFS file system. However, unlike the metadata service, there can be several storage targets per storage service. The storage service stores the striped user file contents, also known as data chunk files.
Figure 5 shows the 5x PowerEdge R740xd servers used as storage servers.
Figure 5: Dedicated Storage Servers
Each storage server is configured with 6x RAID 0 groups, each of 4 drives, thus hosting 6 STs per server (3 per NUMA zone), as shown in Figure 6 given below:
Figure 6: Configuration of drives in the Storage Servers
In total, the base reference architecture configuration hosts 6 MDTs and 33 STs: 3 STs on the metadata server and 6 STs on each of the five dedicated storage servers. With 4 drives per ST, the 132 storage drives provide a raw capacity of 211 TB and a usable capacity of 190 TiB. The estimated usable capacity in TiB = Number of drives x capacity per drive in TB x 0.99 (file system overhead) x (10^12/2^40). This would be ideal as a mid-range scratch solution with enough metadata storage to facilitate adding more storage servers as capacity requirements increase.
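The capacity figures can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers given in the text (target counts, 4 drives per RAID 0 group, 1.6 TB per drive, the 0.99 overhead factor); the variable names are illustrative.

```python
# Layout taken from the text: 3 STs on the metadata server, 6 STs on each
# of the 5 dedicated storage servers, every ST a RAID 0 group of 4 drives.
DRIVE_TB = 1.6              # capacity per drive in TB (10^12 bytes)
DRIVES_PER_ST = 4           # drives per RAID 0 group
FS_OVERHEAD_FACTOR = 0.99   # file system overhead factor from the formula

storage_servers = 5
sts_per_storage_server = 6
sts_on_metadata_server = 3

total_sts = storage_servers * sts_per_storage_server + sts_on_metadata_server
storage_drives = total_sts * DRIVES_PER_ST

raw_tb = storage_drives * DRIVE_TB
# Convert decimal TB to binary TiB with the 10^12 / 2^40 factor.
usable_tib = raw_tb * FS_OVERHEAD_FACTOR * (10**12 / 2**40)

print(f"storage targets: {total_sts}")
print(f"storage drives:  {storage_drives}")
print(f"raw capacity:    {raw_tb:.1f} TB")
print(f"usable capacity: {usable_tib:.1f} TiB")
```

Changing `storage_servers` gives a rough estimate of how capacity grows when more storage servers are added to the same metadata server.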
In view of the following factors, a RAID 0 configuration was chosen for storage targets over RAID 10 configuration.
- Write performance was measured using the dd command by creating a 10GiB file with a 1MiB block size and direct I/O for data. For RAID 0 devices the average was about 5.1 GB/s per device, whereas for RAID 10 devices the average was 3.4 GB/s per device.
- StorageBench benchmark tests showed that the maximum throughput was 5.5 GB/s for the RAID 0 configuration, whereas it was 3.4 GB/s for the RAID 10 configuration. These results are similar to those obtained using the dd command.
- RAID 10 provides 50% utilization of the disk capacity and a similar 50% reduction in write performance. Using RAID 10 is an expensive way to obtain storage redundancy.
- NVMe drives are expensive and offer speedups which are best utilized in a RAID 0 configuration.
- The Intel P4600 SSDs used in the configuration are very reliable, with a Drive Writes Per Day (DWPD) rating of 3, which means that one could write 4.8 TB of data to each drive every single day for the next five years.
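The endurance figure in the last point follows directly from the DWPD rating. A small sketch of that arithmetic, using only the values quoted above (365-day years are an assumption made here for simplicity):

```python
DWPD = 3          # drive writes per day, from the text
DRIVE_TB = 1.6    # drive capacity in TB
YEARS = 5         # period quoted in the text
DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Data that can be written to one drive per day at the rated endurance.
daily_write_tb = DWPD * DRIVE_TB

# Total data written to one drive over the whole period, in PB.
lifetime_write_pb = daily_write_tb * DAYS_PER_YEAR * YEARS / 1000

print(f"per-drive daily writes:    {daily_write_tb:.1f} TB")
print(f"per-drive lifetime writes: {lifetime_write_pb:.2f} PB")
```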
The BeeGFS client module needs to be loaded on all the hosts that need to access the BeeGFS file system. When the beegfs-client is loaded, it will mount the file systems defined in the /etc/beegfs/beegfs-mounts.conf file instead of the usual approach based on /etc/fstab. Adopting this approach starts the beegfs-client like any other Linux service through the service startup script. It also enables the automatic recompilation of the BeeGFS client module after system updates.
When the client module is loaded, it will mount the file systems defined in beegfs-mounts.conf. It is possible to mount multiple BeeGFS instances on the same client as shown below:
$ cat /etc/beegfs/beegfs-mounts.conf
The above example shows two different file systems mounted on the same client. For the purpose of this testing, 32x C6420 nodes were used as clients.
In the 24x NVMe configuration of the PowerEdge R740xd server, there are two x16 NVMe bridge cards feeding the PCIe switches on the backplane, which fan out to the drives (each drive is x4) in the front, as shown in Figure 7 below:
Figure 7: R740xd, 24x NVMe Details on CPU Mapping
In Non-Uniform Memory Access (NUMA), system memory is divided into zones called nodes, which are allocated to CPUs or sockets. Access to memory that is local to a CPU is faster than memory connected to remote CPUs on the system. A threaded application typically performs best when the threads are accessing memory on the same NUMA node. The performance impact of NUMA misses is significant, generally starting at a 10% performance hit or higher. To improve performance, the services are configured to use specific NUMA zones to avoid unnecessary use of UPI cross-socket links thereby reducing latency. Each NUMA zone handles 12 drives and uses one of the two InfiniBand EDR interfaces on the servers. This NUMA separation is achieved by manually configuring NUMA balancing by creating custom systemd unit files and by configuring multihoming. Hence the automatic NUMA balancing is disabled, as shown below:
# cat /proc/sys/kernel/numa_balancing
0
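When pinning services to NUMA zones, it helps to confirm which zone each InfiniBand adapter and NVMe drive is attached to. The following is a minimal sketch that reads the `numa_node` attribute Linux exposes for PCI devices in sysfs; the device names `mlx5_0` and `nvme0` are hypothetical examples, and this is not the tooling used by the authors.

```python
from pathlib import Path


def read_numa_node(device_dir):
    """Return the NUMA node stored in <device_dir>/numa_node, or None."""
    node_file = Path(device_dir) / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None


def numa_balancing_enabled(proc_file="/proc/sys/kernel/numa_balancing"):
    """True if automatic NUMA balancing is on, False if off, None if unknown."""
    try:
        return Path(proc_file).read_text().strip() != "0"
    except OSError:
        return None


if __name__ == "__main__":
    print("automatic NUMA balancing enabled:", numa_balancing_enabled())
    # Hypothetical device names; list /sys/class/infiniband and
    # /sys/class/nvme on the actual server to find the real ones.
    for dev in ("/sys/class/infiniband/mlx5_0/device",
                "/sys/class/nvme/nvme0/device"):
        print(dev, "-> NUMA node", read_numa_node(dev))
```

A value of -1 from `numa_node` means the kernel has no NUMA affinity recorded for that device.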
Figure 8 shows the testbed where the InfiniBand connections to the NUMA zones are highlighted. Each server has two IP links and the traffic through NUMA 0 zone is handled by interface IB0 while the traffic through NUMA 1 zone is handled by interface IB1.
This section presents the performance evaluation that helps characterize the Dell EMC Ready Solution for HPC BeeGFS High Performance Storage Solution. For further details and updates, please look for a white paper that will be published later. The system performance was evaluated using the IOzone benchmark. The solution is tested for sequential read and write throughput, and random read and write IOPS. Table 4 describes the configuration of the C6420 servers that were used as BeeGFS clients for the performance studies presented in this blog.
|Table 4 Client Configuration
|Clients
||32x Dell EMC PowerEdge C6420 Compute Nodes
|Processor
||2x Intel Xeon Gold 6148 CPU @ 2.40GHz with 20 cores per processor
|Memory
||12 x 16GB DDR4 2666 MT/s DIMMs - 192GB
|Boot Drives
||2x 120GB M.2 boot drives in RAID 1 for OS
|Operating System
||Red Hat Enterprise Linux Server release 7.6
|Interconnect
||1x Mellanox ConnectX-4 EDR card
To evaluate sequential reads and writes, the IOzone benchmark was used in the sequential read and write mode. These tests were conducted on multiple thread counts starting at 1 thread and increasing in powers of 2, up to 1024 threads. At each thread count, an equal number of files were generated since this test works on one file per thread, or the N clients to N files (N-N) case. The processes were distributed across 32 physical client nodes in a round robin or cyclical fashion so that the requests are equally distributed and the load is balanced. An aggregate file size of 8TB was selected, which was equally divided among the number of threads within any given test. The aggregate file size was chosen large enough to minimize the effects of caching from the servers as well as from BeeGFS clients. IOzone was run in a combined mode of write then read (-i 0, -i 1) to allow it to coordinate the boundaries between the operations. For this testing and results, we used a 1MiB record size for every run. The commands used for Sequential N-N tests are given below:
Sequential Writes and Reads: iozone -i 0 -i 1 -c -e -w -r 1m -I -s $Size -t $Thread -+n -+m /path/to/threadlist
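The round robin placement and the per-thread file size can be sketched as follows. IOzone's `-+m` option takes a client file with one line per thread containing the client name, the working directory on that client, and the path to the iozone executable. The host names and the iozone path below are hypothetical, and treating 8TB as 8192 GiB is an assumption made so that the size divides evenly.

```python
def build_threadlist(clients, threads, workdir, iozone_path):
    """Assign threads to clients round robin; return one line per thread."""
    lines = []
    for t in range(threads):
        host = clients[t % len(clients)]
        lines.append(f"{host} {workdir} {iozone_path}")
    return lines


def per_thread_size_gib(aggregate_gib, threads):
    """Split the aggregate file size equally among the threads."""
    return aggregate_gib // threads


# Hypothetical client host names for the 32 compute nodes.
clients = [f"client{i:02d}" for i in range(1, 33)]
AGGREGATE_GIB = 8 * 1024  # assumption: 8TB treated as 8192 GiB

for threads in [2**k for k in range(11)]:  # 1, 2, 4, ... 1024
    size = per_thread_size_gib(AGGREGATE_GIB, threads)
    threadlist = build_threadlist(
        clients, threads, "/mnt/beegfs/benchmark", "/usr/bin/iozone")
    print(f"{threads:5d} threads -> -s {size}g, "
          f"{len(set(l.split()[0] for l in threadlist))} clients used")
```

Writing `threadlist` to a file, one entry per line, gives the file that is passed as `/path/to/threadlist` in the command above.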
OS caches were also dropped or cleaned on the client nodes between iterations as well as between write and read tests by running the command:
# sync && echo 3 > /proc/sys/vm/drop_caches
The default stripe count for BeeGFS is 4. However, the chunk size and the number of targets per file can be configured on a per-directory basis. For all these tests, the BeeGFS chunk size was chosen to be 2MB and the stripe count was chosen to be 3, since we have three targets per NUMA zone, as shown below:
$ beegfs-ctl --getentryinfo --mount=/mnt/beegfs /mnt/beegfs/benchmark --verbose
Metadata node: node001-numa0-4 [ID: 4]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 2M
+ Number of storage targets: desired: 3
+ Storage Pool: 1 (Default)
Inode hash path: 7/5E/0-5D9BA1BC-1
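To make the stripe pattern concrete, the sketch below maps a byte offset in a file to a position in its stripe set, using a plain RAID0-style round robin model with the chunk size and target count shown above. It is a simplified illustration, not BeeGFS code: the returned target index is a position within the file's three chosen targets rather than a BeeGFS target ID, and interpreting 2M as 2 MiB is an assumption.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # assumption: "2M" means 2 MiB
NUM_TARGETS = 3               # desired number of storage targets


def locate(offset, chunk_size=CHUNK_SIZE, num_targets=NUM_TARGETS):
    """Return (target_index, chunk_on_target, offset_in_chunk) for a byte offset."""
    chunk_index = offset // chunk_size          # which chunk of the file
    target_index = chunk_index % num_targets    # round robin over the targets
    chunk_on_target = chunk_index // num_targets
    return target_index, chunk_on_target, offset % chunk_size


# With a 1 MiB record size, two consecutive records land in the same chunk,
# and every third chunk returns to the same target.
for record in range(8):
    offset = record * 1024 * 1024
    print(f"record {record}: offset {offset:>8} -> {locate(offset)}")
```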
The transparent huge pages were disabled, and the following tuning options are in place on the metadata and storage servers:
- vm.dirty_background_ratio = 5
- vm.dirty_ratio = 20
- vm.min_free_kbytes = 262144
- vm.vfs_cache_pressure = 50
- vm.zone_reclaim_mode = 2
- kernel.numa_balancing = 0
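A quick way to confirm that a server carries these settings is to compare the values under /proc/sys with the list above. The helper below is a minimal sketch; the function names are illustrative, and it does not check the transparent huge pages setting.

```python
from pathlib import Path

# Kernel tunables listed above, as strings for direct comparison.
EXPECTED = {
    "vm.dirty_background_ratio": "5",
    "vm.dirty_ratio": "20",
    "vm.min_free_kbytes": "262144",
    "vm.vfs_cache_pressure": "50",
    "vm.zone_reclaim_mode": "2",
    "kernel.numa_balancing": "0",
}


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as vm.dirty_ratio to its file under /proc/sys."""
    return Path(root, *name.split("."))


def find_mismatches(expected=EXPECTED, root="/proc/sys"):
    """Return {name: (expected, actual)} for values that differ or are unreadable."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = sysctl_path(name, root).read_text().strip()
        except OSError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in find_mismatches().items():
        print(f"{name}: expected {want}, found {have}")
```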
In addition to the above, the following BeeGFS tuning options were used:
- tuneTargetChooser parameter was set to "roundrobin" in the metadata configuration file
- tuneNumWorkers parameter was set to 24 for metadata and 32 for storage
- connMaxInternodeNum parameter was set to 32 for metadata, 12 for storage and 24 for clients
Figure 9: Sequential IOzone 8TB aggregate file size
In Figure 9, we see that peak read performance is 132 GB/s at 1024 threads and peak write is 121 GB/s at 256 threads. As per the technical specifications of the Intel P4600 1.6 TB NVMe SSDs, each drive can provide 3.2 GB/s peak read performance and 1.3 GB/s peak write performance, which allows a theoretical peak of 422 GB/s for reads and 172 GB/s for writes. However, here the network is the limiting factor. We have a total of 11 InfiniBand EDR links for the storage servers in the set up. Each link can provide a theoretical peak performance of 12.4 GB/s which allows a theoretical peak performance of 136.4 GB/s. The achieved peak read and write performance are 97% and 89% respectively of the theoretical peak performance.
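The theoretical peaks and efficiencies quoted above can be checked with the following arithmetic. All inputs come from the text; the count of 132 storage drives is derived from 33 STs of 4 drives each.

```python
STORAGE_DRIVES = 33 * 4        # 33 STs, each a RAID 0 group of 4 drives
DRIVE_READ_GBPS = 3.2          # per-drive peak read, GB/s
DRIVE_WRITE_GBPS = 1.3         # per-drive peak write, GB/s
EDR_LINKS = 11                 # InfiniBand EDR links serving storage targets
EDR_LINK_GBPS = 12.4           # theoretical peak per link, GB/s

MEASURED_READ_GBPS = 132       # peak read at 1024 threads
MEASURED_WRITE_GBPS = 121      # peak write at 256 threads

drive_read_peak = STORAGE_DRIVES * DRIVE_READ_GBPS
drive_write_peak = STORAGE_DRIVES * DRIVE_WRITE_GBPS
network_peak = EDR_LINKS * EDR_LINK_GBPS

# Efficiency is measured against the network peak, the limiting factor here.
read_efficiency = MEASURED_READ_GBPS / network_peak
write_efficiency = MEASURED_WRITE_GBPS / network_peak

print(f"drive peak read:  {drive_read_peak:.1f} GB/s")
print(f"drive peak write: {drive_write_peak:.1f} GB/s")
print(f"network peak:     {network_peak:.1f} GB/s")
print(f"read efficiency:  {read_efficiency:.0%}")
print(f"write efficiency: {write_efficiency:.0%}")
```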
The single thread write performance is observed to be ~3 GB/s and read at ~3 GB/s. We observe that the write performance scales linearly, peaks at 256 threads and then starts decreasing. At lower thread counts, read and write performance are the same, because up to 8 threads we have 8 clients writing 8 files across 24 targets, which means that not all storage targets are being fully utilized. We have 33 storage targets in the system and hence, with a stripe count of 3, at least 11 threads are needed to fully utilize all the servers. The read performance registers a steady linear increase with the increase in the number of concurrent threads and we observe almost similar performance at 512 and 1024 threads.
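The relationship between thread count and the number of storage targets in use can be sketched as below. This is an upper bound under the assumption that every file lands on three distinct targets not shared with any other file, which the round robin target chooser aims for but does not guarantee.

```python
import math

TOTAL_STS = 33     # storage targets in the system
STRIPE_COUNT = 3   # targets per file


def max_targets_in_use(threads):
    """Upper bound on distinct targets touched with one file per thread."""
    return min(TOTAL_STS, threads * STRIPE_COUNT)


# Smallest thread count that can touch every storage target.
min_threads_for_all_targets = math.ceil(TOTAL_STS / STRIPE_COUNT)

for threads in (1, 2, 4, 8, 16):
    print(f"{threads:2d} threads -> at most {max_targets_in_use(threads)} targets")
print("threads needed to reach all targets:", min_threads_for_all_targets)
```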
We also observe that the read performance is lower than writes for thread counts from 16 to 128 and then the read performance starts scaling. This is because while a PCIe read operation is a Non-Posted Operation, requiring both a request and a completion, a PCIe write operation is a fire and forget operation. Once the Transaction Layer Packet is handed over to the Data Link Layer, the operation completes. A write operation is a "Posted" operation that consists of a request only.
Read throughput is typically lower than the write throughput because reads require two transactions instead of a single write for the same amount of data. PCI Express uses a split transaction model for reads. The read transaction includes the following steps:
- The requester sends a Memory Read Request (MRR).
- The completer sends out the acknowledgement to MRR.
- The completer returns a Completion with Data.
The read throughput depends on the delay between the time the read request is issued and the time the completer takes to return the data. However, when the application issues enough read requests to cover this delay, throughput is maximized. That is the reason why, while the read performance is less than that of the writes from 16 threads to 128 threads, we measure an increased throughput when the number of requests increases. A lower throughput is measured when the requester waits for completion before issuing subsequent requests. A higher throughput is registered when multiple requests are issued to amortize the delay after the first data returns.
More details regarding PCI Express Direct Memory Access are available at https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html#nik1412547565760
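The effect of outstanding requests on read throughput can be illustrated with a simple latency-bound model: throughput grows with the number of requests in flight until the link saturates. This is a generic simplification and not a measurement from this system; the 1 ms completion delay and the 1 MB request size are hypothetical values chosen for illustration, while the 12.4 GB/s ceiling is the per-link figure quoted earlier.

```python
def read_throughput(outstanding, request_bytes, delay_s, link_bytes_per_s):
    """Bytes per second delivered with a fixed number of requests in flight."""
    latency_bound = outstanding * request_bytes / delay_s
    return min(link_bytes_per_s, latency_bound)


REQUEST_BYTES = 1_000_000   # hypothetical request size
DELAY_S = 0.001             # hypothetical request-to-completion delay
LINK_BYTES_PER_S = 12.4e9   # one EDR link, theoretical peak

for outstanding in (1, 2, 4, 8, 16, 32):
    bw = read_throughput(outstanding, REQUEST_BYTES, DELAY_S, LINK_BYTES_PER_S)
    print(f"{outstanding:2d} outstanding -> {bw / 1e9:.1f} GB/s")
```

With one request in flight the requester spends most of its time waiting for the completion; once enough requests overlap to cover the delay, the link becomes the limit.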
To evaluate random IO performance, IOzone was used in the random mode. Tests were conducted on thread counts from 4 threads up to 1024 threads. The Direct IO option (-I) was used to run IOzone so that all operations bypass the buffer cache and go directly to the disk. A BeeGFS stripe count of 3 and chunk size of 2MB were used. A 4KiB request size was used in IOzone. Performance is measured in I/O operations per second (IOPS). The OS caches were dropped between the runs on the BeeGFS servers as well as BeeGFS clients. The command used for executing the random writes and reads is given below:
Random reads and writes: iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist
Figure 10: Random Read and Write Performance using IOzone with 8TB aggregate file size
The random writes peak at ~3.6 Million IOPS at 512 threads and the random reads peak at ~3.5 Million IOPS at 1024 threads as shown in Figure 10. Both the write and read performance show a higher performance when there are a higher number of IO requests. This is because the NVMe standard supports up to 64K I/O queues and up to 64K commands per queue. This large pool of NVMe queues provides higher levels of I/O parallelism and hence we observe IOPS exceeding 3 Million.
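For a sense of scale, the IOPS figures can be converted to bandwidth at the 4KiB request size, and the NVMe queue limits quoted above can be multiplied out. This is only arithmetic on numbers from the text; the helper name is illustrative.

```python
REQUEST_BYTES = 4 * 1024  # 4KiB request size used in the random tests


def iops_to_gb_per_s(iops, request_bytes=REQUEST_BYTES):
    """Convert an IOPS figure to GB/s (10^9 bytes) at a given request size."""
    return iops * request_bytes / 1e9


peak_write_gbps = iops_to_gb_per_s(3.6e6)
peak_read_gbps = iops_to_gb_per_s(3.5e6)

# Queue limits as quoted in the text: 64K queues, 64K commands per queue.
max_queues = 64 * 1024
max_commands_per_queue = 64 * 1024
max_outstanding_commands = max_queues * max_commands_per_queue

print(f"random write: {peak_write_gbps:.1f} GB/s")
print(f"random read:  {peak_read_gbps:.1f} GB/s")
print(f"commands that can be queued per controller: {max_outstanding_commands}")
```

Both bandwidth figures are far below the sequential peaks, which is consistent with small random requests being bound by operation rate rather than by network bandwidth.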
This blog announces the release of the Dell EMC High Performance BeeGFS Storage Solution and highlights its performance characteristics. The solution has a peak sequential read and write performance of ~132 GB/s and ~121 GB/s respectively and the random writes peak at ~3.6 Million IOPS and random reads at ~3.5 Million IOPS.
This blog is part one of "BeeGFS Storage Solution" which has been designed with a focus on scratch space with high performance. Stay tuned for Part 2 of the blog series that will describe how the solution can be scaled by incrementing the number of servers to increase performance and capacity. Part 3 of the blog series will discuss additional features of BeeGFS and will highlight the use of "StorageBench", the built-in storage targets benchmark of BeeGFS.
As a part of the next steps, we will be publishing a white paper later with the metadata performance and the N threads to 1 file IOR performance and with additional details about design considerations, tuning and configuration.
 BeeGFS Documentation: https://www.beegfs.io/wiki/
 How to connect two interfaces on the same subnet: https://access.redhat.com/solutions/30564
 PCI Express Direct Memory Access Reference Design using External Memory: https://www.intel.com/content/www/us/en/programmable/documentation/nik1412547570040.html#nik1412547565760