HPC applications performance on C4140 Configuration M



The article was written by Frank Han, Rengan Xu and Quy Ta of Dell EMC HPC & AI Innovation Lab in January 2019.

Abstract

Dell EMC recently added a new "Configuration M" option to the PowerEdge C4140. As this latest option joins the C4140 family, this article presents the results of a study comparing the performance of Configuration M with that of Configuration K for several HPC applications, including HPL, GROMACS and NAMD.

Overview

The PowerEdge C4140 is a 2-socket, 1U rack server. It supports Intel Skylake processors, up to 24 DIMM slots, and four double-width NVIDIA Volta GPU cards. In the C4140 server family, the two configurations that support NVLINK are Configuration K and Configuration M. Figure 1 compares the two topologies. The two major differences between them are described below:

  1. Higher PCIe bandwidth. In Configuration K, the CPUs are connected to the four GPUs through only one PCIe link. In Configuration M, each GPU is connected directly to a CPU with a dedicated PCIe link. The two CPUs and four GPUs are therefore joined by four PCIe links in total, which gives Configuration M higher PCIe bandwidth.
  2. Lower latency. Configuration M has no PCIe switch between the CPUs and GPUs. The direct connections reduce the number of hops for data transmission between CPU and GPU, so the round-trip latency is lower in Configuration M.
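
The bandwidth difference between the two topologies can be estimated with a short calculation. The sketch below is a back-of-the-envelope illustration, not a measurement: it assumes PCIe 3.0 x16 links (8 GT/s per lane with 128b/130b encoding) and an even split of a shared link among the GPUs. The function names are hypothetical and chosen only for this example.

```python
# Rough per-GPU PCIe bandwidth for the two topologies.
# Assumptions (not from the article): PCIe 3.0 signaling, x16 links,
# and an even split of bandwidth when GPUs share a link.

PCIE3_GT_PER_S = 8.0               # giga-transfers per second per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding
LANES = 16


def link_bandwidth_gb_s(lanes=LANES):
    """Theoretical one-direction bandwidth of one PCIe 3.0 link in GB/s."""
    return PCIE3_GT_PER_S * ENCODING_EFFICIENCY * lanes / 8  # bits -> bytes


def per_gpu_bandwidth_gb_s(num_gpus, num_links):
    """Bandwidth per GPU when all GPUs transfer at once over num_links links."""
    return link_bandwidth_gb_s() * num_links / num_gpus


# Configuration K: four GPUs behind one link. Configuration M: one link per GPU.
config_k = per_gpu_bandwidth_gb_s(num_gpus=4, num_links=1)
config_m = per_gpu_bandwidth_gb_s(num_gpus=4, num_links=4)

print(f"Configuration K: {config_k:.2f} GB/s per GPU (theoretical)")
print(f"Configuration M: {config_m:.2f} GB/s per GPU (theoretical)")
print(f"Ratio M/K: {config_m / config_k:.1f}x")
```

These are theoretical upper bounds; measured bandwidth is lower because of protocol overhead and, in Configuration K, the extra hop through the switch.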


This blog presents the performance of HPC applications on these two configurations. We benchmarked HPL, GROMACS and NAMD with V100-SXM2 16G GPUs. Table 1 lists the hardware and software details.


p2pBandwidthLatencyTest

Figure 2: Card-to-card latency with P2P disabled on C4140 Configuration K and M

The p2pBandwidthLatencyTest is a micro-benchmark included in the CUDA SDK. It measures card-to-card latency and bandwidth with and without GPUDirect™ Peer-to-Peer (P2P) enabled. The focus here is on the latency results, because the program does not measure bandwidth with all cards transferring concurrently. The real-world bandwidth available to applications is discussed in the HPL section below. The numbers in Figure 2 are the average of 100 unidirectional card-to-card latency measurements, in microseconds. Each measurement sends one byte from one card to another. The chart uses the P2P-disabled numbers because, with P2P enabled, the data travels over NVLINK instead of PCIe. The PCIe latency of Configuration M is 1.368 us lower than that of Configuration K because of the different PCIe topologies.

High Performance Linpack (HPL)

Figure 3: HPL results on C4140 Configuration K and M: (a) Performance, (b) Average PCIe Bandwidth for each V100 GPU, (c) Power consumption of one HPL run


Figure 3 (a) shows HPL performance on the C4140 platform with 1, 2, 4 and 8 V100-SXM2 GPUs. The 1 to 4 GPU results are from a single C4140; the 8 GPU result is across two servers. In this test, the HPL version used is provided by NVIDIA and is compiled with the recently released CUDA 10 and OpenMPI. The following aspects can be observed from the HPL results:

1) Single node. With all 4 GPUs in the test, Configuration M is ~16% faster than Configuration K. Before the HPL application starts computing, it measures the available device-to-host (D2H) and host-to-device (H2D) PCIe bandwidth for each GPU card while all cards transfer data concurrently. This gives useful insight into the true PCIe bandwidth available to each card when HPL copies the N*N matrix to all GPU memories at the same time. As shown in Figure 3 (b), both the D2H and H2D numbers of Configuration M are much higher and approach the theoretical throughput of PCIe x16. This matches its hardware topology, since each GPU in Configuration M has a dedicated PCIe x16 link to a CPU. In Configuration K, all four V100s have to share a single PCIe x16 link through the PLX PCIe switch, so only 2.5GB/s is available to each of them. Because of this bandwidth difference, Configuration M took 1.33 seconds to copy the four 16GB pieces of the N*N matrix into the GPUs' global memory, one piece per GPU, while Configuration K took 5.33 seconds. The entire HPL run takes around 23 to 25 seconds. Since all V100-SXM2 GPUs are identical, the compute time is the same, so the 4 seconds saved in data copying make Configuration M ~16% faster.

2) Multiple nodes. With 8 GPUs across 2 C4140 nodes, Configuration M shows an HPL improvement of more than 15% over Configuration K. Configuration M therefore scales better across nodes than Configuration K, for the same reason as in the single-node, 4-card case above.

3) Efficiency. Power consumption was measured with iDRAC, and Figure 3 (c) shows the wattage as a time series. Both systems reach around 1850 W at peak. Because of its higher GFLOPS number, Configuration M delivers higher performance per watt as well as higher HPL efficiency.

HPL is a system level benchmark and its outcomes are determined by components like CPU, GPU, memory and PCIe bandwidth. Configuration M has a balanced design across the two CPUs; therefore, it outperforms Configuration K in this HPL benchmark.
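
The single-node arithmetic above can be reproduced in a few lines. The sketch below uses only the copy times (1.33 s and 5.33 s), the run time range (23 to 25 s) and the ~16% figure quoted in this section. The helper names are hypothetical, and treating peak power as a proxy for performance per watt is a simplifying assumption made for illustration.

```python
# Reproduce the HPL data-copy arithmetic quoted in the text.
# The copy times and run-time range come from the article; the helper
# functions are illustrative and not part of HPL.

COPY_TIME_K_S = 5.33   # Configuration K, matrix copy time in seconds
COPY_TIME_M_S = 1.33   # Configuration M, matrix copy time in seconds


def copy_saving(slow_copy_s, fast_copy_s, total_run_s):
    """Return (seconds saved, saving as a fraction of the total run time)."""
    saved_s = slow_copy_s - fast_copy_s
    return saved_s, saved_s / total_run_s


def relative_perf_per_watt(speedup, power_ratio=1.0):
    """Performance-per-watt ratio, given a speedup and a power ratio.

    Assumption: both systems draw about the same power (ratio 1.0),
    based on the similar peak wattage reported for the two systems.
    """
    return speedup / power_ratio


for total_run_s in (23.0, 25.0):
    saved_s, share = copy_saving(COPY_TIME_K_S, COPY_TIME_M_S, total_run_s)
    print(f"run of {total_run_s:.0f} s: {saved_s:.2f} s saved, "
          f"{share:.1%} of the run")

print(f"perf/W ratio at equal power: {relative_perf_per_watt(1.16):.2f}")
```

Under these assumptions, the 4 seconds saved correspond to roughly 16% to 17% of the total run time, which is consistent with the ~16% advantage reported for Configuration M.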

GROMACS

GROMACS is an open-source molecular dynamics application designed to simulate biochemical molecules such as proteins, lipids and nucleic acids that have many complicated bonded interactions. Version 2018.3 was tested on the water 3072 dataset, which has 3 million atoms.

Figure 4: GROMACS Performance results with multiple V100 on C4140 Configuration K and M

Figure 4 shows the performance improvement of Configuration M over K. Single card performance is the same across the two configurations since there is no difference in the data path. With 2 and 4 GPUs, Configuration M is ~5% faster than K. When tested across 2 nodes, Configuration M performs up to 10% better, mainly because the larger number of PCIe connections provides more bandwidth and feeds data to the GPUs faster. GROMACS is greatly accelerated by GPUs, but it uses both CPUs and GPUs for calculation in parallel; therefore, if GROMACS is the top application in a cluster, a powerful CPU is recommended. The graph also shows how GROMACS performance scales with more servers and more GPUs. Performance does increase with more GPUs and more servers, but the increase is less than linear.
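
One common way to quantify "less than linear" scaling is parallel efficiency: the measured speedup divided by the ideal speedup for the number of GPUs. The sketch below is a minimal illustration of that calculation. The throughput values are placeholders invented for the example, not results from this study, and the function name is hypothetical.

```python
# Parallel scaling efficiency: measured speedup / ideal (linear) speedup.
# The throughput numbers below are PLACEHOLDERS for illustration only;
# they are not GROMACS measurements from this article.


def scaling_efficiency(perf_by_gpus):
    """Map GPU count -> efficiency relative to the smallest GPU count.

    perf_by_gpus maps a GPU count to a throughput figure (e.g. ns/day).
    An efficiency of 1.0 means perfectly linear scaling.
    """
    base_gpus = min(perf_by_gpus)
    base_perf = perf_by_gpus[base_gpus]
    return {
        gpus: (perf / base_perf) / (gpus / base_gpus)
        for gpus, perf in sorted(perf_by_gpus.items())
    }


placeholder_ns_per_day = {1: 10.0, 2: 17.0, 4: 28.0, 8: 40.0}

for gpus, eff in scaling_efficiency(placeholder_ns_per_day).items():
    print(f"{gpus} GPU(s): {eff:.0%} of linear scaling")
```

An efficiency that falls as GPUs are added is what sub-linear scaling looks like in numbers. Substituting real throughput figures for each configuration would show how much of the ideal speedup is retained.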

NAnoscale Molecular Dynamics (NAMD)

NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests, the prebuilt binary wasn't used. Instead, NAMD was built from the latest source code (NAMD_Git-2018-10-31_Source) on CUDA 10. Figure 5 plots the performance results using the STMV dataset (1,066,628 atoms, periodic, PME). Tests on smaller datasets like f1atpase (327,506 atoms, periodic, PME) and apoa1 (92,224 atoms, periodic, PME) resulted in similar comparisons between Configuration M and Configuration K but are not presented here for brevity.


Figure 5: NAMD Performance results with multiple V100s on C4140 Configuration K and M

As with GROMACS, the 4 times higher PCIe bandwidth helps NAMD performance. Figure 5 shows that, on the STMV dataset, Configuration M with 2 and 4 cards performs 16% and 30% better than Configuration K, respectively. Single card performance is expected to be the same since, with only one GPU in the test, the PCIe bandwidth is identical.

Conclusions and Future Work

In this blog, the performance of the HPC applications HPL, GROMACS and NAMD was compared across two different NVLINK configurations of the PowerEdge C4140. HPL, GROMACS and NAMD perform ~10% better on Configuration M than on Configuration K. In all of the tests, Configuration M delivers at least the same performance as Configuration K, since it has all the good features of Configuration K plus more PCIe links and no PCIe switches. In the future, additional tests are planned with more applications like RELION, HOOMD and AMBER, as well as tests using the V100 32G GPU.


Quick Tips content is self-published by the Dell Support Professionals who resolve issues daily. In order to achieve a speedy publication, Quick Tips may represent only partial solutions or work-arounds that are still in development or pending further proof of successfully resolving an issue. As such Quick Tips have not been reviewed, validated or approved by Dell and should be used with appropriate caution. Dell shall not be liable for any loss, including but not limited to loss of data, loss of profit or loss of revenue, which customers may incur by following any procedure or advice set out in the Quick Tips.

Article ID: SLN315976

Last Date Modified: 01/17/2019 11:04 AM

