Article written by Garima Kochhar, Deepthi Cherlopalle, and Joshua Weage of the HPC and AI Innovation Lab, September 2018
Summary
The HPC and AI Innovation Lab has a new cluster with 32 AMD EPYC-based systems interconnected with Mellanox EDR InfiniBand. As always, we are conducting performance evaluations on our latest cluster and wanted to share results. This blog covers memory bandwidth results from STREAM, HPL, InfiniBand micro-benchmark performance for latency and bandwidth, and WRF results from its benchmark datasets.
We are interested in real-world HPC application performance on EPYC. If you have datasets you wish to try on EPYC, do get in touch with your Dell account team for access to the Innovation Lab.
AMD EPYC architecture
AMD EPYC processors support eight memory channels, up to 16 dual in-line memory modules (DIMMs) per socket with two DIMMs per channel, and up to 32 cores per socket. Additionally, a platform with AMD CPUs provides up to 128 PCI-E lanes for peripherals like GPUs and NVMe drives.
The CPUs themselves are multi-chip modules assembled from four dies. Each die includes up to eight Zen cores, two DDR4 memory channels and 32 IO lanes. The Zen cores on a die are arranged in two groups of four with each group of four cores, called a core complex, sharing L3 cache. Within a socket, all four dies are cross connected via a coherent interconnect called Infinity Fabric. This is shown in Figure 1.
Figure 1 – EPYC Socket layout. CCX is a core complex of up to 4 cores that share L3 cache. M* are the memory channels, two channels handled by each die. P* and G* are IO lanes. ∞ is the Infinity Fabric.
On a single-socket system, each die provides up to 32 PCI-E lanes using the P* and the G* IO lanes shown in Figure 1. This gives the socket a total of 128 PCI-E lanes as shown in Figure 2. When a CPU is used in a two-socket (2S) configuration, half the IO lanes of each die are used to connect to one of the dies on the other socket by using the G* IO lanes configured as Infinity Fabric. This leaves the socket with the P* IO lanes for a total of 64 PCI-E lanes and, thus, still 128 PCI-E lanes for the platform. This is shown in Figure 3.
Figure 2 - EPYC 1S PCI-E lanes
Figure 3 - EPYC 2S layout
STREAM benchmark performance
As a first step to evaluating EPYC, we measured the memory bandwidth capabilities of the platform using the STREAM benchmark. These tests were conducted on a Dell EMC PowerEdge R7425 server with dual AMD EPYC 7601 processors (32c, 2.2 GHz), 16 x 16GB DIMMs at 2400 MT/s, running Red Hat® Enterprise Linux® 7.5.
The non-uniform memory access (NUMA) presentation of EPYC can be controlled by a BIOS option called "Memory Interleaving" and mapped using Linux utilities such as numactl and lstopo.
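For reference, a quick way to inspect the NUMA layout that the BIOS presents (assuming the numactl and hwloc packages are installed) is:
numactl --hardware            # lists the NUMA nodes, their cores and memory, and the node distance matrix
lstopo-no-graphics --no-io    # shows the cache hierarchy, including which cores share an L3 cache (a core complex)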
The default Memory Interleaving option is "Memory Channel Interleaving". In this mode, the two channels of each die are interleaved. This presents four NUMA nodes per socket and eight NUMA nodes to the operating system on a 2S system.
"Memory Die Interleaving" is an option where memory across all four dies on a socket, i.e. eight memory channels, are interleaved. This presents one NUMA node per socket and two NUMA nodes on a 2S system.
"Memory Socket Interleaving" interleaves memory across both sockets giving one NUMA node on a 2S platform. This would be the equivalent of NUMA disabled.
Using the default configuration of "Memory Channel Interleaving", recall that each socket has four dies, each die provides two memory channels, and the BIOS presents eight NUMA nodes on a 2S platform. Sample numactl output in Figure 4 shows these eight NUMA nodes on a 2S platform, one NUMA node per die.
Figure 4 - numactl output on 2S EPYC
Physically, there are four NUMA distances on the platform as highlighted in Figure 4: to the NUMA node itself (distance "10" in red), to the three nodes that share the same die (distance "16" in blue), to the node on the other socket that is direct connected via an Infinity Fabric link (distance "22" in green), to the three other nodes on the remote socket that are accessed via two hops using the Infinity Fabric between the two sockets plus the internal Infinity Fabric link (distance "28" in black).
Some BIOS implementations and versions may simplify this physical layout and present only three NUMA distances to the operating system. This simplification involves masking the difference in distance between NUMA node 0 (as an example) and NUMA nodes 4, 5, 6, 7 by presenting NUMA nodes 4, 5, 6, 7 as equidistant from NUMA node 0. Such an implementation is shown in Figure 5. The NUMA layout will be a tunable option in the next release of the PowerEdge R7425 BIOS. Simplifying the NUMA node distances does not change the actual physical layout of the cores; it is primarily for the benefit of the OS scheduler. For HPC and MPI jobs that are NUMA aware, these different presentations should be immaterial.
Figure 5 - numactl output on 2S EPYC with simplified NUMA distances
In addition to the 8 NUMA nodes on a two-socket platform, Figure 4 and Figure 5 also show the memory and cores associated with each NUMA node. Each NUMA node has 32GB of memory from two 16GB DIMMs (16 DIMMs in the server, eight per socket, 1 DIMM per channel). Each NUMA node contains the eight cores of the local die. The core enumeration on the Dell EMC platform is round robin, going across all NUMA nodes first and then filling each NUMA node.
Additionally, lstopo output can be used to clearly show which set of four cores make up a core complex. These are the four cores on a die that share L3 cache. For example, Figure 6 shows that NUMA node 0 has eight cores and on this NUMA node cores 0, 16, 32, 48 share L3 cache, and cores 8, 24, 40, 56 share L3 cache.
Figure 6 - lstopo output on 2S EPYC
Figure 7 - AMD EPYC platform memory bandwidth
Keeping this NUMA layout information in mind, the STREAM Triad benchmark results for memory bandwidth are presented in Figure 7 with the BIOS set to "Memory Channel Interleaving". Note that the 16GB 2667 MT/s dual-ranked memory modules used in this test bed run at 2400 MT/s on EPYC. The first set of bars in Figure 7 shows the memory bandwidth of the 2S platform to be 244 GB/s when all cores are used, and 255.5 GB/s when half of the cores are used. The second data point is the memory bandwidth of a single socket, which is about half of the full 2S platform, as expected. The third data point measures the memory bandwidth of a NUMA node, an individual die. Recall that each socket has four dies, and the bandwidth of a die is about one quarter that of the socket. Within a die, there are two core complexes, and using just the cores on one core complex provides ~30 GB/s. When cores are used across both core complexes on a die, the full bandwidth of the die, ~32 GB/s, can be achieved.
The 2S platform memory bandwidth is impressive, 240-260 GB/s, a result of the eight memory channels per socket. Additionally, a single core can provide ~24.5 GB/s of memory bandwidth to local memory, which is great for the single-threaded portion of applications.
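For readers who want to reproduce this type of measurement, the following is a minimal sketch of building and running STREAM with OpenMP; the array size and binding settings are illustrative assumptions, not the exact values used for Figure 7. STREAM_ARRAY_SIZE should be large enough that the three arrays far exceed the total cache in the system.
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 -DNTIMES=10 stream.c -o stream   # ~19 GB of arrays, far larger than the caches
export OMP_NUM_THREADS=64     # all cores of the 2S system
export OMP_PROC_BIND=spread   # spread the threads across all NUMA nodes
./stream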
Looking at the impact of the NUMA layout on remote memory access, Figure 8 plots the relative memory bandwidth when cores access memory that is not in the same NUMA domain. Access to memory on the same socket is ~30% slower, and access to memory on the other socket is ~65% slower. Using STREAM Triad, there seems to be no measurable impact on memory bandwidth when accessing memory on the remote socket over one hop (node 6, one Infinity Fabric link between sockets) or two hops (nodes 4, 5, 7: one Infinity Fabric hop between sockets plus one local Infinity Fabric hop). For memory bandwidth sensitive applications, good memory locality will be important to performance even within the same socket.
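One way to generate this kind of comparison is to pin the STREAM threads to one NUMA node while forcing the memory allocation onto another with numactl. The sketch below assumes the stream binary built above and the node numbering shown in Figure 4, with the threads placed on NUMA node 0:
export OMP_NUM_THREADS=8                       # the 8 cores of one die
numactl --cpunodebind=0 --membind=0 ./stream   # local memory
numactl --cpunodebind=0 --membind=1 ./stream   # another die on the same socket
numactl --cpunodebind=0 --membind=6 ./stream   # remote socket, one Infinity Fabric hop
numactl --cpunodebind=0 --membind=4 ./stream   # remote socket, two hops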
Figure 8 - Impact of remote memory access
HPL benchmark performance
Next, we measured the computational capability of EPYC using the HPL benchmark. EPYC supports AVX instructions and can perform 8 FLOP/cycle. On our platform, we used Open MPI and the BLIS linear algebra library to run HPL.
The theoretical peak performance of our test system (dual EPYC 7601 processors) is 64 cores * 8 FLOP/cycle * 2.2 GHz clock frequency, which gives 1126 GFLOPS. We measured 1133 GFLOPS, which is an efficiency of 100.6%.
We also ran HPL on the EPYC 7551 (32c, 2.0 GHz), EPYC 7351 (16c, 2.4 GHz), and EPYC 7351P (1S, 16c, 2.4GHz). In these tests, measured HPL performance was 102-106% of theoretical performance.
The efficiency is over 100% because EPYC is able to sustain turbo frequencies above the base frequency for the duration of the HPL test.
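For reference, a minimal sketch of a single-node HPL run on this system is shown below, assuming the xhpl binary was built against BLIS and Open MPI and that HPL.dat (problem size N, block size NB, and the P x Q process grid) has already been tuned for 64 ranks:
mpirun -np 64 --map-by core --bind-to core ./xhpl   # one rank per core on the dual EPYC 7601 node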
InfiniBand latency and bandwidth
We then verified the InfiniBand latency and bandwidth micro-benchmark results between two servers. The configuration used for these tests is described in Table 1. Latency and bandwidth results are in Figure 9 and Figure 10.
Table 1 - InfiniBand test bed
Component | Details
Server | Dell EMC PowerEdge R7425
Processor | Dual AMD EPYC 7601 32-Core Processor @ 2.2GHz
Memory | 16 x 16GB DIMMs @ 2400 MT/s
System Profile | CPU power management set to maximum, C-states disabled or enabled as noted, Turbo enabled
OS | Red Hat Enterprise Linux 7.5
Kernel | 3.10.0-862.el7.x86_64
OFED | 4.4-1.0.0
HCA Card | Mellanox ConnectX-5
OSU version | 5.4.2
MPI | hpcx-2.2.0
Figure 9 – InfiniBand latency, with switch
Run Command: mpirun -np 2 --allow-run-as-root -host node1,node2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 -report-bindings --bind-to core --map-by dist:span -mca rmaps_dist_device mlx5_0 numactl --cpunodebind=6 osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_latency
Care was taken to pin the MPI process to the NUMA node closest to the HCA. This information is available in the lstopo output, and in our case it was NUMA node 6. Latency tests were run with both Open MPI and HPC-X. With Open MPI and MXM acceleration we measured a latency of 1.17µs; with Open MPI and UCX we measured a latency of 1.10µs. The latency results obtained with HPC-X are presented here.
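One quick way to confirm which NUMA node is local to the HCA, assuming a Mellanox mlx5 device and a standard sysfs layout, is to read the device's numa_node attribute; on this test bed it corresponds to NUMA node 6:
cat /sys/class/infiniband/mlx5_0/device/numa_node   # prints the NUMA node the HCA is attached to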
From Figure 9, the latency on EPYC processors with C-states enabled is 1.07µs, and the latency for all message sizes is ~2 to 9% better with C-states enabled than with C-states disabled. Enabling C-states allows idle cores to enter deeper C-states, which permits higher turbo frequencies on the active cores and results in lower latency.
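Whether deeper C-states are actually available to the cores can be verified from the operating system as well as in the BIOS, for example with the cpupower utility (assuming the kernel tools package is installed):
cpupower idle-info   # lists the CPUidle driver and the idle states that are currently enabled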
Bandwidth results are presented in Figure 10. We measured 12.4 GB/s uni-directional bandwidth and 24.7 GB/s bi-directional bandwidth. These results are as expected for EDR InfiniBand.
Figure 10 - InfiniBand bandwidth
Run Command:
mpirun -np 2 --allow-run-as-root -host node208,node209 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --bind-to core -mca rmaps_dist_device mlx5_0 --report-bindings --display-map numactl --cpunodebind=6 osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_bibw
Table 2 - osu_mbw_mr results – single NUMA Node
Socket | NUMA Node (NN) | Test Configuration | Num of cores in test per server | Bandwidth (GB/s)
0 | 0 | server1 NN0 - server2 NN0 | 8 | 6.9
0 | 1 | server1 NN1 - server2 NN1 | 8 | 6.8
0 | 2 | server1 NN2 - server2 NN2 | 8 | 6.8
0 | 3 | server1 NN3 - server2 NN3 | 8 | 6.8
1 | 4 | server1 NN4 - server2 NN4 | 8 | 12.1
1 | 5 | server1 NN5 - server2 NN5 | 8 | 12.2
1 | 6 (local to HCA) | server1 NN6 - server2 NN6 | 8 | 12.3
1 | 7 | server1 NN7 - server2 NN7 | 8 | 12.1
Run Command:
mpirun -np 16 --allow-run-as-root -host server1,server2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings --bind-to core -mca rmaps_dist_device mlx5_0 numactl --cpunodebind=<numanode number> osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
The NUMA layout described in Figure 3 and Figure 6 prompted us to check the impact of process locality on bandwidth. For this test we used the osu_mbw_mr benchmark, which measures aggregate uni-directional bandwidth between multiple pairs of processes. The objective of this test is to determine the achieved bandwidth and message rate between individual NUMA nodes using all eight cores on the NUMA node. The results of this test are presented in Table 2. The test configuration used the Performance profile (C-states disabled and Turbo enabled).
The results show us that when processes are running on the NUMA node that is connected to the InfiniBand HCA (NUMA node 6), the aggregate bandwidth is 12.3 GB/s. When processes are running on any of the three other NUMA nodes on the same socket as the HCA (socket 1), the aggregate bandwidth is about the same, ~12.1 GB/s. When processes are run on the NUMA nodes in the socket that is remote to the HCA, the aggregate bandwidth drops to ~6.8 GB/s.
The next set of results shown in Table 3 demonstrates the uni-directional bandwidth between individual sockets. For this test, all 32 cores available in the socket were used. We measured 5.1 GB/s when running on the socket local to the HCA, and 2.4 GB/s when running on the socket remote to the HCA. Using all 64 cores in our test servers, we measured 3.0 GB/s when running 64 processes per server.
To double check this last result, we ran a test using all 8 NUMA nodes across both sockets with each NUMA node running 2 processes giving a total of 16 processes per server. With this layout we measured 2.9 GB/s as well.
These results indicate that the topology of the system has an effect on communication performance. This is of interest for cases where an all-to-all communication pattern with multiple processes communicating across servers is an important factor. For other applications, it is possible that the reduced bandwidth measured when running processes on multiple NUMA domains may not be a factor that influences application-level performance.
Table 3 - osu_mbw_mr results – sockets and system level
Socket | NUMA nodes | Test Configuration | Num of cores in test per server | Bandwidth (GB/s)
0 | 0, 1, 2, 3 | server1 Socket0 - server2 Socket0 | 32 | 2.4
1 | 4, 5, 6 (local to HCA), 7 | server1 Socket1 - server2 Socket1 | 32 | 5.1
Run Command:
mpirun -np 64 --allow-run-as-root -rf rankfile -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
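The rankfile itself is not shown above. As an illustrative sketch only (the hostnames and the socket-relative core numbering are assumptions; the actual mapping should be checked against the lstopo output), an Open MPI rankfile for the socket 1 test that places the first 32 ranks on socket 1 of server1 and the next 32 ranks on socket 1 of server2, so that osu_mbw_mr pairs ranks across the two servers, could start like this:
rank 0=server1 slot=1:0
rank 1=server1 slot=1:1
rank 2=server1 slot=1:2
rank 32=server2 slot=1:0
rank 33=server2 slot=1:1
and so on, with one line per rank through rank 31 on server1 and ranks 32-63 on server2.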
Socket | NUMA nodes | Test configuration | Num of cores in test per server | Bandwidth (GB/s)
0, 1 | 0, 1, 2, 3, 4, 5, 6 (local to HCA), 7 | server1 - server2 | 64 | 3.0
Run Command:
mpirun -np 128 --allow-run-as-root -rf rankfile -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
Socket | NUMA nodes | Test configuration | Num of cores in test per server | Bandwidth (GB/s)
0, 1 | 0, 1, 2, 3, 4, 5, 6 (local to HCA), 7 with 2 cores per NUMA node | server1 - server2 | 16 | 2.9
Run Command:
mpirun -np 32 --allow-run-as-root -rf rankfile -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x -mca coll_fca_enable 0 -mca coll_hcoll_enable 0 -mca btl_openib_if_include mlx5_0:1 --report-bindings osu-micro-benchmarks-5.4.3/mpi/pt2pt/osu_mbw_mr
HPL cluster-level performance
With our InfiniBand fabric performance validated, the next test was a quick run of HPL across the cluster. These tests were performed on 16 nodes with dual-socket EPYC 7601 processors. Results are in Figure 11 and show the expected HPL scalability across the 16 systems.
Figure 11 - HPL across 16 servers
WRF cluster-level performance
Finally, we ran WRF, a weather forecasting application. The test bed was the same as before: 16 nodes with dual-socket EPYC 7601 processors. In addition, we ran some tests on a smaller 4-node system with dual-socket EPYC 7551 processors. All servers had 16 x 16GB RDIMMs running at 2400 MT/s and were interconnected with Mellanox EDR InfiniBand.
Figure 12 - WRF conus 12km, single node
We used WRF v3.8.1 and v3.9.1 and tested the conus 12km and conus 2.5km datasets. We compiled WRF and netCDF using Intel compilers and ran with Intel MPI. We tried different process and tiling schemes, using both the dmpar and the dm+sm (with OpenMP) configuration options.
We are working with AMD to determine other compiler tuning options for WRF.
There was no measured performance difference between WRF v3.8.1 and v3.9.1. Comparing dmpar and dm+sm, a judicious combination of processes and tiles resulted in about the same performance. This is shown in Figure 12.
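As an illustrative sketch only (the rank, thread, and tile counts are assumptions for a single 64-core node rather than the exact settings behind Figure 12), a dm+sm run with Intel MPI could look like the following, with the number of tiles controlled by the numtiles entry in the &domains section of namelist.input:
export OMP_NUM_THREADS=4      # OpenMP threads per MPI rank
export I_MPI_PIN_DOMAIN=omp   # give each rank a pinning domain of OMP_NUM_THREADS cores
mpirun -np 16 -ppn 16 ./wrf.exe
A dmpar run would instead start one MPI rank per core with no OpenMP threading.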
Figure 13 - WRF conus 12km, cluster tests
Figure 14 - WRF conus 2.5km, cluster tests
Cluster-level tests were conducted with WRF v3.8.1 and the dmpar configuration, using all the cores and 8 tiles per run.
Conus 12km is a smaller dataset, and performance plateaued after 8 nodes (512 cores) on EPYC. This is shown in Figure 13. The EPYC 7551 and EPYC 7601 are both 32-core processors; the 7601's base clock frequency is 10% faster and its all-core turbo frequency is 6% faster than the 7551's. On the WRF conus 12km tests, the EPYC 7601 system was 3% faster than the 7551 system in 1, 2, and 4 node tests.
Conus 2.5km is a larger benchmark dataset; relative to one EPYC system, performance scales well up to 8 nodes (512 cores) and then starts to decline. With conus 2.5km as well, the EPYC 7601 system performs 2-3% faster than the EPYC 7551 system in 1, 2, and 4 node tests, as shown in Figure 14.
Conclusion and next steps
EPYC provides good memory bandwidth and core density per socket. From an HPC standpoint, we expect applications that can make use of the available memory bandwidth and CPU cores to take the most advantage of the EPYC architecture. EPYC today does not support AVX-512, nor does it execute AVX2 in a single cycle, so codes that are highly vectorized and can use AVX2 or AVX-512 efficiently may not be ideal for EPYC.
Use cases that can utilize multiple NVMe drives may also benefit from the direct-attached NVMe made possible by the number of PCI-E lanes on EPYC.
Our next steps include further performance tests with additional HPC applications.