
HPC applications performance with Turing

Summary: Article written by Frank Han, Rengan Xu, Deepthi Cherlopalle and Quy Ta of Dell EMC HPC and AI Innovation Lab in March 2019


Table of Contents:

  1. Abstract
  2. Overview
  3. HOOMD-blue
  4. Amber
  5. NAnoscale Molecular Dynamics (NAMD)
  6. High Performance Linpack (HPL)
  7. Conclusions and future work

 

Abstract

 

As the successor to the Volta architecture, Turing™ is NVIDIA®’s latest family of GPUs. Turing™ GPUs are available in the GeForce® line, where they render highly realistic games, and in the Quadro® line, where they accelerate content-creation workflows. The NVIDIA® Tesla® series is designed for artificial intelligence and high performance computing (HPC) workloads in data centers. The NVIDIA® Tesla® T4 is currently the only server-grade GPU with the Turing™ micro-architecture on the market, and it is supported by the Dell EMC PowerEdge R640, R740, R740xd and R7425 servers. This blog discusses the performance of the new Tesla T4 compared with the latest Volta V100-PCIe on the PowerEdge R740 server for several HPC applications: HOOMD-blue, Amber, NAMD and HPL.



Overview

 

The PowerEdge R740 server is a 2U, Intel® Skylake-based rack server that provides an ideal balance of storage, I/O and accelerator support. It supports up to four* single-slot T4 GPUs or three dual-slot V100-PCIe GPUs in x16 PCIe 3.0 slots. Table 1 notes the differences between a single T4 and a single V100. The Volta™ V100 is available in 16 GB and 32 GB memory configurations; since the T4 is only available with 16 GB, the 16 GB V100 card was used to provide comparable results. Table 2 lists the hardware and software details of the test bed.

Table 1: Comparison between the Tesla V100-PCIe and the Tesla T4

                                Tesla V100-PCIe     Tesla T4
  Architecture                  Volta               Turing
  CUDA cores                    5120                2560
  Tensor cores                  640                 320
  Compute capability            7.0                 7.5
  GPU clock                     1245 MHz            585 MHz
  Boost clock                   1380 MHz            1590 MHz
  Memory type                   HBM2                GDDR6
  Memory bus width              4096-bit            256-bit
  Memory bandwidth              900 GB/s            320 GB/s
  Slot width                    Dual-slot           Single-slot
  FP32 single precision         14 TFLOPS           8.1 TFLOPS
  Mixed precision (FP16/FP32)   112 TFLOPS          65 TFLOPS
  FP64 double precision         7 TFLOPS            254.4 GFLOPS
  TDP                           250 W               70 W

 

Table 2: Details of the R740 configuration and software versions

  Processor                   2x Intel® Xeon® Gold 6136 @ 3.0 GHz, 12 cores each
  Memory                      384 GB (12x 32 GB @ 2666 MT/s)
  Local disk                  480 GB SSD
  Operating system            Red Hat Enterprise Linux Server release 7.5
  GPU                         3x V100-PCIe 16 GB or 4x T4 16 GB
  CUDA driver                 410.66
  CUDA toolkit                10.0
  Logical processors (BIOS)   Disabled
  System profile              Performance
  HPL                         Compiled with CUDA 10.0
  NAMD                        NAMD_Git-2019-02-11
  Amber                       18.12
  HOOMD-blue                  v2.5.0
  OpenMPI                     4.0.0
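Before benchmarking, it is worth confirming which accelerators the server actually exposes to the operating system. The short Python sketch below is our own addition, not part of the original study; it only assumes that the NVIDIA driver from Table 2 is installed and that nvidia-smi is on the PATH.

```python
import subprocess

# Query the installed GPUs for model, memory size, and power limit.
# The field names follow nvidia-smi's --query-gpu interface.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.total,power.limit",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)

for line in result.stdout.strip().splitlines():
    idx, name, mem, power = [field.strip() for field in line.split(",")]
    print(f"GPU {idx}: {name}, {mem}, power cap {power}")

# On the test bed in Table 2, this should list either 3x V100-PCIe 16 GB
# or 4x Tesla T4 devices.
```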

 



HOOMD-blue

 


Figure 1: HOOMD-blue single and double precision performance results with V100s and T4s on the PowerEdge R740 server

HOOMD-blue (Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. By default, HOOMD-blue is compiled in double precision (FP64); version 2.5 provides a build option, SINGLE_PRECISION=ON, to compile it in single precision (FP32) instead. Figure 1 shows the microsphere dataset results for single and double precision. The x-axis is the number of GPUs and the performance metric is the number of hours needed to run 10e6 steps.
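HOOMD-blue runs are driven from Python scripts. The sketch below is only a minimal, generic Lennard-Jones example for the 2.x API used here (v2.5.0), not the microsphere benchmark from Figure 1, and the lattice size and potential parameters are illustrative. It shows how a run is forced onto the GPU and how time per step can be measured; the FP32 versus FP64 choice is made when HOOMD-blue is built (the SINGLE_PRECISION=ON option mentioned above), not in the run script.

```python
import time
import hoomd
import hoomd.md

# Run on the GPU; use "--mode=cpu" to compare against the host CPU.
hoomd.context.initialize("--mode=gpu")

# A small, generic Lennard-Jones system (the microsphere benchmark is far larger).
system = hoomd.init.create_lattice(unitcell=hoomd.lattice.sc(a=1.2), n=32)
nl = hoomd.md.nlist.cell()
lj = hoomd.md.pair.lj(r_cut=2.5, nlist=nl)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

hoomd.md.integrate.mode_standard(dt=0.005)
hoomd.md.integrate.langevin(group=hoomd.group.all(), kT=1.0, seed=42)

# Time a fixed number of steps; a metric such as hours per 10e6 steps
# can be extrapolated from the measured time per step.
start = time.time()
hoomd.run(10000)
elapsed = time.time() - start
print(f"{elapsed / 10000:.6f} s/step")
```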

  1. One observation is that the FP64 performance of the T4 is relatively low. This is a hardware limitation: in theory the T4 delivers 254.4 GFLOPS of peak double-precision performance (refer to Table 1), whereas the V100's peak is roughly 27x higher (the ratios are reproduced in the short calculation after this list). However, applications like HOOMD-blue that can be compiled and run in single precision can avoid much of this penalty with the FP32 build option. The HOOMD-blue community has considered our suggestion to support mixed precision in all HOOMD-blue modules; once that effort is complete, HOOMD-blue will be able to take better advantage of hardware with strong mixed-precision support.

  2. Comparing single-precision performance, the V100 is about 3x faster than the T4. This is expected given the T4's lower CUDA core count and power rating.

  3. The GPUs in the PowerEdge R740 server are connected through PCIe. For the three-V100 data point, peer-to-peer communication saturates the PCIe bus, which limits scaling and results in roughly the same performance as a single GPU.
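As a quick sanity check on the ratios quoted above (ours, derived only from the specifications in Table 1, not an additional measurement):

```python
# Peak rates and TDP from Table 1.
v100 = {"fp64_tflops": 7.0,    "fp32_tflops": 14.0, "tdp_w": 250}
t4   = {"fp64_tflops": 0.2544, "fp32_tflops": 8.1,  "tdp_w": 70}

print(f"FP64 ratio (V100/T4): {v100['fp64_tflops'] / t4['fp64_tflops']:.1f}x")  # ~27.5x
print(f"FP32 ratio (V100/T4): {v100['fp32_tflops'] / t4['fp32_tflops']:.2f}x")  # ~1.73x
print(f"TDP ratio  (T4/V100): {t4['tdp_w'] / v100['tdp_w']:.0%}")               # 28%
```

Note that the measured single-precision gap (about 3x) is larger than the theoretical 1.73x FP32 ratio; this likely reflects the other differences listed in Table 1, such as the much higher memory bandwidth of the V100.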

 



Amber

 


Amber is the collective name for a suite of programs that allows users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields implemented in this suite. Amber version 18.12 with AmberTools 18.13 was tested with the Amber 18 Benchmark Suite, which includes the JAC, Cellulose, FactorIX, STMV, TRPCage, myoglobin and nucleosome datasets.


Figure 2: Amber Explicit Solvent results with V100s and T4s on the PowerEdge R740 server


Figure 3: Amber Implicit Solvent results with V100s and T4s on the PowerEdge R740 server 

Figure 2 and Figure 3 show the single-card and whole-system performance numbers for the explicit solvent and implicit solvent benchmarks, respectively. The "system" data points represent the aggregate throughput of all GPUs in the server. The PowerEdge R740 supports three V100s or four T4s, so the red and blue "system" bars are the results with three V100s or four T4s.

Aggregate numbers across multiple GPU cards are reported because, for Amber, Pascal and later GPUs do not scale beyond a single accelerator; users generally run multiple independent simulations in parallel, one per GPU (see the sketch below). For a large dataset like STMV (1,067,095 atoms), a single T4 delivers 33 percent, and the whole system 44 percent, of the corresponding V100 capability. A dataset like TRPCage (only 304 atoms) is too small to make effective use of a V100, so the V100 is not much faster than the T4 there, unlike the larger PME runs. According to results published on Amber's official website, almost all GPU runs are three to four times faster than CPU-only runs, so a T4 card is a good option for a server handling small datasets.
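A common way to obtain this aggregate throughput is to pin one pmemd.cuda process to each GPU through CUDA_VISIBLE_DEVICES. The sketch below illustrates the idea with placeholder input file names; it is not the exact launch procedure used for Figures 2 and 3.

```python
import os
import subprocess

NUM_GPUS = 4  # 4x T4 in this test bed; use 3 for the V100 configuration

procs = []
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # each process sees exactly one GPU
    procs.append(subprocess.Popen(
        ["pmemd.cuda", "-O",
         "-i", "mdin",                  # placeholder benchmark input
         "-p", "prmtop",                # placeholder topology
         "-c", "inpcrd",                # placeholder coordinates
         "-o", f"mdout.gpu{gpu}"],
        env=env,
    ))

# Wait for all runs; the aggregate ns/day is the sum of the per-GPU results.
for p in procs:
    p.wait()
```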



NAnoscale Molecular Dynamics (NAMD)

 


Figure 4: NAMD performance results with V100s and T4s on the PowerEdge R740 server

NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests, the prebuilt binary was not used; instead, NAMD was built from the latest source code (NAMD_Git-2019-02-11) with CUDA 10.0. For best performance, NAMD was compiled with the Intel® compiler and libraries (version 2018u3). Figure 4 plots the performance results with the STMV dataset (1,066,628 atoms, periodic, PME). NAMD does not scale beyond one V100 card, but it scales well up to three T4 cards. A single T4 delivers 42 percent of the V100's performance, a respectable result considering the T4 has only 28 percent of the V100's TDP. The T4 could therefore be a good choice for data centers with limited power and cooling capacity.
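The efficiency angle can be made explicit with a short calculation of our own, using only the 42 percent figure above and the TDPs from Table 1:

```python
# Relative NAMD throughput and power, T4 vs V100, from the text and Table 1.
relative_performance = 0.42   # single T4 vs single V100 on STMV
relative_tdp = 70 / 250       # 0.28

perf_per_watt_ratio = relative_performance / relative_tdp
print(f"T4 delivers ~{perf_per_watt_ratio:.1f}x the NAMD throughput per watt of a V100")
# -> ~1.5x
```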



High Performance Linpack (HPL)

 


Figure 5: HPL results with V100s and T4s on the PowerEdge R740 server

Figure 5 shows HPL performance on the PowerEdge R740 with multiple V100 or T4 GPUs. As expected, HPL scales well with multiple GPUs for both V100 and T4. However, T4 performance is significantly lower than V100 performance because of the T4's limited double-precision capability, so the Volta V100 remains the best choice for double-precision applications such as HPL.
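The gap in Figure 5 is consistent with the theoretical double-precision peaks of the two configurations. A rough estimate of ours (GPU contribution only, ignoring the host CPUs and any HPL efficiency factor), based on Table 1 and Table 2:

```python
# Peak FP64 throughput of the GPU complement in each R740 configuration (Tables 1 and 2).
v100_fp64_tflops = 7.0
t4_fp64_tflops = 0.2544

rpeak_3x_v100 = 3 * v100_fp64_tflops   # 21.0 TFLOPS
rpeak_4x_t4 = 4 * t4_fp64_tflops       # ~1.02 TFLOPS

print(f"3x V100: {rpeak_3x_v100:.1f} TFLOPS peak FP64")
print(f"4x T4:   {rpeak_4x_t4:.2f} TFLOPS peak FP64")
print(f"Ratio:   ~{rpeak_3x_v100 / rpeak_4x_t4:.0f}x in favor of the V100 configuration")
```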



Conclusions and future work

 

In this blog, HPC application performance with HOOMD-blue, Amber, NAMD and HPL was compared between the V100 and the T4 on the Dell EMC PowerEdge R740. The T4 is not only useful for deep learning inference; it is also beneficial for HPC applications with single- or mixed-precision support. Its low TDP makes it attractive for traditional data centers where power and cooling capacity is limited, and its single-slot PCIe form factor makes it a good fit for more general-purpose PowerEdge servers. In the future, additional tests are planned with more applications such as RELION, GROMACS and LAMMPS, as well as with applications that can leverage mixed precision.

*Disclaimer: For benchmarking purposes, four T4 GPUs in the Dell PowerEdge R740 were evaluated. Currently, the PowerEdge R740 officially supports a maximum of three T4 GPUs in x16 PCIe slots.




Article Properties

Affected Product: High Performance Computing Solution Resources, PowerEdge R740
Last Published Date: 28 Sep 2021
Version: 4
Article Type: Solution