HPC applications performance with Turing


Article written by Frank Han, Rengan Xu, Deepthi Cherlopalle and Quy Ta of Dell EMC HPC and AI Innovation Lab in March 2019

Table of Contents:

  1. Abstract
  2. Overview
  3. HOOMD-blue
  4. Amber
  5. NAnoscale Molecular Dynamics (NAMD)
  6. High Performance Linpack (HPL)
  7. Conclusions and future work

Abstract

As the successor to the Volta architecture, Turing™ is NVIDIA®’s latest family of GPUs. Turing™ GPUs are available in the GeForce® line, where they render highly realistic games, and in the Quadro® line, where they accelerate content creation workflows. The NVIDIA® Tesla® series is designed for artificial intelligence and high performance computing (HPC) workloads in data centers. The NVIDIA® Tesla® T4 is currently the only server-grade GPU based on the Turing™ microarchitecture, and it is supported by the Dell EMC PowerEdge R640, R740, R740xd and R7425 servers. This blog compares the performance of the new Tesla T4 with that of the latest Volta V100-PCIe on the PowerEdge R740 server for several HPC applications: HOOMD-blue, Amber, NAMD and HPL.




Overview

The PowerEdge R740 server is a 2U Intel® Skylake-based rack-mount server that provides an ideal balance of storage, I/O and accelerator support. It supports up to four* single-slot T4 GPUs or three dual-slot V100-PCIe GPUs in x16 PCIe 3.0 slots. Table 1 lists the differences between a single T4 and a single V100. The Volta™ V100 is available in 16 GB and 32 GB memory configurations. Since the T4 is only available in a 16 GB version, the V100 card with 16 GB of memory was used to provide comparable performance results. Table 2 lists the hardware and software details of the test bed.

Table 1: Comparison between the T4 and the V100

| Specification | Tesla V100-PCIe | Tesla T4 |
|---|---|---|
| Architecture | Volta | Turing |
| CUDA cores | 5120 | 2560 |
| Tensor cores | 640 | 320 |
| Compute capability | 7.0 | 7.5 |
| GPU clock | 1245 MHz | 585 MHz |
| Boost clock | 1380 MHz | 1590 MHz |
| Memory type | HBM2 | GDDR6 |
| Memory bus | 4096-bit | 256-bit |
| Bandwidth | 900 GB/s | 320 GB/s |
| Slot width | Dual-slot | Single-slot |
| FP32 single-precision | 14 TFLOPS | 8.1 TFLOPS |
| Mixed-precision (FP16/FP32) | 112 TFLOPS | 65 TFLOPS |
| FP64 double-precision | 7 TFLOPS | 254.4 GFLOPS |
| TDP | 250 W | 70 W |
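The FP32 and FP64 peak figures in Table 1 can be reproduced, approximately, from the core counts and boost clocks in the same table. The following is a minimal sketch rather than an official formula. It assumes that each CUDA core completes one fused multiply-add (counted as two floating-point operations) per clock cycle, and that FP64 runs at 1/2 of the FP32 rate on the V100 and at 1/32 of the FP32 rate on the T4. Neither assumption is stated in this article, and the function and variable names are illustrative only.

```python
# Sketch: derive the Table 1 peak-throughput rows from core count and boost clock.
# Assumption (not from the article): one fused multiply-add per CUDA core per
# clock cycle, counted as 2 floating-point operations.
# Assumption (not from the article): FP64 rate is 1/2 of FP32 on the V100 and
# 1/32 of FP32 on the T4.
FLOPS_PER_CORE_PER_CYCLE = 2

GPUS = {
    "Tesla V100-PCIe": {"cuda_cores": 5120, "boost_mhz": 1380, "fp64_rate": 1 / 2},
    "Tesla T4": {"cuda_cores": 2560, "boost_mhz": 1590, "fp64_rate": 1 / 32},
}


def peak_fp32_tflops(cuda_cores, boost_mhz):
    """Theoretical FP32 peak in TFLOPS at the boost clock."""
    flops = cuda_cores * boost_mhz * 1e6 * FLOPS_PER_CORE_PER_CYCLE
    return flops / 1e12


def peak_fp64_tflops(cuda_cores, boost_mhz, fp64_rate):
    """Theoretical FP64 peak in TFLOPS, scaled from the FP32 peak."""
    return peak_fp32_tflops(cuda_cores, boost_mhz) * fp64_rate


for name, spec in GPUS.items():
    fp32 = peak_fp32_tflops(spec["cuda_cores"], spec["boost_mhz"])
    fp64 = peak_fp64_tflops(spec["cuda_cores"], spec["boost_mhz"], spec["fp64_rate"])
    print(f"{name}: FP32 {fp32:.2f} TFLOPS, FP64 {fp64 * 1000:.1f} GFLOPS")
```

Under these assumptions, the sketch gives about 14.13 TFLOPS (FP32) and 7.07 TFLOPS (FP64) for the V100, and about 8.14 TFLOPS (FP32) and 254.4 GFLOPS (FP64) for the T4, which agrees with the rounded values in Table 1.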

Table 2: Details of the R740 configuration and software versions

| Component | Details |
|---|---|
| Processor | 2x Intel® Xeon® Gold 6136 @ 3.0 GHz, 12 cores |
| Memory | 384 GB (12 x 32 GB @ 2666 MHz) |
| Local disk | 480 GB SSD |
| Operating system | Red Hat Enterprise Linux Server release 7.5 |
| GPU | 3x V100-PCIe 16 GB or 4x T4 16 GB |
| CUDA driver | 410.66 |
| CUDA toolkit | 10.0 |
| Processor settings > Logical processors | Disabled |
| System profiles | Performance |
| HPL | Compiled with CUDA 10.0 |
| NAMD | NAMD_Git-2019-02-11 |
| Amber | 18.12 |
| HOOMD-blue | v2.5.0 |
| OpenMPI | 4.0.0 |




HOOMD-blue

Figure 1: HOOMD-blue single and double precision performance results with V100s and T4s on the PowerEdge R740 server

HOOMD-blue (Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. By default, HOOMD-blue is compiled in double precision (FP64); version 2.5 provides the build parameter SINGLE_PRECISION=ON to compile it in single precision (FP32) instead. Figure 1 shows the microsphere dataset results for single and double precision. The x-axis is the number of GPUs, and the performance metric is the number of hours needed to run 10e6 steps.

  1. One observation is that the FP64 performance of the T4 is relatively low. This is a hardware limitation: in theory, the T4 delivers a peak of 254 GFLOPS in double precision (see Table 1), whereas the V100 peak is about 27x higher. However, applications like HOOMD-blue that can be compiled and run in single precision gain a performance advantage from the FP32 build option. The HOOMD-blue community has considered our suggestion to support mixed precision in all HOOMD-blue modules. Once that effort is complete, HOOMD-blue will be able to make better use of hardware with mixed-precision support.

  2. Comparing single-precision performance, the V100 is 3x faster than the T4. This is expected given the T4's lower CUDA core count and power rating.

  3. GPUs in the PowerEdge R740 server are connected through PCIe. For the three-V100 data point, peer-to-peer communication saturates the PCIe bus. This limits overall performance, so three GPUs perform the same as one GPU.
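The ratios behind these observations can be checked directly against Table 1. The sketch below simply divides each V100 value by the corresponding T4 value; the dictionary layout and the function name are illustrative and not part of any benchmark tooling.

```python
# Sketch: V100-to-T4 ratios computed from the values listed in Table 1.
# Each entry is (Tesla V100-PCIe value, Tesla T4 value) in the unit shown.
TABLE_1 = {
    "CUDA cores": (5120, 2560),
    "FP32 peak (TFLOPS)": (14.0, 8.1),
    "FP64 peak (TFLOPS)": (7.0, 0.2544),  # 254.4 GFLOPS expressed in TFLOPS
    "Memory bandwidth (GB/s)": (900, 320),
    "TDP (W)": (250, 70),
}


def v100_over_t4(table):
    """Return the V100/T4 ratio for every row of the table."""
    return {row: v100 / t4 for row, (v100, t4) in table.items()}


ratios = v100_over_t4(TABLE_1)
for row, ratio in ratios.items():
    print(f"{row}: {ratio:.2f}x")
```

The FP64 ratio works out to roughly 27.5x, which matches the "about 27x" figure in the first observation. The peak FP32 ratio is about 1.7x and the TDP ratio is about 3.6x, so the measured 3x single-precision gap falls between the two.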




Amber


Amber is the collective name for a suite of programs that allows users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields that are implemented in this suite. Amber version 18.12 with AmberTools 18.13 was tested with the Amber 18 Benchmark Suite, which includes the JAC, Cellulose, FactorIX, STMV, TRPCage, myoglobin and nucleosome datasets.

Figure 2: Amber Explicit Solvent results with V100s and T4s on the PowerEdge R740 server

Figure 3: Amber Implicit Solvent results with V100s and T4s on the PowerEdge R740 server

Figure 2 and Figure 3 show the single-card and whole-system performance numbers for explicit solvent and implicit solvent, respectively. The "system" data point in these graphs represents the aggregate throughput of all GPUs in the full system. The PowerEdge R740 server supports three V100s or four T4s, so the red and blue "system" bars show the results with three V100s or four T4s.

Aggregate throughput is reported for multiple GPUs because Amber does not scale beyond a single accelerator on Pascal and later GPUs. Users generally run multiple simulations in parallel on the other GPUs instead. With a large dataset like STMV (1,067,095 atoms), a single T4 delivers 33 percent of a single V100's performance, and the whole T4 system delivers 44 percent of the whole V100 system's performance. A dataset like TRPCage (only 304 atoms) is too small to make effective use of a V100, so the V100's advantage over the T4 is smaller than it is for the larger PME runs. According to the results on Amber’s official website, almost all GPU numbers are three to four times faster than CPU-only runs, so a T4 card is a good option for a server that handles small datasets.
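The single-card and whole-system STMV percentages are consistent with each other if aggregate throughput grows linearly with the number of GPUs. The sketch below illustrates that arithmetic under the assumption that every GPU runs its own independent simulation at the single-card rate; the function name and default GPU counts are illustrative.

```python
# Sketch: relate the single-card T4/V100 ratio to the whole-system ratio.
# Assumption: each GPU runs an independent simulation, so aggregate
# throughput is the single-card throughput multiplied by the GPU count.
def system_ratio(single_card_ratio, n_t4=4, n_v100=3):
    """T4-system throughput as a fraction of V100-system throughput."""
    return single_card_ratio * n_t4 / n_v100


stmv_single = 0.33  # single T4 relative to single V100 on STMV
stmv_system = system_ratio(stmv_single)
print(f"Four T4s vs three V100s on STMV: {stmv_system:.0%}")
```

With a single-card ratio of 33 percent, four T4s against three V100s gives 0.33 x 4 / 3 = 44 percent, which matches the whole-system figure quoted above.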




NAnoscale Molecular Dynamics (NAMD)

Figure 4: NAMD performance results with V100s and T4s on the PowerEdge R740 server

NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests, the prebuilt binary was not used. Instead, NAMD was built from the latest source code (NAMD_Git-2019-02-11) with CUDA 10.0. For best performance, NAMD was compiled with the Intel® compiler and libraries (version 2018u3). Figure 4 plots the performance results for the STMV dataset (1,066,628 atoms, periodic, PME). NAMD does not scale beyond one V100 card, but it scales well with three T4 cards. A single T4 GPU delivers 42 percent of a V100’s performance. This is a decent result considering that the T4 has only 28 percent of the V100’s TDP. The T4 could be a good choice for data centers with limited power and cooling capacity.
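The performance and TDP percentages above can be combined into a rough performance-per-watt comparison. The sketch below uses TDP from Table 1 as a stand-in for power draw, which is an assumption: actual power consumption was not measured in this article, so the result is only an upper-level estimate. The function name is illustrative.

```python
# Sketch: relative performance per watt of a T4 versus a V100 for NAMD STMV.
# Assumption: TDP (Table 1) is used as a proxy for real power draw, which
# was not measured in this study.
V100_TDP_W = 250
T4_TDP_W = 70


def relative_perf_per_watt(relative_perf, tdp_w, baseline_tdp_w):
    """Performance per watt relative to the baseline GPU (baseline = 1.0)."""
    relative_power = tdp_w / baseline_tdp_w
    return relative_perf / relative_power


t4_relative_perf = 0.42  # single T4 relative to single V100 (Figure 4)
t4_efficiency = relative_perf_per_watt(t4_relative_perf, T4_TDP_W, V100_TDP_W)
print(f"T4 TDP relative to V100: {T4_TDP_W / V100_TDP_W:.0%}")
print(f"T4 performance per watt relative to V100: {t4_efficiency:.2f}x")
```

Under this assumption, 42 percent of the performance at 28 percent of the TDP corresponds to about 1.5x the performance per watt of the V100 for this dataset.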




High Performance Linpack (HPL)

Figure 5: HPL results with V100s and T4s on the PowerEdge R740 server

Figure 5 shows HPL performance on the PowerEdge R740 with multiple V100 or T4 GPUs. As expected, HPL scales well across multiple GPUs for both the V100 and the T4. However, T4 performance is significantly lower than V100 performance because of the T4's limited FP64 capability. The T4 is therefore not well suited to this comparison, and the Volta V100 remains the best choice for double-precision applications such as HPL.




Conclusions and future work

In this blog, HPC application performance with HOOMD-blue, Amber, NAMD and HPL was compared between the V100 and the T4 on the Dell EMC PowerEdge R740. The T4 is useful not only for deep learning inference but also for HPC applications that support single or mixed precision. Its low TDP can help accelerate workloads in traditional data centers where power and cooling capacity are limited. The T4’s small PCIe form factor makes it a good fit for more general-purpose PowerEdge servers. In the future, additional tests are planned with more applications such as RELION, GROMACS and LAMMPS, as well as tests for applications that can leverage mixed precision.

*Disclaimer: For benchmarking purposes, four T4 GPUs were evaluated in the Dell PowerEdge R740. Currently, the PowerEdge R740 officially supports a maximum of three T4 GPUs in x16 PCIe slots.








Article ID: SLN316570

Last Date Modified: 03/18/2019 08:17 AM

