HPC applications performance with Turing

Article written by Frank Han, Rengan Xu, Deepthi Cherlopalle and Quy Ta of Dell EMC HPC and AI Innovation Lab in March 2019

Table of Contents:

  1. Abstract
  2. Overview
  3. HOOMD-blue
  4. Amber
  5. NAnoscale Molecular Dynamics (NAMD)
  6. High Performance Linpack (HPL)
  7. Conclusions and future work


Abstract

As the successor to the Volta™ architecture, Turing™ is NVIDIA®’s latest family of GPUs. The Turing™ GPU is available with GeForce®, where it is used to render highly realistic games, and with Quadro®, where it accelerates content creation workflows. The NVIDIA® Tesla® series is designed to handle artificial intelligence and high performance computing (HPC) workloads in data centers. The NVIDIA® Tesla® T4 is the only server-grade GPU with the Turing™ micro-architecture available in the market now, and it is supported by Dell EMC PowerEdge R640, R740, R740xd and R7425 servers. This blog discusses the performance of the new Tesla T4 compared to the latest Volta V100-PCIe on the PowerEdge R740 server for different HPC applications, including HOOMD-blue, Amber, NAMD and HPL.



Overview

The PowerEdge R740 server is a 2U Intel® Skylake-based rack-mount server that provides an ideal balance of storage, I/O and accelerator support. It supports up to four* single-slot T4 GPUs or three dual-slot V100-PCIe GPUs in x16 PCIe 3.0 slots. Table 1 notes the differences between a single T4 and a single V100. The Volta™ V100 is available in 16GB and 32GB memory configurations. Since the T4 is only available in a 16GB version, the V100 card with 16GB memory was used to provide comparable performance results. Table 2 lists the hardware and software details of the test bed.

Table 1: The comparison between T4 and V100

                                Tesla V100-PCIe    Tesla T4
  CUDA cores                    5,120              2,560
  Tensor cores                  640                320
  Compute capability            7.0                7.5
  GPU clock                     1245 MHz           585 MHz
  Boost clock                   1380 MHz           1590 MHz
  Memory type                   HBM2               GDDR6
  Memory bus                    4096-bit           256-bit
  Memory bandwidth              900 GB/s           320 GB/s
  Slot width                    Dual               Single
  FP32 single-precision         14 TFLOPS          8.1 TFLOPS
  Mixed-precision (FP16/FP32)   112 TFLOPS         65 TFLOPS
  FP64 double-precision         7 TFLOPS           254.4 GFLOPS
  TDP                           250 W              70 W
Table 2: Details of R740 configuration and software versions

  Processor           2x Intel® Xeon® Gold 6136 @ 3.0GHz, 12c
  Local disk          480 GB SSD
  Operating system    Red Hat Enterprise Linux Server release 7.5
  GPU                 3x V100-PCIe 16 GB or 4x T4 16 GB
  CUDA toolkit        10.0
  Applications        HOOMD-blue, Amber, NAMD and HPL, compiled with CUDA 10.0



HOOMD-blue

Figure 1: HOOMD-blue single and double precision performance results with V100s and T4s on the PowerEdge R740 server

HOOMD-blue (Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. By default, HOOMD-blue is compiled in double precision (FP64); version 2.5 provides a parameter, SINGLE_PRECISION=ON, to force compilation in single precision (FP32). Figure 1 shows the microsphere dataset results for single and double precision. The x-axis is the number of GPUs, and the performance metric is hours to complete 10e6 steps.

  1. One observation is that the FP64 performance of the T4 is relatively low. This is a hardware limitation: in theory, the T4 can deliver 254.4 GFLOPS of peak double-precision performance (refer to Table 1), whereas the V100 is ~27x better. However, applications like HOOMD-blue that can be compiled and run in single precision can gain a performance advantage from the FP32 compile option. The HOOMD-blue community has considered our suggestion about supporting mixed precision in all HOOMD-blue modules. Once that effort is complete, HOOMD-blue will be able to take better advantage of hardware with mixed-precision support.

  2. Comparing the single precision performance of the T4 and V100, we noticed that the V100 is 3x faster than the T4. This is expected given the T4's lower CUDA core count and power rating.

  3. GPUs in the PowerEdge R740 server are connected through PCIe. At the three-V100 data point, the PCIe bus is saturated by peer-to-peer communication, which caps overall performance at the level of a single GPU.
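The FP64 gap noted in item 1 can be sanity-checked from the peak numbers; a minimal sketch, in which the T4 figure comes from Table 1 and the ~7 TFLOPS V100-PCIe figure is NVIDIA's published peak, assumed here to reconcile with the ~27x statement:

```python
# Peak FP64 throughput, in GFLOPS. The T4 figure is from Table 1; the
# V100-PCIe figure (~7 TFLOPS) is NVIDIA's published peak (an assumption
# here), consistent with the ~27x gap quoted above.
t4_fp64_gflops = 254.4
v100_fp64_gflops = 7000.0

ratio = v100_fp64_gflops / t4_fp64_gflops
print(f"V100 FP64 peak is ~{ratio:.1f}x the T4's")  # ~27.5x
```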



Amber

Amber is the collective name for a suite of programs that allows users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber is also used to refer to the empirical force fields implemented in this suite. Amber version 18.12 with AmberTools 18.13 was tested with the Amber 18 Benchmark Suite, which includes the JAC, Cellulose, FactorIX, STMV, TRPCage, myoglobin and nucleosome datasets.

Figure 2: Amber Explicit Solvent results with V100s and T4s on the PowerEdge R740 server

Figure 3: Amber Implicit Solvent results with V100s and T4s on the PowerEdge R740 server

Figure 2 and Figure 3 show the single-card and whole-system performance numbers for the explicit solvent and implicit solvent benchmarks, respectively. The "system" data points in the graphs represent the aggregate throughput of all GPUs in the server. The PowerEdge R740 server supports three V100s or four T4s, so the "system" bars in red and blue are the results with three V100s or four T4s, respectively.

The reason for presenting aggregate multi-GPU data is that Amber does not scale beyond a single accelerator on Pascal and later GPUs; users generally run multiple independent simulations in parallel, one per GPU. On a large dataset like STMV (1,067,095 atoms), a single T4 delivers 33 percent of a single V100's throughput, and the whole T4 system delivers 44 percent of the V100 system's. Datasets like TRPCage (only 304 atoms) are too small to make effective use of the V100, so its advantage over the T4 is much smaller than on larger PME runs. Per the results on Amber's official website, almost all GPU numbers are three to four times faster than CPU-only runs, so a T4 card in a server handling small datasets is a good option.
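The system-level figure follows directly from the single-card ratio and the card counts; a sketch in relative units, normalizing a single V100's STMV throughput to 1.0:

```python
# Relative STMV throughput, normalizing one V100 to 1.0.
v100_single = 1.0
t4_single = 0.33 * v100_single           # a single T4 is 33% of a V100

v100_system = 3 * v100_single            # the R740 holds three V100s...
t4_system = 4 * t4_single                # ...or four T4s

system_ratio = t4_system / v100_system
print(f"T4 system vs. V100 system: {system_ratio:.0%}")  # 44%
```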


NAnoscale Molecular Dynamics (NAMD)

Figure 4: NAMD performance results with V100s and T4s on the PowerEdge R740 server

NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests, the prebuilt binary was not used; instead, NAMD was built from the latest source code (NAMD_Git-2019-02-11) with CUDA 10.0. For best performance, NAMD was compiled with the Intel® compiler and libraries (version 2018u3). Figure 4 plots the performance results with the STMV dataset (1,066,628 atoms, periodic, PME). NAMD does not scale beyond one V100 card, but it scales well with three T4 cards. A single T4 GPU delivers 42 percent of the V100's performance, which is a decent number considering it has only 28 percent of the V100's TDP. The T4 could be a good choice for data centers with limited power and cooling capacity.
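The performance-per-watt claim can be made concrete with the TDP figures from Table 1; a rough sketch in relative units:

```python
# Relative STMV performance (one V100 = 1.0) and TDPs from Table 1.
v100_perf, v100_tdp_w = 1.00, 250
t4_perf, t4_tdp_w = 0.42, 70             # 70 W is 28% of 250 W

v100_perf_per_watt = v100_perf / v100_tdp_w
t4_perf_per_watt = t4_perf / t4_tdp_w

advantage = t4_perf_per_watt / v100_perf_per_watt
print(f"T4 performance per watt: {advantage:.1f}x the V100's")  # 1.5x
```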


High Performance Linpack (HPL)

Figure 5: HPL results with V100s and T4s on the PowerEdge R740 server

Figure 5 shows HPL performance on the PowerEdge R740 with multiple V100 or T4 GPUs. As expected, HPL scales well with multiple GPUs for both the V100 and the T4, but T4 performance is significantly lower than the V100's due to its limited FP64 capability. Given that limitation, the T4 is not well suited to double-precision workloads, and the Volta V100 remains the best choice for such applications.
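The gap mirrors the theoretical FP64 peaks: even a fully populated four-T4 system trails a single V100. A sketch, where the T4 figure comes from Table 1 and the ~7 TFLOPS V100-PCIe figure is NVIDIA's published peak (an assumption here):

```python
# Theoretical FP64 peaks in TFLOPS. The T4 number is from Table 1; the
# V100-PCIe number (~7 TFLOPS) is NVIDIA's published peak (an assumption).
t4_fp64_tflops = 0.2544
v100_fp64_tflops = 7.0

t4_system_peak = 4 * t4_fp64_tflops      # four T4s: ~1.02 TFLOPS
v100_system_peak = 3 * v100_fp64_tflops  # three V100s: 21 TFLOPS
print(f"Peak FP64: {t4_system_peak:.2f} vs {v100_system_peak:.0f} TFLOPS")
```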


Conclusions and future work

In this blog, HPC application performance with HOOMD-blue, Amber, NAMD and HPL was compared between the V100 and T4 on the Dell EMC PowerEdge R740. The T4 is not only useful for deep learning inference; it is also beneficial for HPC applications with single- or mixed-precision support. Its low TDP can help traditional data centers where power and cooling capacity is limited, and its single-slot PCIe form factor makes it a good fit for more general-purpose PowerEdge servers. In the future, additional tests are planned with more applications such as RELION, GROMACS and LAMMPS, as well as with applications that can leverage mixed precision.

*Disclaimer: For benchmarking purposes, four T4 GPUs were evaluated in the Dell PowerEdge R740. Currently, the PowerEdge R740 officially supports a maximum of three T4 GPUs in x16 PCIe slots.



Article ID: SLN316570

Last Date Modified: 03/18/2019 08:17 AM
