As the successor to the Volta architecture, Turing™ is NVIDIA®’s latest family of GPUs. Turing™ GPUs are available with GeForce®, where they are used to render highly realistic games, and with Quadro®, where they accelerate content creation workflows. The NVIDIA® Tesla® series is designed for artificial intelligence and high performance computing (HPC) workloads in the data center. The NVIDIA® Tesla® T4 is currently the only server-grade GPU with the Turing™ micro-architecture on the market, and it is supported by the Dell EMC PowerEdge R640, R740, R740xd and R7425 servers. This blog compares the performance of the new Tesla T4 with the latest Volta V100-PCIe on the PowerEdge R740 server for several HPC applications: HOOMD-blue, Amber, NAMD and HPL.
The PowerEdge R740 server is a 2U Intel® Skylake-based rack-mount server that provides an ideal balance of storage, I/O and accelerator support. It supports up to four* single-slot T4 GPUs or three dual-slot V100-PCIe GPUs in x16 PCIe 3.0 slots. Table 1 notes the differences between a single T4 and a single V100. The Volta™ V100 is available in 16GB and 32GB memory configurations; since the T4 is only available with 16GB, the 16GB V100 was used to provide comparative performance results. Table 2 lists the hardware and software details of the test bed.
Table 1: The comparison between T4 and V100

| | Tesla V100-PCIe | Tesla T4 |
|---|---|---|
| Architecture | Volta | Turing |
| CUDA cores | 5120 | 2560 |
| Tensor cores | 640 | 320 |
| Compute capability | 7.0 | 7.5 |
| GPU clock | 1245 MHz | 585 MHz |
| Boost clock | 1380 MHz | 1590 MHz |
| Memory type | HBM2 | GDDR6 |
| Memory bus | 4096-bit | 256-bit |
| Memory bandwidth | 900 GB/s | 320 GB/s |
| Slot width | Dual-slot | Single-slot |
| FP32 single precision | 14 TFLOPS | 8.1 TFLOPS |
| Mixed precision (FP16/FP32) | 112 TFLOPS | 65 TFLOPS |
| FP64 double precision | 7 TFLOPS | 254.4 GFLOPS |
| TDP | 250 W | 70 W |
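As a quick sanity check on the peak throughput rows in Table 1, the sketch below recomputes them from the core counts and boost clocks. The 1/2 (V100) and 1/32 (T4) FP64 rate ratios are assumptions based on the publicly documented architecture behavior, not values measured on this test bed.

```python
# Rough peak-throughput sanity check for Table 1 (assumes FMA = 2 FLOPs/clock).
def peak_fp32_tflops(cuda_cores, boost_mhz):
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12

v100 = peak_fp32_tflops(5120, 1380)   # ~14.1 TFLOPS FP32
t4 = peak_fp32_tflops(2560, 1590)     # ~8.1 TFLOPS FP32
print(f"V100 FP32 ~{v100:.1f} TFLOPS, FP64 ~{v100 / 2:.1f} TFLOPS")
print(f"T4   FP32 ~{t4:.1f} TFLOPS, FP64 ~{t4 / 32 * 1000:.0f} GFLOPS")
```

The ~27x double-precision gap discussed below falls directly out of these FP64 rate ratios.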
Table 2: Details of R740 configuration and software versions

| Processor | 2x Intel® Xeon® Gold 6136 @ 3.0 GHz, 12 cores |
|---|---|
| Memory | 384 GB (12 x 32 GB @ 2666 MHz) |
| Local disk | 480 GB SSD |
| Operating system | Red Hat Enterprise Linux Server release 7.5 |
| GPU | 3x V100-PCIe 16 GB or 4x T4 16 GB |
| Processor settings > Logical processors | Disabled |
| System profile | Performance |
| CUDA driver | 410.66 |
| CUDA toolkit | 10.0 |
| HPL | Compiled with CUDA 10.0 |
| NAMD | NAMD_Git-2019-02-11 |
| Amber | 18.12 |
| HOOMD-blue | v2.5.0 |
| OpenMPI | 4.0.0 |
Figure 1: HOOMD-blue single and double precision performance results with V100s and T4s on the PowerEdge R740 server
HOOMD-blue (for Highly Optimized Object-oriented Many-particle Dynamics - blue) is a general-purpose molecular dynamics simulator. By default, HOOMD-blue is compiled in double precision (FP64); version 2.5 provides a build parameter, SINGLE_PRECISION=ON, that forces a single-precision (FP32) build. Figure 1 shows the microsphere dataset results for single and double precision. The x-axis is the number of GPUs, and the performance metric is hours to complete 10e6 steps.
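For readers unfamiliar with how HOOMD-blue benchmarks are driven, the sketch below is a minimal HOOMD-blue 2.x GPU run script for a generic Lennard-Jones system; it is an illustration only, not the microsphere input used for Figure 1. The precision is fixed at build time by the SINGLE_PRECISION flag, so the script is identical for FP32 and FP64 builds.

```python
import hoomd
import hoomd.md

# Select the GPU backend; HOOMD-blue runs on the CPU without --mode=gpu.
hoomd.context.initialize("--mode=gpu")

# Build a small Lennard-Jones system on a simple cubic lattice.
hoomd.init.create_lattice(unitcell=hoomd.lattice.sc(a=1.2), n=10)
nl = hoomd.md.nlist.cell()
lj = hoomd.md.pair.lj(r_cut=2.5, nlist=nl)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

# Langevin dynamics; the benchmark metric is wall time for a fixed step count.
hoomd.md.integrate.mode_standard(dt=0.005)
hoomd.md.integrate.langevin(group=hoomd.group.all(), kT=1.0, seed=42)
hoomd.run(10e6)
```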
One observation is that the FP64 performance of the T4 is relatively low. This is a hardware limitation: in theory the T4 delivers 254.4 GFLOPS of peak double-precision performance (refer to Table 1), whereas the V100's peak is ~27x higher. However, applications like HOOMD-blue that can be compiled and run in single precision gain a performance advantage from the FP32 build option. The HOOMD-blue community has considered our suggestion to support mixed precision across all HOOMD-blue modules; once that effort is complete, HOOMD-blue will be able to take better advantage of hardware with mixed-precision support.
Comparing the single-precision performance of the T4 and V100, we noticed that the V100 is 3x faster than the T4. This result is expected given the T4's lower CUDA core count and power rating.
GPUs in the PowerEdge R740 server are connected through PCIe. At the three-V100 data point, the PCIe bus is saturated by peer-to-peer communication, which limits overall performance to roughly that of a single GPU.
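To inspect which device pairs on a given server can use peer-to-peer access, a quick sketch (assuming the CuPy package is installed; `nvidia-smi topo -m` reports similar topology information) is:

```python
import cupy

# Query every ordered GPU pair for peer-to-peer (P2P) capability.
# On the R740 all P2P traffic still traverses the shared PCIe fabric,
# which is what saturates at the three-V100 data point.
n = cupy.cuda.runtime.getDeviceCount()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = cupy.cuda.runtime.deviceCanAccessPeer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'yes' if ok else 'no'}")
```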
Amber is the collective name for a suite of programs that allows users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber also refers to the empirical force fields implemented in this suite. Amber version 18.12 with AmberTools 18.13 was tested with the Amber 18 Benchmark Suite, which includes the JAC, Cellulose, FactorIX, STMV, TRPCage, myoglobin and nucleosome datasets.
Figure 2: Amber Explicit Solvent results with V100s and T4s on the PowerEdge R740 server
Figure 3: Amber Implicit Solvent results with V100s and T4s on the PowerEdge R740 server
Figure 2 and Figure 3 show the single-card and whole-system performance numbers for explicit solvent and implicit solvent, respectively. The "system" data points in these graphs represent the aggregate throughput of all GPUs in the server. The PowerEdge R740 server supports three V100s or four T4s, so the "system" bars in red and blue are the results with three V100s and four T4s, respectively.
The reason for reporting aggregate multi-GPU data is that, for Amber, Pascal and later GPUs do not scale beyond a single accelerator; users instead run multiple independent simulations in parallel, one per GPU, as sketched below. With a large dataset like STMV (1,067,095 atoms), a single T4 delivers 33 percent, and the whole T4 system 44 percent, of the corresponding V100 performance. A dataset like TRPCage (only 304 atoms) is too small to make effective use of the V100, so the V100's advantage over the T4 is much smaller there than for the larger PME runs. Per the results on Amber's official website, almost all GPU numbers are three to four times faster than CPU-only runs, so a T4 card is a good option for a server handling small datasets.
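A minimal launcher for this one-simulation-per-GPU pattern might look like the following sketch; the pmemd.cuda input file names are placeholders, and each process is pinned to its own device via CUDA_VISIBLE_DEVICES.

```python
import os
import subprocess

NUM_GPUS = 4  # e.g., four T4s in the R740

procs = []
for gpu in range(NUM_GPUS):
    # Pin each independent simulation to its own device.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["pmemd.cuda", "-O",
         "-i", "mdin", "-p", "prmtop", "-c", "inpcrd",
         "-o", f"mdout.{gpu}", "-x", f"mdcrd.{gpu}", "-r", f"restrt.{gpu}"],
        env=env))

# Aggregate throughput is the sum of the per-GPU ns/day figures.
for p in procs:
    p.wait()
```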
Figure 4: NAMD performance results with V100s and T4s on the PowerEdge R740 server
NAMD is a molecular dynamics code designed for high-performance simulation of large biomolecular systems. In these tests the prebuilt binary was not used; instead, NAMD was built from the latest source code (NAMD_Git-2019-02-11) with CUDA 10.0. For best performance, NAMD was compiled with the Intel® compiler and libraries (version 2018u3). Figure 4 plots the performance results using the STMV dataset (1,066,628 atoms, periodic, PME). NAMD does not scale beyond one V100 card, but it scales well up to three T4 cards. A single T4 GPU delivers 42 percent of the V100's performance, a respectable number considering the T4 has only 28 percent of the V100's TDP. The T4 can therefore be a good choice for data centers with limited power and cooling capacity.
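Combining the measured NAMD ratio from Figure 4 with the TDP figures from Table 1 gives a rough performance-per-watt comparison; this is a back-of-the-envelope sketch that uses TDP as a proxy for actual power draw.

```python
# Back-of-the-envelope NAMD performance per watt (TDP used as a proxy).
v100_tdp_w, t4_tdp_w = 250, 70    # from Table 1
t4_relative_perf = 0.42           # single-GPU STMV ratio from Figure 4
perf_per_watt_ratio = t4_relative_perf / (t4_tdp_w / v100_tdp_w)
print(f"T4 NAMD performance per watt ~{perf_per_watt_ratio:.1f}x that of V100")  # ~1.5x
```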
Figure 5: HPL results with V100s and T4s on the PowerEdge R740 server
Figure 5 shows HPL performance on the PowerEdge R740 with multiple V100 or T4 GPUs. As expected, the HPL numbers scale well with multiple GPUs for both the V100 and the T4, but T4 performance is significantly lower than V100 performance because of the T4's limited FP64 capability. For double-precision applications like HPL, the Volta V100 remains the best choice.
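The gap follows directly from the theoretical double-precision ceilings of the two configurations; the sketch below compares peak (not measured HPL Rmax) FP64 throughput using the Table 1 values.

```python
# Theoretical system-level FP64 ceilings (peak, not measured HPL Rmax).
v100_fp64_tflops = 7.0       # per V100, from Table 1
t4_fp64_tflops = 0.2544      # per T4 (254.4 GFLOPS), from Table 1
print(f"3x V100: {3 * v100_fp64_tflops:.1f} TFLOPS peak FP64")   # 21.0
print(f"4x T4:   {4 * t4_fp64_tflops:.1f} TFLOPS peak FP64")     # ~1.0
```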
In this blog, HPC application performance with HOOMD-blue, Amber, NAMD and HPL was compared between the V100 and T4 on the Dell EMC PowerEdge R740. The T4 is not only useful for deep learning inference; it is also beneficial for HPC applications with single- or mixed-precision support. Its low TDP can help speed up traditional data centers where power and cooling capacity is limited, and its small single-slot PCIe form factor makes it a good fit for more general-purpose PowerEdge servers. In the future, additional tests are planned with more applications such as RELION, GROMACS and LAMMPS, as well as tests of applications that can leverage mixed precision.
*Disclaimer: For benchmarking purposes, four T4 GPUs in the Dell PowerEdge R740 were evaluated. The PowerEdge R740 currently supports an official maximum of three T4 GPUs in x16 PCIe slots.