Deep Learning Performance on T4 GPUs with MLPerf Benchmarks

Article written by Rengan Xu, Frank Han, and Quy Ta of the HPC and AI Innovation Lab in March 2019

Table of Contents:

  1. Abstract
  2. Overview
  3. Performance Evaluation
  4. Conclusions and Future Work


Abstract

Turing is NVIDIA’s latest GPU architecture, succeeding Volta, and the new T4 GPU is based on it. The T4 is designed for High-Performance Computing (HPC), deep learning training and inference, machine learning, data analytics, and graphics. This blog quantifies the deep learning training performance of T4 GPUs in the Dell EMC PowerEdge R740 server with the MLPerf benchmark suite. MLPerf performance on the T4 is also compared to that of the V100-PCIe on the same server with the same software.



Overview

The Dell EMC PowerEdge R740 is a 2-socket, 2U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and up to three double-width V100-PCIe or four single-width T4 GPUs in x16 PCIe 3.0 slots. The T4 uses NVIDIA’s latest Turing architecture. The specification differences between the T4 and V100-PCIe GPUs are listed in Table 1. MLPerf was chosen to evaluate the deep learning training performance of the T4. MLPerf is a benchmark suite assembled by a diverse group from academia and industry, including Google, Baidu, Intel, AMD, Harvard, and Stanford, to measure the speed and performance of machine learning software and hardware. The initial release, v0.5, covers model implementations across several machine learning domains, including image classification, object detection and segmentation, machine translation, and reinforcement learning. The MLPerf benchmarks used for this evaluation are summarized in Table 2. The ResNet-50 TensorFlow implementation from Google’s submission was used; all other model implementations were taken from NVIDIA’s submission. All benchmarks were run on bare metal without containers. Table 3 lists the hardware and software used for the evaluation. The T4’s MLPerf performance will be compared to that of the V100-PCIe.

                              Tesla V100-PCIe    Tesla T4
Architecture                  Volta              Turing
CUDA Cores                    5120               2560
Tensor Cores                  640                320
Compute Capability            7.0                7.5
GPU Clock                     1245 MHz           585 MHz
Boost Clock                   1380 MHz           1590 MHz
Memory Type                   HBM2               GDDR6
Memory Size                   16GB/32GB          16GB
Memory Bandwidth              900GB/s            320GB/s
Slot Width                    Dual-Slot          Single-Slot
Single-Precision (FP32)       14 TFLOPS          8.1 TFLOPS
Mixed-Precision (FP16/FP32)   112 TFLOPS         65 TFLOPS
Double-Precision (FP64)       7 TFLOPS           254.4 GFLOPS
TDP                           250 W              70 W

Table 1: The comparison between T4 and V100-PCIe
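A few efficiency figures can be derived directly from Table 1. The following minimal Python sketch uses only the table’s peak specification numbers, which do not reflect measured training performance:

```python
# Derived metrics from Table 1: peak throughput ratio and efficiency
# (TFLOPS per watt) for V100-PCIe vs. T4. All inputs are the peak
# figures quoted in the table, not measured results.
specs = {
    "V100-PCIe": {"fp32_tflops": 14.0, "mixed_tflops": 112.0, "tdp_w": 250},
    "T4":        {"fp32_tflops": 8.1,  "mixed_tflops": 65.0,  "tdp_w": 70},
}

for name, s in specs.items():
    fp32_per_watt = s["fp32_tflops"] / s["tdp_w"]
    mixed_per_watt = s["mixed_tflops"] / s["tdp_w"]
    print(f"{name}: {fp32_per_watt:.3f} FP32 TFLOPS/W, "
          f"{mixed_per_watt:.3f} mixed TFLOPS/W")

# On peak numbers, V100 offers ~1.7x the FP32 and mixed-precision
# throughput, while T4 delivers roughly 2x the throughput per watt.
fp32_ratio = specs["V100-PCIe"]["fp32_tflops"] / specs["T4"]["fp32_tflops"]
print(f"V100/T4 peak FP32 ratio: {fp32_ratio:.2f}x")
```

Note that the measured MLPerf speedups reported below (2.2x – 3.6x) exceed this ~1.7x peak-FLOPS ratio; the V100’s much higher memory bandwidth (900GB/s vs. 320GB/s) is one plausible contributor.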

Benchmark                      Data                 Data Size  Model                         Framework
Image Classification           ImageNet             144GB      ResNet-50 v1.5                TensorFlow
Object Detection               COCO                 20GB       Single-Stage Detector (SSD)   PyTorch
Object Instance Segmentation   COCO                 20GB       Mask R-CNN                    PyTorch
Translation (Recurrent)        WMT English-German   37GB       GNMT                          PyTorch
Translation (Non-Recurrent)    WMT English-German   1.3GB      Transformer                   PyTorch
Recommendation                 MovieLens-20M        306MB      NCF                           PyTorch

Table 2: The MLPerf benchmarks used in the evaluation

Platform                          PowerEdge R740
CPU                               2x Intel Xeon Gold 6136 @ 3.0GHz (Skylake)
Memory                            384GB DDR4 @ 2666MHz
Storage                           782TB Lustre
GPU                               T4, V100-PCIe
OS and Firmware
  Operating System                RHEL 7.5 x86_64
  Linux Kernel                    3.10.0-693.el7.x86_64
  BIOS                            1.6.12
Deep Learning Related
  CUDA compiler and GPU driver    CUDA 10.0.130 (driver 410.66)
  cuDNN                           7.4.1
  NCCL                            2.3.7
  TensorFlow                      nightly-gpu-dev20190130
  PyTorch                         1.0.0
  MLPerf                          v0.5

Table 3: The hardware configuration and software details


Performance Evaluation

Figure 1 shows the MLPerf results on T4 and V100-PCIe GPUs in the PowerEdge R740 server. Six benchmarks from MLPerf are included. For each benchmark, end-to-end model training was performed until the model reached the target accuracy defined by the MLPerf committee, and the training time in minutes was recorded. The following conclusions can be drawn from these results:

  • The ResNet-50 v1.5, SSD, and Mask R-CNN models scale well with an increasing number of GPUs. For ResNet-50 v1.5, V100-PCIe is 3.6x faster than T4. For SSD, V100-PCIe is 3.3x – 3.4x faster than T4. For Mask R-CNN, V100-PCIe is 2.2x – 2.7x faster than T4. With the same number of GPUs, each model takes almost the same number of epochs to converge on T4 and V100-PCIe.

  • For the GNMT model, super-linear speedup was observed as more T4 GPUs were used: compared to one T4, the speedup is 3.1x with two T4s and 10.4x with four T4s. This is because model convergence is affected by the random seed, which is used for shuffling the training data and initializing the network weights. With a different random seed, the model may need a different number of epochs to converge, regardless of how many GPUs are used. In this experiment, the model took 12, 7, 5, and 4 epochs to converge with 1, 2, 3, and 4 T4s, respectively, and 16, 12, and 9 epochs with 1, 2, and 3 V100-PCIe GPUs, respectively. Since the epoch counts differ significantly even with the same number of T4 and V100 GPUs, training times cannot be compared directly. In this scenario, throughput is a fairer metric because it does not depend on the random seed. Figure 2 shows the throughput comparison for T4 and V100-PCIe: with the same number of GPUs, V100-PCIe is 2.5x – 3.6x faster than T4.

  • The NCF and Transformer models have the same issue as GNMT. For NCF, the dataset is small and the model converges quickly, so the issue is not apparent in the results figure. The Transformer model shows the issue with one GPU: it took 12 epochs to converge on one T4 but only 8 epochs on one V100-PCIe. With two or more GPUs, the model took 4 epochs to converge regardless of the number or type of GPUs; in these cases, V100-PCIe is 2.6x – 2.8x faster than T4.
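The seed sensitivity behind these observations can be illustrated without any deep learning framework. Below is a minimal Python sketch of the data-shuffling half of the story; the function name and sample count are illustrative, and real training runs would seed the framework itself through its own seeding APIs:

```python
import random

def epoch_order(seed, n_samples=8):
    """The order in which training samples are visited for a given seed."""
    rng = random.Random(seed)     # independent generator per run
    order = list(range(n_samples))
    rng.shuffle(order)
    return order

# Two runs with different seeds see the data in a different order, so
# their loss trajectories -- and epochs-to-target-accuracy -- can differ.
assert epoch_order(0) != epoch_order(1)

# The same seed reproduces the same order, which is why fixing the seed
# makes a single run reproducible.
assert epoch_order(42) == epoch_order(42)
```

The weight-initialization half works the same way: a different seed draws different initial weights, changing how many epochs the optimizer needs to reach the target accuracy.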

Figure 1: MLPerf results on T4 and V100-PCIe

Figure 2: The throughput comparison for GNMT model
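The choice of throughput in Figure 2 can be motivated with a small sketch. The epoch counts below are the GNMT numbers reported above; the total training times are hypothetical placeholders, chosen only so the naive ratios match the reported 3.1x and 10.4x total-time speedups:

```python
# Normalizing training time by epoch count separates seed effects
# (how many epochs were needed) from hardware speed (minutes/epoch).
# Epoch counts are from the GNMT runs above; total-time values are
# HYPOTHETICAL, scaled only to reproduce the reported speedups.

def minutes_per_epoch(total_minutes, epochs):
    return total_minutes / epochs

t4_runs = {1: (120.0, 12), 2: (38.7, 7), 4: (11.5, 4)}  # gpus: (minutes, epochs)

base_total, base_epochs = t4_runs[1]
for gpus, (minutes, epochs) in t4_runs.items():
    naive = base_total / minutes                        # seed-contaminated
    normalized = (minutes_per_epoch(base_total, base_epochs)
                  / minutes_per_epoch(minutes, epochs))  # seed-independent
    print(f"{gpus} GPU(s): naive {naive:.1f}x, per-epoch {normalized:.1f}x")
```

With these numbers, the naive 4-GPU "speedup" is super-linear (~10.4x) while the per-epoch speedup is an ordinary sub-linear ~3.5x, which is why a throughput-style metric gives the fairer comparison.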


Conclusions and Future Work

In this blog, we evaluated the performance of T4 GPUs in the Dell EMC PowerEdge R740 server using the MLPerf benchmarks. The T4’s performance was compared to that of the V100-PCIe using the same server and software. Overall, V100-PCIe is 2.2x – 3.6x faster than T4, depending on the characteristics of each benchmark. One observation is that some models converge stably regardless of the random seed, while others, including GNMT, NCF, and Transformer, are highly sensitive to it. In future work, we will fine-tune the hyperparameters to make the unstable models converge in fewer epochs. We will also run MLPerf on more GPUs and more nodes to evaluate the scalability of these models on PowerEdge servers.

*Disclaimer: For benchmarking purposes, four T4 GPUs in the Dell EMC PowerEdge R740 were evaluated. Currently, the PowerEdge R740 officially supports a maximum of three T4 GPUs in x16 PCIe slots.




Article ID: SLN316560

Last Date Modified: 04/01/2019 02:57 AM
