
INFERENCE using the NVIDIA T4

Summary: HPC, High Performance Computing, HPC and AI Innovation Lab, Dell EMC, PowerEdge R740, Performance, NVIDIA T4, CUDA, TensorRT, AlexNet, GoogLeNet, ResNet 50, VGG-19


Article written by Deepthi Cherlopalle, Xu Rengan, Frank Han and Quy Ta of HPC and AI Innovation Lab in March 2019



Table of Contents:

  1. Introduction
  2. Testbed configuration
  3. NVIDIA TensorRT
  4. Inference performance
  5. Multi-GPU Inference Performance
  6. Accuracy tests
  7. Conclusion

 

Introduction

 

There are two main phases in machine learning: training and inference. Training a neural network involves setting the number of epochs, batch size and learning rate, and optimizing other hyperparameters. After hours of training, a static model is generated that can be deployed anywhere. In the next phase, inference, real-world data is fed into the trained model to generate predictions. The goal of training is to build a neural network with high accuracy; the goal of inference is to detect or identify quickly, i.e., with an emphasis on latency. There are many options available for inference, and in this blog we introduce the NVIDIA® T4 GPU and compare its performance with the previous-generation NVIDIA Pascal™ P4 inference-focused GPU. We also compare against the NVIDIA Volta™ V100, which is a good option for the training phase as well. These tests were conducted on a Dell EMC PowerEdge R740 server.

NVIDIA’s latest GPU based on the Turing™ micro-architecture is the Tesla® T4. This card has been specifically designed for deep learning training and inferencing. The NVIDIA T4 is an x16 PCIe Gen3 low-profile card; the small form factor makes it easier to install in PowerEdge servers. The Tesla T4 supports a full range of precisions for inference: FP32, FP16, INT8 and INT4.


Figure 1: NVIDIA T4 card [Source: NVIDIA website]

The table below compares the performance capabilities of different NVIDIA GPU cards.

 
GPU                                   | Tesla V100-PCIe | Tesla T4   | Tesla P4    | Tesla T4 vs P4
Architecture                          | Volta           | Turing     | Pascal      |
NVIDIA CUDA Cores                     | 5120            | 2560       | 2560        | Same number of cores
GPU Clock                             | 1245 MHz        | 585 MHz    | 885 MHz     |
Boost Clock                           | 1380 MHz        | 1590 MHz   | 1531 MHz    |
Single-Precision Performance (FP32)   | 14 TFLOPS       | 8.1 TFLOPS | 5.5 TFLOPS  | ~1.48x higher
Half-Precision Performance (FP16)     | 112 TFLOPS      | 65 TFLOPS  | N/A         | Introduced with T4
Integer Operations (INT8)             | 224 TOPS        | 130 TOPS   | 22 TOPS     | ~5.9x higher
Integer Operations (INT4)             | N/A             | 260 TOPS   | N/A         | Introduced with T4
GPU Memory                            | 16 GB           | 16 GB      | 8 GB        | 2x more
Memory Bandwidth                      | 900 GB/s        | 320 GB/s   | 192 GB/s    | ~1.67x higher
Power                                 | 250 W           | 70 W       | 50 W/75 W   |

Table 1: Comparison of NVIDIA GPU Cards

The Dell EMC™ PowerEdge™ R740 is a 2U, two-socket platform with support for two Intel® Xeon® Scalable processors, dense storage options, high-speed interconnects and various GPUs. The PowerEdge R740 can support up to three NVIDIA T4, P4, or V100 PCIe cards in x16 slots.




 

Testbed configuration

 

The following table describes the hardware and software configuration used for the inference study.

Server                       | Dell EMC PowerEdge R740
Processor                    | Dual Intel Xeon Gold 6136 @ 3.00 GHz, 12 cores each
Memory                       | 384 GB @ 2667 MT/s
GPU                          | NVIDIA T4 / NVIDIA P4 / NVIDIA V100
Power Supplies               | Dual 1600 W
BIOS                         | 1.4.5
Operating System             | RHEL 7.4
Kernel                       | 3.10.0-693.el7.x86_64
System Profile               | Performance Optimized
CUDA Driver                  | 410.66
CUDA Toolkit                 | 10.0
TensorRT                     | 5.0.2.6
Image Classification Models  | AlexNet, GoogLeNet, ResNet-50, VGG-19

Table 2: Testbed information




 

NVIDIA TensorRT

 

TensorRT is a software platform for deep learning inference that includes an optimizer to deliver low latency and high throughput for deep learning applications. It can import trained models from different deep learning frameworks such as PyTorch, TensorFlow and MXNet. TensorRT version 5 supports Turing GPUs, but INT4 precision was not yet supported by the TensorRT version used at the time of publication, so INT4 performance is not discussed in this blog.
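As an illustration of how such benchmarks are typically driven, TensorRT ships with the trtexec benchmarking tool, which can build and time an engine at a chosen precision. The helper below is a minimal sketch of assembling a TensorRT 5-era trtexec invocation for a Caffe-style model; the model file name and output blob name are hypothetical, not the exact commands used in this study.

```python
def build_trtexec_cmd(prototxt, output_blob="prob", batch=128, precision="fp32"):
    """Assemble a trtexec command line for a Caffe deploy prototxt.

    precision: "fp32" (default), "fp16", or "int8" (TensorRT 5-era flags).
    """
    cmd = [
        "trtexec",
        f"--deploy={prototxt}",   # Caffe deploy file describing the network
        f"--output={output_blob}",  # name of the output blob to time
        f"--batch={batch}",         # batch size used for the benchmark
    ]
    if precision == "fp16":
        cmd.append("--fp16")
    elif precision == "int8":
        cmd.append("--int8")
    elif precision != "fp32":
        raise ValueError(f"unsupported precision: {precision}")
    return cmd

# Example: benchmark a (hypothetical) ResNet-50 deploy file at INT8, batch 128
print(" ".join(build_trtexec_cmd("resnet50.prototxt", batch=128, precision="int8")))
```

Running the same command with `--fp16` or with no precision flag (FP32) yields the per-precision numbers compared in the next section.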




 

Inference performance

 

Figure 2 plots the inference performance of the pre-trained image recognition models AlexNet, GoogLeNet, ResNet-50 and VGG-19 on three GPUs: NVIDIA T4, P4 and V100. Each test was run on a single GPU of each type. The performance metric is images per second, and the graphs plot thousands of images per second; a higher value indicates better performance. The models were tested with TensorRT at INT8, FP16 and FP32 precision. The NVIDIA P4 does not support half precision, so no FP16 data point is shown for it. A batch size of 128 was used for all test cases. Since the NVIDIA T4 has 16GB of memory, the 16GB variant of the V100 was chosen for a fair performance comparison.


Figure 2: Inference performance on different image classification models

  • The T4 is ~1.4x – 2.8x faster than the P4 when using INT8 precision. Although the two cards have the same number of CUDA cores, the T4's much higher INT8 throughput (130 TOPS vs 22 TOPS) delivers the improvement.
  • The V100 is ~1.1x – 2.1x faster than the T4 when using INT8 precision. Comparing FP16 precision, the V100 performs ~3x – 4x better than the T4, with the improvement varying by model. This is the expected performance from a card with half the CUDA cores and less than one-third the wattage of the V100, making the T4 a compelling solution for use cases where reduced power consumption is key.
  • Comparing INT8 with FP32 precision on the T4, a ~4.6x – 9.5x speedup was measured when using INT8.
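The images-per-second metric behind these comparisons follows directly from batch size and measured batch latency. A minimal sketch of the arithmetic (the latencies below are illustrative placeholders, not measurements from this study):

```python
def images_per_second(batch_size, batch_latency_s):
    """Throughput in images/s when one batch completes in batch_latency_s seconds."""
    return batch_size / batch_latency_s

def speedup(throughput_a, throughput_b):
    """How many times faster configuration A is than configuration B."""
    return throughput_a / throughput_b

# Illustrative numbers only: batch of 128 at two hypothetical latencies
int8_ips = images_per_second(128, 0.020)   # 6400 images/s
fp32_ips = images_per_second(128, 0.128)   # 1000 images/s
print(round(speedup(int8_ips, fp32_ips), 1))  # 6.4, within the ~4.6x - 9.5x range reported
```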




 

Multi-GPU Inference Performance

 

Multiple GPUs in a system can handle multiple inference jobs simultaneously, providing high throughput in addition to low latency. Since there is no communication between the GPUs when multiple independent inference processes are launched, a linear speedup is expected. A similar test conducted on NVIDIA P40 GPUs was published in one of our earlier blogs, and we expect multi-GPU inference performance for the T4 to scale linearly as well. Multi-T4 inference tests are planned as future work for this project.
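One common way to launch independent inference processes, one per GPU, is to pin each process to its own device with the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch, where the inference command itself is a placeholder:

```python
import os
import subprocess

def launch_per_gpu(cmd, num_gpus):
    """Start one copy of cmd per GPU, each pinned to a distinct device.

    Each child sees exactly one GPU via CUDA_VISIBLE_DEVICES.
    Returns the list of Popen handles; the caller should wait() on them.
    """
    procs = []
    for gpu in range(num_gpus):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return procs

# Example (placeholder inference command): three T4s, one INT8 job each
# procs = launch_per_gpu(["trtexec", "--deploy=model.prototxt",
#                         "--output=prob", "--batch=128", "--int8"], 3)
# for p in procs:
#     p.wait()
```

Because each process owns its GPU and they share nothing, aggregate throughput is simply the sum of the per-GPU throughputs, which is why linear scaling is expected.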




 

Accuracy tests

 

This section compares the accuracy of the INT8 and FP32 precision methods. In the inference tests in Figure 2 with TensorRT, INT8 was measured to be 4.5x – 9.5x faster than FP32 across the different image recognition models. The goal is to validate that this faster performance does not come at the expense of accuracy.

Several pre-trained models were used in this benchmarking, including AlexNet, GoogLeNet, ResNet-50 and ResNet-101. The binary used for this test is part of TensorRT. All models use the same validation dataset, which contains 50,000 images divided into 2000 batches of 25 images. The first 50 batches were used for calibration purposes and the rest were used for accuracy measurement.

Top-1 accuracy is the probability that the model classifies the image correctly. Top-5 accuracy is the probability that the correct class is among the model's five highest-probability predictions. The measured accuracy loss between INT8 and FP32 is within 0.5%, while up to a 9.5x performance improvement can be achieved when using INT8 precision.

Table 3 shows the accuracy tests on different image classification models:

 
Network     | FP32 Top-1 | FP32 Top-5 | INT8 Top-1 | INT8 Top-5 | Top-1 Difference | Top-5 Difference
AlexNet     | 56.82%     | 79.99%     | 56.76%     | 79.97%     | 0.07%            | 0.02%
GoogLeNet   | 68.95%     | 89.12%     | 68.75%     | 88.99%     | 0.20%            | 0.13%
ResNet-101  | 74.33%     | 91.95%     | 74.34%     | 91.85%     | -0.02%           | 0.10%
ResNet-50   | 72.90%     | 91.14%     | 72.77%     | 91.06%     | 0.13%            | 0.08%

Table 3: Accuracy tests on different image classification models
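Top-1 and Top-5 accuracy as defined above can be computed from per-image class scores in a few lines; a minimal sketch in plain Python, using made-up toy scores rather than any data from this study:

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores per image; labels: true class indices.
    """
    hits = 0
    for per_class, truth in zip(scores, labels):
        # Indices of the k classes with the highest scores for this image
        top_k = sorted(range(len(per_class)),
                       key=lambda c: per_class[c], reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(labels)

# Toy example: 3 images, 4 classes
scores = [[0.1, 0.6, 0.2, 0.1],   # highest score: class 1
          [0.5, 0.1, 0.3, 0.1],   # highest score: class 0
          [0.2, 0.3, 0.1, 0.4]]   # highest score: class 3
labels = [1, 2, 3]
print(top_k_accuracy(scores, labels, k=1))  # 2/3: second image misclassified
print(top_k_accuracy(scores, labels, k=2))  # 1.0: class 2 is the second image's runner-up
```

Running the same computation once on FP32 outputs and once on INT8 outputs gives the paired columns of Table 3.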




 

Conclusion

 

This blog introduced the NVIDIA T4 inference card and described the inference performance of different image recognition models on the T4, P4 and V100 GPUs. The small PCIe form factor and low wattage of the T4 card make it easy to use in Dell EMC PowerEdge systems. Comparing INT8 precision on the new T4 with the previous-generation P4, a ~1.4x – 2.8x performance improvement was measured on the T4. The accuracy tests demonstrated minimal difference between FP32 and INT8, with up to a 9.5x speedup when using INT8 precision.




 
Disclaimer: For the purpose of benchmarking, four T4 GPUs in the Dell PowerEdge R740 were evaluated. Currently, the PowerEdge R740 officially supports a maximum of three T4 GPUs in x16 PCIe slots.

Affected Products

High Performance Computing Solution Resources, PowerEdge R740