Article written by Deepthi Cherlopalle, Xu Rengan, Frank Han and Quy Ta of HPC and AI Innovation Lab in March 2019
There are two main phases in machine learning: training and inference. Training a neural network involves choosing the number of epochs, the batch size, the learning rate, and other hyperparameters to optimize. After hours of training, a static model is generated that can be deployed anywhere. In the next phase, inference, real-world data is fed into the trained model to generate predictions. The goal of training is to build a neural network with high accuracy, while the goal of inference is to detect or identify quickly, i.e., with an emphasis on latency. There are many options available for inference, and in this blog we introduce the NVIDIA® T4 GPU and compare its performance with the previous-generation NVIDIA Pascal™ P4 inference-focused GPU. We also compare against the NVIDIA Volta™ V100, which is a good option for the training phase as well. These tests were conducted on a Dell EMC PowerEdge R740 server.
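To make the two phases concrete, the toy sketch below trains a one-parameter linear model with gradient descent and then runs inference with the resulting static model. The model, data and hyperparameter values are hypothetical illustrations, not part of the benchmarks in this blog.

```python
# Toy illustration of the two phases: training produces a static model;
# inference applies that model to new data without any weight updates.

def train(data, epochs=100, learning_rate=0.05):
    """Training phase: iterate over the data for a number of epochs,
    adjusting the weight to minimize squared error (learns y = 2x here)."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= learning_rate * grad
    return w  # the "static model" that can be deployed anywhere

def infer(w, x):
    """Inference phase: a single fast forward pass."""
    return w * x

model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(infer(model, 5.0), 2))  # close to 10.0
```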
NVIDIA’s latest GPU based on the Turing™ microarchitecture is the Tesla® T4. This card has been specifically designed for deep learning training and inferencing. The NVIDIA T4 is an x16 PCIe Gen3 low-profile card, and its small form factor makes it easy to install in PowerEdge servers. The Tesla T4 supports a full range of precisions for inference: FP32, FP16, INT8 and INT4.
Figure 1: NVIDIA T4 card [Source: NVIDIA website]
The table below compares the performance capabilities of different NVIDIA GPU cards.
GPU | Tesla V100-PCIe | Tesla T4 | Tesla P4 | Tesla T4 vs. P4 |
---|---|---|---|---|
Architecture | Volta | Turing | Pascal | |
NVIDIA CUDA Cores | 5120 | 2560 | 2560 | Same number of cores |
GPU Clock | 1245 MHz | 585 MHz | 885 MHz | |
Boost Clock | 1380 MHz | 1590 MHz | 1531 MHz | |
Single-Precision Performance (FP32) | 14 TFLOPS | 8.1 TFLOPS | 5.5 TFLOPS | ~1.48x higher |
Half-Precision Performance (FP16) | 112 TFLOPS | 65 TFLOPS | N/A | Introduced in T4 |
Integer Operations (INT8) | 224 TOPS | 130 TOPS | 22 TOPS | ~5.9x higher |
Integer Operations (INT4) | N/A | 260 TOPS | N/A | Introduced in T4 |
GPU Memory | 16GB | 16GB | 8GB | 2x more |
Memory Bandwidth | 900 GB/s | 320 GB/s | 192 GB/s | ~1.7x higher |
Power | 250W | 70W | 50W/75W | |
Table 1: Comparison of NVIDIA GPU Cards
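The INT8 and INT4 modes in Table 1 rely on quantization: mapping floating-point values to small integers via a scale factor, trading a small rounding error for much higher throughput. The sketch below shows minimal symmetric quantization in plain Python; it is an illustration of the general idea, not TensorRT's actual calibration algorithm.

```python
# Minimal sketch of symmetric integer quantization, the idea behind the
# INT8/INT4 inference modes compared in Table 1.

def quantize(values, num_bits=8):
    """Map floats to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]
    using a single per-tensor scale derived from the largest magnitude."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in q_values]

activations = [0.1, -1.27, 0.64, 1.0]       # hypothetical activations
q, scale = quantize(activations)
restored = dequantize(q, scale)
print(q)                                    # integers in [-127, 127]
print([round(r, 3) for r in restored])      # close to the originals
```

Integer math with a shared scale is what lets the T4's INT8 path reach 130 TOPS; the reconstruction error is bounded by half the scale per value, which is why accuracy loss can stay small (as shown later in Table 3).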
The Dell EMC™ PowerEdge™ R740 is a 2U, two-socket platform with support for two Intel® Xeon® Scalable processors, dense storage options, high-speed interconnects and various GPUs. The PowerEdge R740 can support up to three NVIDIA T4, P4, or V100 PCIe cards in x16 slots.
The following table describes the hardware and software configuration used for the inference study.
Server | Dell EMC PowerEdge R740 |
---|---|
Processor | Dual Intel Xeon Gold 6136 CPU @ 3.00GHz, 12 cores |
Memory | 384GB @ 2667 MT/s |
GPU | NVIDIA T4 / NVIDIA P4 / NVIDIA V100 |
Power Supplies | Dual 1600W |
BIOS | 1.4.5 |
Operating System | RHEL 7.4 |
Kernel | 3.10.0-693.el7.x86_64 |
System Profile | Performance Optimized |
CUDA driver | 410.66 |
CUDA toolkit | 10.0 |
TensorRT | 5.0.2.6 |
Image Classification Models | AlexNet, GoogLeNet, ResNet-50, VGG-19 |
Table 2: Testbed information
TensorRT is a software platform for deep learning inference that includes an inference optimizer to deliver low latency and high throughput for deep learning applications. It can import trained models from different deep learning frameworks such as PyTorch, TensorFlow and MXNet. TensorRT version 5 supports Turing GPUs, but at the time of publication of this blog INT4 precision was not supported by the TensorRT version used, so INT4 performance is not discussed here.
Figure 2 plots the inference performance of the pre-trained image recognition models AlexNet, GoogLeNet, ResNet-50 and VGG-19 on three different GPUs: NVIDIA T4, P4 and V100. Each test was run on a single GPU of each kind. The performance metric is images per second, and the graphs plot thousands of images per second; a higher value indicates better performance. The models were tested with TensorRT at three precisions: INT8, FP16 and FP32. The NVIDIA P4 does not support half precision, so that data point is absent from the graphs. A batch size of 128 was used for these test cases. Since the NVIDIA T4 card has 16GB of memory, the 16GB variant of the V100 was chosen for a fair comparison.
Figure 2: Inference performance on different image classification models
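The images/second metric above is simply the batch size divided by the time to process one batch. The sketch below shows the relationship; the 30 ms per-batch latency is a hypothetical placeholder, not a measured result from these tests.

```python
# How the images/second metric relates to batch size and per-batch latency.
# The timing below is a hypothetical placeholder, not a measured result.

def throughput(batch_size, seconds_per_batch):
    """Images processed per second by one inference stream."""
    return batch_size / seconds_per_batch

# Example: batch size 128 (as in these tests) with an assumed 30 ms per
# batch gives ~4,267 images/s, i.e. ~4.27 on a graph plotted in
# thousands of images per second.
imgs_per_sec = throughput(128, 0.030)
print(round(imgs_per_sec))
```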
Multiple GPUs in a system can handle multiple inference jobs simultaneously, providing high throughput in addition to low latency. Since there is no communication between the GPUs when multiple inference processes are launched, a linear speedup is expected. This test was conducted on NVIDIA P40 GPUs and published in one of our earlier blogs, and we expect multi-GPU inference performance on the T4 to scale linearly as well. Multi-T4 inference testing is planned as future work for this project.
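The linear-speedup expectation can be stated as a simple estimate: aggregate throughput is per-GPU throughput times GPU count, optionally discounted by an efficiency factor for any host-side overhead. The numbers below are hypothetical placeholders.

```python
# Linear-scaling estimate for independent inference processes, one per
# GPU, with no inter-GPU communication. The single-GPU throughput value
# is a hypothetical placeholder, not a measured result.

def expected_aggregate(single_gpu_imgs_per_sec, num_gpus, efficiency=1.0):
    """Aggregate throughput estimate; efficiency < 1.0 models overhead."""
    return single_gpu_imgs_per_sec * num_gpus * efficiency

print(expected_aggregate(4000, 3))        # three T4s, ideal: 12000 images/s
print(expected_aggregate(4000, 3, 0.95))  # with 5% assumed overhead: ~11400
```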
This section compares the accuracy of the different precision methods: INT8, FP16 and FP32. In the inference tests in Figure 2 with TensorRT, INT8 was measured to be 4.5x – 9.5x faster than FP32 across the different image recognition models. The goal is to validate that this faster performance does not come at the expense of accuracy.
Several pre-trained models were used in our benchmarking, including AlexNet, GoogLeNet, ResNet-50 and ResNet-101. The binary used for this test is part of TensorRT. All models use the same validation dataset, which contains 50,000 images divided into 2,000 batches of 25 images each. The first 50 batches were used for calibration and the rest for accuracy measurement.
Top-1 accuracy is the probability that the model's single highest-probability prediction is the correct class. Top-5 accuracy is the probability that the correct class is among the model's five highest-probability predictions. Table 3 shows the results of the accuracy tests on the different image classification models: the accuracy difference between INT8 and FP32 is within 0.5%, while up to a 9.5x performance improvement can be achieved with INT8 precision.
Network | FP32 Top-1 | FP32 Top-5 | INT8 Top-1 | INT8 Top-5 | Top-1 Difference | Top-5 Difference |
---|---|---|---|---|---|---|
AlexNet | 56.82% | 79.99% | 56.76% | 79.97% | 0.07% | 0.02% |
GoogLeNet | 68.95% | 89.12% | 68.75% | 88.99% | 0.2% | 0.13% |
ResNet-101 | 74.33% | 91.95% | 74.34% | 91.85% | -0.02% | 0.1% |
ResNet-50 | 72.9% | 91.14% | 72.77% | 91.06% | 0.13% | 0.08% |
Table 3: Accuracy tests on different image classification models
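The Top-1 and Top-5 metrics in Table 3 can be computed from a model's per-class scores as sketched below. The scores and labels are tiny hypothetical examples, not outputs of the benchmarked models.

```python
# Sketch of how Top-1 and Top-5 accuracy are computed from class scores.
# The scores and labels below are tiny hypothetical examples.

def top_k_correct(scores, true_label, k):
    """True if true_label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:k]

def accuracy(batch_scores, labels, k):
    hits = sum(top_k_correct(s, y, k) for s, y in zip(batch_scores, labels))
    return hits / len(labels)

# Three images, six classes: scores from a hypothetical classifier.
scores = [
    [0.1, 0.7, 0.05, 0.05, 0.05, 0.05],  # label 1 ranks 1st -> Top-1 hit
    [0.3, 0.1, 0.25, 0.2, 0.1, 0.05],    # label 2 ranks 2nd -> Top-5 hit only
    [0.4, 0.2, 0.15, 0.1, 0.1, 0.05],    # label 5 ranks last -> miss
]
labels = [1, 2, 5]
print(accuracy(scores, labels, k=1))  # Top-1: 1 of 3 correct
print(accuracy(scores, labels, k=5))  # Top-5: 2 of 3 correct
```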
This blog introduced the NVIDIA T4 inference card and compared the inference performance of different image recognition models on the T4, P4 and V100 GPUs. The small PCIe form factor and low wattage of the T4 make it easy to use in Dell EMC PowerEdge systems. Comparing INT8 precision on the new T4 and the previous-generation P4, a 1.5x – 2.7x performance improvement was measured on the T4. The accuracy tests demonstrated minimal difference between FP32, FP16 and INT8, with up to a 9.5x speedup when using INT8 precision.
Disclaimer: For the purpose of benchmarking, four T4 GPUs in the Dell EMC PowerEdge R740 were evaluated. Currently the PowerEdge R740 officially supports a maximum of three T4 cards in x16 PCIe slots.