Article written by Deepthi Cherlopalle, Xu Rengan, Frank Han and Quy Ta of HPC and AI Innovation Lab in March 2019
There are two main phases in machine learning: training and inference. Training a neural network involves choosing the number of epochs, the batch size, the learning rate, and other hyperparameters to optimize. After hours of training, a static model is generated that can be deployed anywhere. In the next phase, inference, real-world data is fed into the trained model to generate predictions. The goal of training is to build a neural network with high accuracy, while the goal of inference is to detect or identify quickly, i.e., with an emphasis on latency. There are many options available for inference, and in this blog we introduce the NVIDIA® T4 GPU and compare its performance with the previous-generation NVIDIA Pascal™ P4 inference-focused GPU. We also compare against the NVIDIA Volta™ V100, which is a good option for the training phase as well. These tests were conducted on a Dell EMC PowerEdge R740 server.
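To make the two phases concrete, the toy sketch below trains a one-parameter linear model with gradient descent and then runs inference with the resulting static model. The model, data and hyperparameter values are hypothetical illustrations, not part of the benchmarks in this blog.

```python
# Toy illustration of the two phases: training produces a static model;
# inference applies that model to new data without any weight updates.

def train(data, epochs=100, learning_rate=0.05):
    """Training phase: iterate over the data for a number of epochs,
    adjusting the weight to minimize squared error (learns y = 2x here)."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= learning_rate * grad
    return w  # the "static model" that can be deployed anywhere

def infer(w, x):
    """Inference phase: a single fast forward pass."""
    return w * x

model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(infer(model, 5.0), 2))  # close to 10.0
```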
NVIDIA’s latest GPU based on the Turing™ microarchitecture is the Tesla® T4. This card has been specifically designed for deep learning training and inferencing. The NVIDIA T4 is an x16 PCIe Gen3 low-profile card, and its small form factor makes it easy to install in PowerEdge servers. The Tesla T4 supports a full range of precisions for inference: FP32, FP16, INT8 and INT4.
Figure 1: NVIDIA T4 card [Source: NVIDIA website]
The table below compares the performance capabilities of different NVIDIA GPU cards.
GPU | Tesla V100-PCIe | Tesla T4 | Tesla P4 | Tesla T4 vs. P4 |
---|---|---|---|---|
Architecture | Volta | Turing | Pascal | |
NVIDIA CUDA Cores | 5120 | 2560 | 2560 | Same number of cores |
GPU Clock | 1245 MHz | 585 MHz | 885 MHz | |
Boost Clock | 1380 MHz | 1590 MHz | 1531 MHz | |
Single-Precision Performance (FP32) | 14 TFLOPS | 8.1 TFLOPS | 5.5 TFLOPS | ~1.48x higher |
Half-Precision Performance (FP16) | 112 TFLOPS | 65 TFLOPS | N/A | Introduced in T4 |
Integer Operations (INT8) | 224 TOPS | 130 TOPS | 22 TOPS | ~5.9x higher |
Integer Operations (INT4) | N/A | 260 TOPS | N/A | Introduced in T4 |
GPU Memory | 16GB | 16GB | 8GB | 2x more |
Memory Bandwidth | 900 GB/s | 320 GB/s | 192 GB/s | ~1.7x higher |
Power | 250W | 70W | 50W/75W | |
Table 1: Comparison of NVIDIA GPU Cards
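The INT8 and INT4 modes in Table 1 rely on quantization: mapping floating-point values to small integers via a scale factor, trading a small rounding error for much higher throughput. The sketch below shows minimal symmetric quantization in plain Python; it is an illustration of the general idea, not TensorRT's actual calibration algorithm.

```python
# Minimal sketch of symmetric integer quantization, the idea behind the
# INT8/INT4 inference modes compared in Table 1.

def quantize(values, num_bits=8):
    """Map floats to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]
    using a single per-tensor scale derived from the largest magnitude."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in q_values]

activations = [0.1, -1.27, 0.64, 1.0]       # hypothetical activations
q, scale = quantize(activations)
restored = dequantize(q, scale)
print(q)                                    # integers in [-127, 127]
print([round(r, 3) for r in restored])      # close to the originals
```

Integer math with a shared scale is what lets the T4's INT8 path reach 130 TOPS; the reconstruction error is bounded by half the scale per value, which is why accuracy loss can stay small (as shown later in Table 3).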
The Dell EMC™ PowerEdge™ R740 is a 2U, two-socket platform with support for two Intel® Xeon® Scalable processors, dense storage options, high-speed interconnects and various GPUs. The PowerEdge R740 can support up to three NVIDIA T4, P4, or V100 PCIe cards in x16 slots.
The following table describes the hardware and software configuration used for the inference study.
Server | Dell EMC PowerEdge R740 |
---|---|
Processor | Dual Intel Xeon Gold 6136 CPU @ 3.00GHz, 12 cores |
Memory | 384GB @ 2667 MT/s |
GPU | NVIDIA T4 / NVIDIA P4 / NVIDIA V100 |
Power Supplies | Dual 1600W |
BIOS | 1.4.5 |
Operating System | RHEL 7.4 |
Kernel | 3.10.0-693.el7.x86_64 |
System Profile | Performance Optimized |
CUDA driver | 410.66 |
CUDA toolkit | 10.0 |
TensorRT | 5.0.2.6 |
Image Classification Models | AlexNet, GoogLeNet, ResNet-50, VGG-19 |
Table 2: Testbed information
TensorRT is a software platform for deep learning inference that includes an inference optimizer to deliver low latency and high throughput for deep learning applications. It can import trained models from different deep learning frameworks such as PyTorch, TensorFlow and MXNet. TensorRT version 5 supports Turing GPUs, but at the time of publication of this blog INT4 precision was not supported by the TensorRT version used, so INT4 performance is not discussed here.
Figure 2 plots the inference performance of the pre-trained image recognition models AlexNet, GoogLeNet, ResNet-50 and VGG-19 on three different GPUs: NVIDIA T4, P4 and V100. Each test was run on a single GPU of each kind. The performance metric is images per second, and the graphs plot thousands of images per second; a higher value indicates better performance. The models were tested with TensorRT at three precisions: INT8, FP16 and FP32. The NVIDIA P4 does not support half precision, so that data point is absent from the graphs. A batch size of 128 was used for these test cases. Since the NVIDIA T4 card has 16GB of memory, the 16GB variant of the V100 was chosen for a fair comparison.
Figure 2: Inference performance on different image classification models
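The images/second metric above is simply the batch size divided by the time to process one batch. The sketch below shows the relationship; the 30 ms per-batch latency is a hypothetical placeholder, not a measured result from these tests.

```python
# How the images/second metric relates to batch size and per-batch latency.
# The timing below is a hypothetical placeholder, not a measured result.

def throughput(batch_size, seconds_per_batch):
    """Images processed per second by one inference stream."""
    return batch_size / seconds_per_batch

# Example: batch size 128 (as in these tests) with an assumed 30 ms per
# batch gives ~4,267 images/s, i.e. ~4.27 on a graph plotted in
# thousands of images per second.
imgs_per_sec = throughput(128, 0.030)
print(round(imgs_per_sec))
```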
Multiple GPUs in a system can handle multiple inference jobs simultaneously, providing high throughput in addition to low latency. Since there is no communication between the GPUs when multiple inference processes are launched, a linear speedup is expected. This test was conducted on NVIDIA P40 GPUs and published in one of our earlier blogs, and we expect multi-GPU inference performance on the T4 to scale linearly as well. Multi-T4 inference testing is planned as future work for this project.
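The linear-speedup expectation can be stated as a simple estimate: aggregate throughput is per-GPU throughput times GPU count, optionally discounted by an efficiency factor for any host-side overhead. The numbers below are hypothetical placeholders.

```python
# Linear-scaling estimate for independent inference processes, one per
# GPU, with no inter-GPU communication. The single-GPU throughput value
# is a hypothetical placeholder, not a measured result.

def expected_aggregate(single_gpu_imgs_per_sec, num_gpus, efficiency=1.0):
    """Aggregate throughput estimate; efficiency < 1.0 models overhead."""
    return single_gpu_imgs_per_sec * num_gpus * efficiency

print(expected_aggregate(4000, 3))        # three T4s, ideal: 12000 images/s
print(expected_aggregate(4000, 3, 0.95))  # with 5% assumed overhead: ~11400
```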
This section compares the accuracy of the different precision methods: INT8, FP16 and FP32. In the inference tests in Figure 2 with TensorRT, INT8 was measured to be 4.5x – 9.5x faster than FP32 across the different image recognition models. The goal is to validate that this faster performance does not come at the expense of accuracy.
Several pre-trained models were used in our benchmarking, including AlexNet, GoogLeNet, ResNet-50 and ResNet-101. The binary used for this test is part of TensorRT. All models use the same validation dataset, which contains 50,000 images divided into 2,000 batches of 25 images each. The first 50 batches were used for calibration and the rest for accuracy measurement.
Top-1 accuracy is the probability that the model's single highest-probability prediction is the correct class. Top-5 accuracy is the probability that the correct class is among the model's five highest-probability predictions. Table 3 shows the results of the accuracy tests on the different image classification models: the accuracy difference between INT8 and FP32 is within 0.5%, while up to a 9.5x performance improvement can be achieved with INT8 precision.
Network | FP32 Top-1 | FP32 Top-5 | INT8 Top-1 | INT8 Top-5 | Top-1 Difference | Top-5 Difference |
---|---|---|---|---|---|---|
AlexNet | 56.82% | 79.99% | 56.76% | 79.97% | 0.07% | 0.02% |
GoogLeNet | 68.95% | 89.12% | 68.75% | 88.99% | 0.2% | 0.13% |
ResNet-101 | 74.33% | 91.95% | 74.34% | 91.85% | -0.02% | 0.1% |
ResNet-50 | 72.9% | 91.14% | 72.77% | 91.06% | 0.13% | 0.08% |
Table 3: Accuracy tests on different image classification models
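The Top-1 and Top-5 metrics in Table 3 can be computed from a model's per-class scores as sketched below. The scores and labels are tiny hypothetical examples, not outputs of the benchmarked models.

```python
# Sketch of how Top-1 and Top-5 accuracy are computed from class scores.
# The scores and labels below are tiny hypothetical examples.

def top_k_correct(scores, true_label, k):
    """True if true_label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_label in ranked[:k]

def accuracy(batch_scores, labels, k):
    hits = sum(top_k_correct(s, y, k) for s, y in zip(batch_scores, labels))
    return hits / len(labels)

# Three images, six classes: scores from a hypothetical classifier.
scores = [
    [0.1, 0.7, 0.05, 0.05, 0.05, 0.05],  # label 1 ranks 1st -> Top-1 hit
    [0.3, 0.1, 0.25, 0.2, 0.1, 0.05],    # label 2 ranks 2nd -> Top-5 hit only
    [0.4, 0.2, 0.15, 0.1, 0.1, 0.05],    # label 5 ranks last -> miss
]
labels = [1, 2, 5]
print(accuracy(scores, labels, k=1))  # Top-1: 1 of 3 correct
print(accuracy(scores, labels, k=5))  # Top-5: 2 of 3 correct
```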
This blog introduced the NVIDIA T4 inference card and compared the inference performance of different image recognition models on the T4, P4 and V100 GPUs. The small PCIe form factor and low wattage of the T4 make it easy to use in Dell EMC PowerEdge systems. Comparing INT8 precision on the new T4 and the previous-generation P4, a 1.5x – 2.7x performance improvement was measured on the T4. The accuracy tests demonstrated minimal difference between FP32, FP16 and INT8, with up to a 9.5x speedup when using INT8 precision.
Disclaimer: For the purpose of benchmarking, four T4 GPUs in the Dell EMC PowerEdge R740 were evaluated. Currently the PowerEdge R740 officially supports a maximum of three T4 cards in x16 PCIe slots.