INFERENCE using the NVIDIA T4

Article written by Deepthi Cherlopalle, Xu Rengan, Frank Han, and Quy Ta of the HPC and AI Innovation Lab, March 2019



Table of Contents:

  1. Introduction
  2. Testbed configuration
  3. NVIDIA TensorRT
  4. Inference performance
  5. Multi-GPU Inference Performance
  6. Accuracy tests
  7. Conclusion

Introduction

There are two main phases in machine learning: training and inference. Training a neural network involves choosing the number of epochs, the batch size, the learning rate, and other hyperparameters, then optimizing them. After hours of training, a static model is generated that can be deployed anywhere. In the next phase, inference, real-world data is fed into the trained model to generate predictions. The goal of training is to build a neural network with high accuracy; the goal of inference is to detect or identify quickly, i.e., with an emphasis on latency. There are many options available for inference, and in this blog we introduce the NVIDIA® T4 GPU and compare its performance with the previous-generation NVIDIA Pascal™ P4 inference-focused GPU. We also compare against the NVIDIA Volta™ V100, which is a good option for the training phase as well. These tests were conducted on a Dell EMC PowerEdge R740 server.

NVIDIA’s latest GPU, based on the Turing™ microarchitecture, is the Tesla® T4. This card has been specifically designed for deep learning training and inferencing. The NVIDIA T4 is an x16 PCIe Gen3 low-profile card, and its small form factor makes it easy to install in PowerEdge servers. The Tesla T4 supports a full range of precisions for inference: FP32, FP16, INT8, and INT4.

Figure 1: NVIDIA T4 card [Source: NVIDIA website]

The table below compares the performance capabilities of different NVIDIA GPU cards.

GPU                                   Tesla V100-PCIe   Tesla T4      Tesla P4      Tesla T4 vs. Tesla P4
Architecture                          Volta             Turing        Pascal        -
NVIDIA CUDA Cores                     5120              2560          2560          Same number of cores
GPU Clock                             1245 MHz          585 MHz       885 MHz       -
Boost Clock                           1380 MHz          1590 MHz      1531 MHz      -
Single-Precision Performance (FP32)   14 TFLOPS         8.1 TFLOPS    5.5 TFLOPS    ~1.5x higher
Half-Precision Performance (FP16)     112 TFLOPS        65 TFLOPS     N/A           Introduced in T4
Integer Operations (INT8)             224 TOPS          130 TOPS      22 TOPS       ~5.9x higher
Integer Operations (INT4)             N/A               260 TOPS      N/A           Introduced in T4
GPU Memory                            16 GB             16 GB         8 GB          2x more
Memory Bandwidth                      900 GB/s          320 GB/s      192 GB/s      ~1.7x higher
Power                                 250 W             70 W          50 W / 75 W   -

Table 1: Comparison of NVIDIA GPU Cards

The Dell EMC™ PowerEdge™ R740 is a 2U, two-socket platform with support for two Intel® Xeon® Scalable processors, dense storage options, high-speed interconnects, and various GPUs. The PowerEdge R740 can support up to three NVIDIA T4, P4, or V100 PCIe cards in x16 slots.




Testbed configuration

The following table describes the hardware and software configuration used for the inference study.

Server                        Dell EMC PowerEdge R740
Processors                    Dual Intel Xeon Gold 6136 @ 3.00 GHz, 12 cores each
Memory                        384 GB @ 2667 MT/s
GPU                           NVIDIA T4 / NVIDIA P4 / NVIDIA V100
Power Supplies                Dual 1600 W
BIOS                          1.4.5
Operating System              RHEL 7.4
Kernel                        3.10.0-693.el7.x86_64
System Profile                Performance Optimized
CUDA Driver                   410.66
CUDA Toolkit                  10.0
TensorRT                      5.0.2.6
Image Classification Models   AlexNet, GoogLeNet, ResNet-50, VGG-19

Table 2: Testbed information




NVIDIA TensorRT

TensorRT is a software platform for deep learning inference that includes an inference optimizer to deliver low latency and high throughput for deep learning applications. It can be used to import trained models from different deep learning frameworks such as PyTorch, TensorFlow, and MXNet. TensorRT version 5 supports Turing GPUs; however, at the time of publication of this blog, INT4 precision was not supported by the TensorRT version used, so INT4 performance is not discussed here.
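To make this concrete, the snippet below is a minimal sketch of building a reduced-precision engine with the TensorRT 5 Python API. It assumes a trained model exported to ONNX; the file names are hypothetical, and an INT8 build would additionally require a calibrator (a sketch of one appears in the accuracy section below):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    # Build a TensorRT engine from a trained model exported to ONNX.
    # "model.onnx" is a hypothetical file name used for illustration.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse the ONNX model")

    builder.max_batch_size = 128          # matches the batch size used in our tests
    builder.max_workspace_size = 1 << 30  # 1 GB of scratch space for the optimizer
    builder.fp16_mode = True              # enable FP16; for INT8, set builder.int8_mode
                                          # and attach a calibrator instead

    engine = builder.build_cuda_engine(network)

    # Serialize the optimized engine so it can be reloaded for deployment.
    with open("model.engine", "wb") as f:
        f.write(engine.serialize())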




Inference performance

Figure 2 plots the inference performance of the pre-trained image recognition models AlexNet, GoogLeNet, ResNet-50, and VGG-19 on three different GPUs: the NVIDIA T4, P4, and V100. Each test was run on a single GPU of each kind. The performance metric is images per second, and the graphs plot thousands of images per second; a higher value indicates better performance. These image recognition models were tested with TensorRT for the different precision modes INT8, FP16, and FP32. The NVIDIA P4 does not support half precision, so its FP16 data points are absent from the graphs. A batch size of 128 was used for these test cases. Since the NVIDIA T4 card has 16GB of memory, we chose the 16GB variant of the V100 GPU for a fair performance comparison. A sketch of how such throughput numbers can be measured follows the result highlights below.

Figure 2: Inference performance (in thousands of images per second) of AlexNet, GoogLeNet, ResNet-50, and VGG-19 at INT8, FP16, and FP32 precision

  • The T4 is ~1.4x – 2.8x faster than the P4 when using INT8 precision. Even though the T4 and P4 have the same number of CUDA cores, the T4's much higher tera operations per second (TOPS) rating for INT8 provides the improved performance.
  • The V100 is ~1.1x – 2.1x faster than the T4 when using INT8 precision. Comparing FP16 precision for the T4 and V100, the V100 performs ~3x – 4x better than the T4, with the improvement varying by model. This is the expected performance from a card with half the CUDA cores and roughly one-third the wattage of the Volta V100, making the T4 a compelling solution for use cases where reduced power consumption is key.
  • Comparing INT8 and FP32 precision on the T4, a ~4.6x – 9.5x speedup was measured when using INT8 for the tests (see the timing sketch below).
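For reference, the following sketch shows one way such images-per-second numbers can be measured with a built TensorRT 5 engine and PyCUDA. It assumes the serialized engine from the earlier sketch, a single input binding of shape 3x224x224, and a single 1000-class output binding; all names and sizes are illustrative:

    import time
    import numpy as np
    import pycuda.autoinit              # initializes a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    BATCH, ITERS = 128, 100

    # Reload the serialized engine produced by the earlier build sketch.
    with open("model.engine", "rb") as f:
        engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Host/device buffers for one batch (assumed 3x224x224 input, 1000-class output).
    h_input = np.random.random((BATCH, 3, 224, 224)).astype(np.float32)
    h_output = np.empty((BATCH, 1000), dtype=np.float32)
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    cuda.memcpy_htod(d_input, h_input)

    # Time repeated synchronous inference on the same batch.
    start = time.time()
    for _ in range(ITERS):
        context.execute(BATCH, [int(d_input), int(d_output)])
    elapsed = time.time() - start
    cuda.memcpy_dtoh(h_output, d_output)

    print("throughput: %.0f images/second" % (BATCH * ITERS / elapsed))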




Multi-GPU Inference Performance

Multiple GPUs in a system can handle multiple inference jobs simultaneously, providing high throughput in addition to low latency. Since there is no communication between the GPUs when multiple independent inference processes are launched, a linear speedup is expected. We conducted this test on NVIDIA P40 GPUs and published the results in an earlier blog, and we expect multi-GPU inference performance for the T4 to scale linearly as well; multi-T4 inference testing is planned as future work for this project. One simple way to launch such independent per-GPU jobs is sketched below.
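A minimal sketch, assuming a hypothetical benchmark.py script that runs single-GPU inference: each process is pinned to one GPU via CUDA_VISIBLE_DEVICES, so the jobs run fully independently:

    import os
    import subprocess

    NUM_GPUS = 3  # e.g., three T4 cards in a PowerEdge R740

    # Launch one independent inference process per GPU. Because the processes
    # never communicate, aggregate throughput should scale roughly linearly.
    procs = []
    for gpu_id in range(NUM_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        procs.append(subprocess.Popen(["python", "benchmark.py"], env=env))

    for p in procs:
        p.wait()  # wait for all per-GPU jobs to finish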




Accuracy tests

This section compares the accuracy of the INT8 and FP32 precision modes. From the inference tests in Figure 2 with TensorRT, INT8 was measured to be ~4.6x – 9.5x faster than FP32 across the different image recognition models. The goal is to validate that this faster performance does not come at the expense of accuracy.

Several pre-trained models were used in our benchmarking: AlexNet, GoogLeNet, ResNet-50, and ResNet-101. The binary used for this test is part of TensorRT. All models use the same validation dataset, which contains 50,000 images divided into 2,000 batches of 25 images each. The first 50 batches were used for INT8 calibration, and the rest were used for accuracy measurement.
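INT8 calibration is how TensorRT chooses quantization scales from representative data. The following is a minimal sketch of an entropy calibrator using the TensorRT 5 Python API; the array of calibration images and the cache file name are hypothetical:

    import os
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    import tensorrt as trt

    class Calibrator(trt.IInt8EntropyCalibrator2):
        """Feeds calibration batches to TensorRT so it can pick INT8 scales."""

        def __init__(self, calib_images, batch_size=25, cache_file="calib.cache"):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.batch_size = batch_size
            self.cache_file = cache_file
            # calib_images is a hypothetical (N, 3, 224, 224) float32 array.
            self.batches = iter(np.array_split(calib_images,
                                               len(calib_images) // batch_size))
            self.d_input = cuda.mem_alloc(calib_images[0].nbytes * batch_size)

        def get_batch_size(self):
            return self.batch_size

        def get_batch(self, names):
            try:
                batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
                cuda.memcpy_htod(self.d_input, batch)
                return [int(self.d_input)]
            except StopIteration:
                return None          # no more batches: calibration is done

        def read_calibration_cache(self):
            if os.path.exists(self.cache_file):
                with open(self.cache_file, "rb") as f:
                    return f.read()

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    # Used at build time with the earlier sketch:
    #   builder.int8_mode = True
    #   builder.int8_calibrator = Calibrator(calib_images)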

Top-1 accuracy is the probability that the model's single highest-probability prediction is the correct class. Top-5 accuracy is the probability that the correct class appears among the model's five highest-probability predictions. The accuracy loss measured between INT8 and FP32 is within 0.5%, while up to a ~9.5x performance improvement can be achieved when using INT8 precision. A sketch of how these metrics are computed appears below.
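A minimal sketch of the Top-1/Top-5 computation with NumPy, assuming an array of per-image class probabilities and ground-truth labels:

    import numpy as np

    def topk_accuracy(probs, labels, k):
        """probs: (N, num_classes) model outputs; labels: (N,) ground-truth classes."""
        topk = np.argsort(probs, axis=1)[:, -k:]       # indices of the k highest scores
        hits = (topk == labels[:, None]).any(axis=1)   # is the true label among them?
        return hits.mean()

    # Example with random data; real use would feed the 50,000 validation images.
    probs = np.random.random((25, 1000))
    labels = np.random.randint(0, 1000, size=25)
    print("Top-1: %.4f  Top-5: %.4f" % (topk_accuracy(probs, labels, 1),
                                        topk_accuracy(probs, labels, 5)))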

Table 3 shows the results of the accuracy tests on the different image classification models:

Network       FP32 Top-1   FP32 Top-5   INT8 Top-1   INT8 Top-5   Top-1 Diff   Top-5 Diff
AlexNet       56.82%       79.99%       56.76%       79.97%       0.07%        0.02%
GoogLeNet     68.95%       89.12%       68.75%       88.99%       0.20%        0.13%
ResNet-101    74.33%       91.95%       74.34%       91.85%       -0.02%       0.10%
ResNet-50     72.90%       91.14%       72.77%       91.06%       0.13%        0.08%

Table 3: Accuracy tests on different image classification models




Conclusion

This blog introduced the NVIDIA T4 inference card and described the inference performance of different image recognition models on the T4, P4, and V100 GPUs. The small PCIe form factor and low wattage of the T4 card make it easy to use in Dell EMC PowerEdge systems. Comparing INT8 precision on the new T4 and the previous-generation P4, a ~1.4x – 2.8x performance improvement was measured on the T4. The accuracy tests demonstrated minimal difference between FP32 and INT8, with up to a ~9.5x speedup when using INT8 precision.




Disclaimer: For the purpose of benchmarking, four T4 GPUs in the Dell EMC PowerEdge R740 were evaluated. Currently the PowerEdge R740 officially supports a maximum of three T4 cards in x16 PCIe slots.

Article ID: SLN316556

Last Date Modified: 03/18/2019 10:16 AM

