Deep Learning Performance on T4 GPUs with MLPerf Inference v0.5 Benchmarks

Article written by Rengan Xu, Frank Han and Quy Ta of the HPC and AI Innovation Lab in November 2019.


Deep learning inference performance was evaluated on a Dell EMC PowerEdge R740 using the MLPerf inference v0.5 benchmarks. The evaluation was performed with four Nvidia Tesla T4 GPUs in a single R740 server. The results indicate that the system delivered the top inference performance, normalized to processor count, among commercially available results.


Inference is the goal of deep learning after neural network model training. Inference can be performed in data centers, at the edge, and in IoT devices. Each of these environments has different requirements, so it is difficult to evaluate their performance with a unified benchmark. MLPerf is the new industry-standard benchmark suite with the goal of measuring both training and inference performance of machine learning systems. The first MLPerf inference v0.5 benchmarks and results were published recently. Table 1 lists all benchmarks and datasets available in MLPerf inference v0.5.

In the MLPerf inference evaluation framework, a load generator called LoadGen sends inference queries to the system under test (SUT); the SUT uses a backend (e.g., TensorRT, TensorFlow, PyTorch) to perform the inference and sends the results back to LoadGen. There are four scenarios governing how queries are sent and received:

  • Server: Queries are sent to the SUT following a Poisson distribution (to model real-world random arrivals). Each query contains one sample. The metric is queries per second (QPS) within a latency bound.
  • Offline: A single query containing all samples is sent to the SUT. The SUT can return the results once or in multiple batches, in any order. The metric is samples per second.
  • Single-Stream: One sample per query is sent to the SUT. The next query is not sent until the previous response is received. The metric is the 90th-percentile latency.
  • Multi-Stream: A query with N samples is sent at a fixed interval. The metric is the maximum N for which the latency of all queries stays within a latency bound.
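The Server scenario's validity check can be illustrated with a toy queueing simulation: queries arrive with exponential inter-arrival times (a Poisson process) and a single accelerator serves them in order. The rates, service times, and latency bound below are illustrative assumptions, not part of the actual MLPerf harness.

```python
import random
import statistics

def simulate_server_scenario(qps, service_time_s, num_queries=10_000,
                             latency_bound_s=0.05, seed=0):
    """Toy single-accelerator model of the MLPerf Server scenario:
    Poisson arrivals at rate `qps`, fixed per-query service time."""
    rng = random.Random(seed)
    t = 0.0        # arrival clock
    free_at = 0.0  # time the accelerator next becomes free
    latencies = []
    for _ in range(num_queries):
        t += rng.expovariate(qps)      # exponential inter-arrival gap
        start = max(t, free_at)        # wait if the accelerator is busy
        free_at = start + service_time_s
        latencies.append(free_at - t)  # queueing delay + service time
    # MLPerf declares a Server run valid only if tail latency is in bound.
    p99 = statistics.quantiles(latencies, n=100)[98]
    return p99, p99 <= latency_bound_s

# Lightly loaded system: the latency bound is met.
p99, valid = simulate_server_scenario(qps=100, service_time_s=0.002)
# Overloaded system (arrival rate exceeds service rate): queries pile up.
p99_over, valid_over = simulate_server_scenario(qps=100, service_time_s=0.02)
```

The simulation shows why the Server metric is "QPS within a latency bound": pushing the offered QPS past what the hardware can absorb makes the queue, and hence the tail latency, grow without limit.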

The detailed inference rules and latency constraints are described here. This blog focuses only on the Server and Offline scenarios, as they are designed for data center environments, while Single-Stream and Multi-Stream target edge and IoT devices.
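The LoadGen-to-SUT flow described above can be mocked in a few lines. The class and method names here mirror the callback pattern of the v0.5 harness but are illustrative stand-ins, not the real mlperf_loadgen bindings; a real SUT would forward the samples to an inference backend such as TensorRT.

```python
from dataclasses import dataclass

@dataclass
class Query:
    id: int            # LoadGen-assigned query id
    sample_index: int  # index into the benchmark dataset

class MockSUT:
    """Minimal stand-in for a system under test. Instead of running a
    real backend, it records a fake prediction per query id."""
    def __init__(self):
        self.responses = {}

    def issue_queries(self, queries):
        # LoadGen hands the SUT a batch of queries; the SUT may answer
        # them in any order (which the Offline scenario exploits).
        for q in queries:
            self.responses[q.id] = f"prediction-for-sample-{q.sample_index}"

    def flush_queries(self):
        # Called once LoadGen has issued all queries for the run.
        pass

sut = MockSUT()
sut.issue_queries([Query(0, 7), Query(1, 3)])
sut.flush_queries()
```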

Figure 1 shows the hardware topology of the Dell EMC PowerEdge R740 used in the inference evaluation. It has dual Intel Xeon Skylake CPUs and four Nvidia Tesla T4 GPUs. Each CPU is connected to two GPUs via two PCIe x16 buses. This ensures a balanced configuration, and the high number of PCIe lanes guarantees fast data transfer between CPU and GPU. In the performance evaluation, the Nvidia TensorRT 6.0 library was used as the inference backend. The library was included with the NGC TensorRT 19.09 container.

TensorRT 6.0 adds support for new features, including reformat-free I/O and layer fusion, which help accelerate inference in the MLPerf benchmarks. Table 2 gives a detailed list of the hardware and software used in the inference evaluation.

Performance Evaluation

In order to achieve optimal inference results, some parameter tuning is necessary. As shown in our previous blog "Deep Learning Inference on P40 vs P4 with Skylake", inference throughput increases with batch size; however, it may plateau or even decrease beyond a certain point. Therefore, the optimal batch size needs to be found for both the Server and Offline scenarios. For the Server scenario, the optimal batch size must also satisfy the latency constraint.
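The batch-size search can be sketched with a simple cost model: each batch pays a fixed launch overhead plus a per-sample cost, so throughput rises with batch size and flattens out, while per-query latency grows linearly. The overhead and per-sample numbers below are made up for illustration, not measured T4 figures.

```python
def throughput(batch, overhead_ms=2.0, per_sample_ms=0.5):
    """Toy latency model: latency = fixed overhead + per-sample cost.
    Returns (samples/sec, batch latency in ms)."""
    latency_ms = overhead_ms + per_sample_ms * batch
    return batch / (latency_ms / 1000.0), latency_ms

def best_batch(candidates, latency_bound_ms=None):
    """Pick the candidate batch size with the highest modeled throughput;
    for the Server scenario, also enforce the per-query latency bound."""
    best = None
    for b in candidates:
        tput, lat = throughput(b)
        if latency_bound_ms is not None and lat > latency_bound_ms:
            continue  # would violate the Server latency constraint
        if best is None or tput > best[1]:
            best = (b, tput)
    return best

sizes = [1, 2, 4, 8, 16, 32, 64, 128]
offline = best_batch(sizes)                      # unconstrained: largest wins
server = best_batch(sizes, latency_bound_ms=15)  # must meet the bound
```

Under this model, the Offline scenario favors the largest batch (throughput only), while the Server scenario settles on a smaller batch that keeps each query under the latency bound, which matches the tuning trade-off described above.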

Table 3 shows the results of all MLPerf inference benchmarks for the Server and Offline scenarios. The Dell EMC R740 with four T4 GPUs delivered the top inference performance normalized to processor count among commercially available results. All publicly available MLPerf inference v0.5 results are available here.


In this blog, we quantified the inference performance on a Dell EMC PowerEdge R740 server with four Nvidia Tesla T4 GPUs, using MLPerf Inference v0.5 benchmarks. The system delivered the top inference performance normalized to processor count among commercially available results.

Quick Tips content is self-published by the Dell Support Professionals who resolve issues daily. In order to achieve a speedy publication, Quick Tips may represent only partial solutions or work-arounds that are still in development or pending further proof of successfully resolving an issue. As such Quick Tips have not been reviewed, validated or approved by Dell and should be used with appropriate caution. Dell shall not be liable for any loss, including but not limited to loss of data, loss of profit or loss of revenue, which customers may incur by following any procedure or advice set out in the Quick Tips.

Article ID: SLN319502

Last Date Modified: 11/18/2019 01:20 PM
