Deep Learning Performance on V100 GPUs with ResNet-50 Model


Article written by Rengan Xu, Frank Han and Quy Ta of HPC and AI Innovation Lab in May 2019.


Abstract

Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA v1.1 and the corresponding reference architecture guide were released in February 2019. This blog quantifies the deep learning training performance of this reference architecture using the ResNet-50 model, scaling from one node up to eight nodes.

Overview

In August 2018, the initial version 1.0 of Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA was released. In February 2019, the solution was updated to version 1.1. The main difference is that version 1.1 changes the CPU-to-GPU connection topology from configuration K to configuration M; the two configurations are compared in Figure 1. Unlike configuration K, which has only one PCIe link between the two CPUs and the four GPUs, configuration M has four PCIe links between them. In addition, the memory of each GPU has grown from 16 GB in Ready Solution v1.0 to 32 GB in v1.1.


The hardware infrastructure of the solution is shown in Figure 2. It includes one PowerEdge R740xd head node, n PowerEdge C4140 compute nodes, the head node's local disks exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch, both to access the Isilon storage when it is included and to export the NFS scratch space to the compute nodes over IPoIB. All compute nodes and the head node are additionally connected to a 1 Gigabit Ethernet management switch, which serves for in-band and out-of-band management via iDRAC9 (Integrated Dell Remote Access Controller) and as the provisioning and deployment network used by Bright Cluster Manager to administer the cluster. The Isilon storage is connected to an FDR-40GigE gateway switch so that it can be accessed by the head node and all compute nodes.

Figure 2: The infrastructure of the ready solution

The ResNet-50 model was used to evaluate the performance of this ready solution. It is one of the models in the MLPerf benchmark suite, which aims to establish a standard benchmark for the machine learning field. Following the MLPerf methodology, we measured the wall-clock time to train the ResNet-50 model until it converges to the target Top-1 evaluation accuracy of 74.9%. The benchmark we used is from the NVIDIA Deep Learning Examples Git repository; we added the distributed launch script from the MXNet repository to run the model across distributed servers. The hardware and software details of this evaluation are listed in Table 1.
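The time-to-accuracy measurement described above can be sketched as a simple loop: train an epoch, evaluate Top-1 accuracy, and stop the wall clock the first time the target is reached. The sketch below is our illustration only; `train_one_epoch` and `evaluate` are hypothetical placeholders, not functions from the NVIDIA benchmark code.

```python
import time

TARGET_TOP1 = 0.749  # MLPerf target Top-1 accuracy for ResNet-50

def time_to_accuracy(train_one_epoch, evaluate, max_epochs=90):
    """Train epoch by epoch until Top-1 accuracy reaches the target.

    train_one_epoch() advances the model by one epoch;
    evaluate() returns the current Top-1 evaluation accuracy.
    Returns (elapsed_seconds, epochs_run).
    """
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= TARGET_TOP1:
            return time.time() - start, epoch
    raise RuntimeError(f"did not reach {TARGET_TOP1:.1%} in {max_epochs} epochs")
```

Because convergence, not just images/second, determines the result, this metric can penalize large-scale runs that need extra epochs, which is exactly what the eight-node results below show.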

Table 1: The hardware configuration and software details

Hardware
  Platform:          PowerEdge C4140
  CPU:               2 x Intel® Xeon® Gold 6148 @ 3.0 GHz (Skylake)
  Memory:            384 GB DDR4 @ 2666 MHz
  Storage:           96 TB Isilon F800
  GPU:               V100-SXM2 with 32 GB memory

OS and Firmware
  Operating System:  RHEL 7.5 x86_64
  Linux Kernel:      3.10.0-693.el7.x86_64
  BIOS:              1.6.12

Deep Learning related
  MXNet:             nvidia-mxnet-18.12-py3 container
  ResNet-50 v1.5:    https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 at commit 0e66c6dabb8b4c90bd637e27aeb4e67722ca95fc

Performance Evaluation

Figure 3 shows the ResNet-50 training time to the target accuracy of 74.9% with the C4140-M in ready solution v1.1, and Figure 4 shows the throughput comparison with the C4140-K in ready solution v1.0. Both throughput and time-to-accuracy are reported because the two metrics are not always correlated. The testing was scaled from one node (4 V100) to eight nodes (32 V100). The Dell EMC ready solution is a scale-out solution: it gains resources by adding more nodes. Other vendors offer scale-up solutions, which instead pack more GPUs into a single server. Figure 3 also compares our scale-out solution with other vendors' scale-up solutions. The following conclusions can be drawn from Figure 3 and Figure 4:

  • With the same number of GPUs, the scale-out and scale-up solutions achieve the same performance for this model.*
  • In Figure 3, the speedup is 1.8x, 3.3x, and 4.9x with 2, 4 and 8 nodes, respectively. The speedup with eight nodes is somewhat low because the model needs more epochs to converge at that scale: in this evaluation, the model converged after 61, 62, 65 and 81 epochs with 1, 2, 4 and 8 nodes, respectively.
  • The training throughput is much higher in solution v1.1 than in solution v1.0, mainly because of the software update.

* The data for both scale-up systems were publicly available on the MLPerf v0.5 results web page.
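The epoch counts above explain most of the sub-linear eight-node speedup: per-epoch time shrinks with more nodes, but the larger effective batch size needs more epochs to converge. A rough model using the reported epoch counts, under the simplifying assumption (ours) that per-epoch time scales perfectly with node count:

```python
# Epochs to reach 74.9% Top-1 accuracy, as observed in this evaluation.
epochs = {1: 61, 2: 62, 4: 65, 8: 81}

def ideal_speedup(nodes):
    """Time-to-accuracy speedup vs. 1 node, assuming per-epoch time
    scales perfectly as 1/nodes. This is an upper bound; the measured
    speedups (1.8x, 3.3x, 4.9x) are lower due to communication overhead."""
    t1 = epochs[1] * 1.0               # baseline cost in epoch-units
    tn = epochs[nodes] * (1.0 / nodes)  # per-epoch time shrinks with nodes
    return t1 / tn

for n in (2, 4, 8):
    print(n, round(ideal_speedup(n), 2))  # 1.97, 3.75, 6.02
```

Even under ideal per-epoch scaling, the extra epochs cap the eight-node speedup at about 6.0x rather than 8x, so the convergence penalty alone accounts for a large share of the observed gap.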

Figure 3: The time to accuracy comparison

Figure 4: The throughput comparison

Storage and Network Analysis

This section analyzes how the storage and the network are utilized. The Isilon InsightIQ tool was used to monitor the Isilon storage, and the Mellanox Unified Fabric Manager (UFM) was used to monitor the InfiniBand EDR fabric. Figure 5 shows the Isilon disk throughput with 1, 2, 4 and 8 nodes. The following conclusions can be drawn from this figure:

  • The peak disk throughput increased by ~66% each time the number of nodes was doubled.
  • The disk throughput then decreases because of caching: once the whole dataset is cached, first in the Isilon storage cache and ultimately in the system memory of each compute node, the disk throughput drops to zero. The whole dataset is 144 GB, which easily fits in the 384 GB of system memory per node.
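The two observations above can be checked with back-of-the-envelope arithmetic: the dataset-versus-memory comparison uses the sizes from Table 1, and the throughput projection applies the observed ~66% growth per doubling (an empirical fit from Figure 5, not a hardware limit; the base throughput argument is a placeholder of ours):

```python
DATASET_GB = 144    # dataset size as reported above
NODE_MEM_GB = 384   # system memory per C4140 compute node (Table 1)

# The dataset fits entirely in each node's page cache, so after the
# first pass the Isilon disks see essentially no read traffic.
assert DATASET_GB < NODE_MEM_GB

def projected_peak(base_throughput, nodes, growth=1.66):
    """Project peak Isilon disk throughput at a given node count,
    assuming ~66% growth per doubling of nodes (empirical fit)."""
    doublings = nodes.bit_length() - 1  # 1 -> 0, 2 -> 1, 4 -> 2, 8 -> 3
    return base_throughput * growth ** doublings
```

Under this fit, going from one node to eight (three doublings) raises the peak by a factor of about 1.66³ ≈ 4.6, i.e. well short of the 8x increase in client count, consistent with the read pattern being absorbed by caching.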



(a) 1 node

(b) 2 nodes

(c) 4 nodes

(d) 8 nodes
Figure 5: The disk throughput from Isilon storage

Figure 6 shows the InfiniBand EDR send and receive throughput with 1, 2, 4 and 8 nodes. The following conclusions can be drawn from this figure:

  • The peak receive throughput at the beginning comes from reading the data from the disks of the Isilon storage.
  • The receive throughput then settles at a lower level because the data are read from the Isilon storage cache, not from the disks anymore.
  • The many sharp drops in both send and receive throughput during training are caused by the data shuffle operation after each epoch; while the data are being shuffled, there is no communication between nodes.
  • With one node, InfiniBand shows only receive throughput, which comes from reading the data from the Isilon storage.
  • Each time the number of nodes was doubled, the send and receive throughput increased by ~100 MB/s.
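The steady-state node-to-node traffic behind these curves is dominated by gradient exchange. A rough estimate of the per-iteration payload is easy to compute; note that the parameter count (the commonly cited figure for ResNet-50, roughly 25.6 million) and the fp16-gradient assumption for mixed-precision training are ours, not measurements from this evaluation:

```python
# Back-of-the-envelope gradient traffic per training iteration.
RESNET50_PARAMS = 25_557_032  # approximate ResNet-50 parameter count

def gradient_bytes(params=RESNET50_PARAMS, bytes_per_value=2):
    """Size of one full gradient exchange. The NVIDIA container trains
    in mixed precision, so gradients are assumed to be fp16 (2 bytes)."""
    return params * bytes_per_value

print(gradient_bytes() / 2**20)  # ~48.7 MiB of gradients per iteration
```

Roughly 50 MiB of gradients must move per iteration regardless of node count, which is why the per-link throughput in Figure 6 grows only modestly as nodes are added: ring-style all-reduce keeps per-node traffic nearly constant while the aggregate traffic scales with the cluster.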



(a) 1 node

(b) 2 nodes

(c) 4 nodes


(d) 8 nodes
Figure 6: The InfiniBand EDR throughput

Conclusions and Future Work

In this blog, we quantified the performance of Dell EMC ready solution v1.1 with the ResNet-50 v1.5 model. The results show that the scale-out solution achieves performance comparable to other vendors' scale-up solutions, and that the current solution delivers much higher training throughput than ready solution v1.0. The storage and network usage were also profiled: each time the number of nodes was doubled, the peak disk throughput increased by ~66% and the network throughput increased by ~100 MB/s. In future work, we will further evaluate the performance of the ready solution with other benchmarks in the MLPerf suite, such as object detection, translation and recommendation.




Article ID: SLN317397

Last Date Modified: 07/12/2019 03:45 PM

