Deep Learning Performance on V100 GPUs with ResNet-50 Model


Article written by Rengan Xu, Frank Han and Quy Ta of HPC and AI Innovation Lab in May 2019.


Abstract

Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA v1.1 and the corresponding reference architecture guide were released in February 2019. This blog quantifies the deep learning training performance of this reference architecture using the ResNet-50 model, scaling from one node up to eight nodes.

Overview

In August 2018, the initial version 1.0 of Dell EMC Ready Solutions for AI – Deep Learning with NVIDIA was released. In February 2019, the solution was updated to version 1.1. The main difference is that version 1.1 changes the CPU-to-GPU connection topology from configuration K to configuration M; the two configurations are compared in Figure 1. Unlike configuration K, which has only one PCIe link between the two CPUs and the four GPUs, configuration M has four PCIe links between them. In addition, the memory of each GPU has grown from 16 GB in Ready Solution v1.0 to 32 GB in v1.1.


The hardware infrastructure of the solution is shown in Figure 2. It includes one PowerEdge R740xd head node, n PowerEdge C4140 compute nodes, the head node's local disks exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch, both to access the Isilon storage when it is included and to export the NFS scratch space to the compute nodes over IPoIB. All compute nodes and the head node are additionally connected to a 1 Gigabit Ethernet management switch, which serves for in-band and out-of-band management via iDRAC9 (Integrated Dell Remote Access Controller) and as the provisioning and deployment network used by Bright Cluster Manager to administer the cluster. The Isilon storage is connected to an FDR-40GigE gateway switch so that it can be accessed by the head node and all compute nodes.

Figure 2: The infrastructure of the ready solution

The ResNet-50 model was used to evaluate the performance of this ready solution. It is one of the models in the MLPerf benchmark suite, which aims to establish a standard benchmark for the machine learning field. Following the MLPerf methodology, we measured the wall-clock time to train the ResNet-50 model until it converges to the target Top-1 evaluation accuracy of 74.9%. The benchmark we used is from the NVIDIA Deep Learning Examples Git repository; we added the distributed launch script from the MXNet repository to run the model across distributed servers. The hardware and software details of this evaluation are listed in Table 1.
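The time-to-accuracy measurement described above can be sketched as a simple loop: train an epoch, evaluate Top-1 accuracy, and stop the wall clock the first time the target is reached. The sketch below is our illustration only; `train_one_epoch` and `evaluate` are hypothetical placeholders, not functions from the NVIDIA benchmark code.

```python
import time

TARGET_TOP1 = 0.749  # MLPerf target Top-1 accuracy for ResNet-50

def time_to_accuracy(train_one_epoch, evaluate, max_epochs=90):
    """Train epoch by epoch until Top-1 accuracy reaches the target.

    train_one_epoch() advances the model by one epoch;
    evaluate() returns the current Top-1 evaluation accuracy.
    Returns (elapsed_seconds, epochs_run).
    """
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= TARGET_TOP1:
            return time.time() - start, epoch
    raise RuntimeError(f"did not reach {TARGET_TOP1:.1%} in {max_epochs} epochs")
```

Because convergence, not just images/second, determines the result, this metric can penalize large-scale runs that need extra epochs, which is exactly what the eight-node results below show.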

Table 1: The hardware configuration and software details

Hardware
  Platform:          PowerEdge C4140
  CPU:               2 x Intel® Xeon® Gold 6148 @ 3.0 GHz (Skylake)
  Memory:            384 GB DDR4 @ 2666 MHz
  Storage:           96 TB Isilon F800
  GPU:               V100-SXM2 with 32 GB memory

OS and Firmware
  Operating System:  RHEL 7.5 x86_64
  Linux Kernel:      3.10.0-693.el7.x86_64
  BIOS:              1.6.12

Deep Learning related
  MXNet:             nvidia-mxnet-18.12-py3 container
  ResNet-50 v1.5:    https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5 at commit 0e66c6dabb8b4c90bd637e27aeb4e67722ca95fc

Performance Evaluation

Figure 3 shows the ResNet-50 training time to the target accuracy of 74.9% with the C4140-M in ready solution v1.1, and Figure 4 shows the throughput comparison with the C4140-K in ready solution v1.0. Both throughput and time-to-accuracy are reported because the two metrics are not always correlated. The testing was scaled from one node (4 V100) to eight nodes (32 V100). The Dell EMC ready solution is a scale-out solution: it gains resources by adding more nodes. Other vendors offer scale-up solutions, which instead pack more GPUs into a single server. Figure 3 also compares our scale-out solution with other vendors' scale-up solutions. The following conclusions can be drawn from Figure 3 and Figure 4:

  • With the same number of GPUs, the scale-out and scale-up solutions achieve the same performance for this model.*
  • In Figure 3, the speedup is 1.8x, 3.3x, and 4.9x with 2, 4 and 8 nodes, respectively. The speedup with eight nodes is somewhat low because the model needs more epochs to converge at that scale: in this evaluation, the model converged after 61, 62, 65 and 81 epochs with 1, 2, 4 and 8 nodes, respectively.
  • The training throughput is much higher in solution v1.1 than in solution v1.0, mainly because of the software update.

* The data for both scale-up systems were publicly available on the MLPerf v0.5 results web page.
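The epoch counts above explain most of the sub-linear eight-node speedup: per-epoch time shrinks with more nodes, but the larger effective batch size needs more epochs to converge. A rough model using the reported epoch counts, under the simplifying assumption (ours) that per-epoch time scales perfectly with node count:

```python
# Epochs to reach 74.9% Top-1 accuracy, as observed in this evaluation.
epochs = {1: 61, 2: 62, 4: 65, 8: 81}

def ideal_speedup(nodes):
    """Time-to-accuracy speedup vs. 1 node, assuming per-epoch time
    scales perfectly as 1/nodes. This is an upper bound; the measured
    speedups (1.8x, 3.3x, 4.9x) are lower due to communication overhead."""
    t1 = epochs[1] * 1.0               # baseline cost in epoch-units
    tn = epochs[nodes] * (1.0 / nodes)  # per-epoch time shrinks with nodes
    return t1 / tn

for n in (2, 4, 8):
    print(n, round(ideal_speedup(n), 2))  # 1.97, 3.75, 6.02
```

Even under ideal per-epoch scaling, the extra epochs cap the eight-node speedup at about 6.0x rather than 8x, so the convergence penalty alone accounts for a large share of the observed gap.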

Figure 3: The time to accuracy comparison

Figure 4: The throughput comparison

Storage and Network Analysis

This section analyzes how the storage and the network are utilized. The Isilon InsightIQ tool was used to monitor the Isilon storage, and the Mellanox Unified Fabric Manager (UFM) was used to monitor the InfiniBand EDR fabric. Figure 5 shows the Isilon disk throughput with 1, 2, 4 and 8 nodes. The following conclusions can be drawn from this figure:

  • The peak disk throughput increased by ~66% each time the number of nodes was doubled.
  • The disk throughput then decreases because of caching: once the whole dataset is cached, first in the Isilon storage cache and ultimately in the system memory of each compute node, the disk throughput drops to zero. The whole dataset is 144 GB, which easily fits in the 384 GB of system memory per node.
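The two observations above can be checked with back-of-the-envelope arithmetic: the dataset-versus-memory comparison uses the sizes from Table 1, and the throughput projection applies the observed ~66% growth per doubling (an empirical fit from Figure 5, not a hardware limit; the base throughput argument is a placeholder of ours):

```python
DATASET_GB = 144    # dataset size as reported above
NODE_MEM_GB = 384   # system memory per C4140 compute node (Table 1)

# The dataset fits entirely in each node's page cache, so after the
# first pass the Isilon disks see essentially no read traffic.
assert DATASET_GB < NODE_MEM_GB

def projected_peak(base_throughput, nodes, growth=1.66):
    """Project peak Isilon disk throughput at a given node count,
    assuming ~66% growth per doubling of nodes (empirical fit)."""
    doublings = nodes.bit_length() - 1  # 1 -> 0, 2 -> 1, 4 -> 2, 8 -> 3
    return base_throughput * growth ** doublings
```

Under this fit, going from one node to eight (three doublings) raises the peak by a factor of about 1.66³ ≈ 4.6, i.e. well short of the 8x increase in client count, consistent with the read pattern being absorbed by caching.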



(a) 1 node

(b) 2 nodes

(c) 4 nodes

(d) 8 nodes
Figure 5: The disk throughput from Isilon storage

Figure 6 shows the InfiniBand EDR send and receive throughput with 1, 2, 4 and 8 nodes. The following conclusions can be drawn from this figure:

  • The peak receive throughput at the beginning comes from reading the data from the disks of the Isilon storage.
  • The receive throughput then settles at a lower level because the data are read from the Isilon storage cache, not from the disks anymore.
  • The many sharp drops in both send and receive throughput during training are caused by the data shuffle operation after each epoch; while the data are being shuffled, there is no communication between nodes.
  • With one node, InfiniBand shows only receive throughput, which comes from reading the data from the Isilon storage.
  • Each time the number of nodes was doubled, the send and receive throughput increased by ~100 MB/s.
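The steady-state node-to-node traffic behind these curves is dominated by gradient exchange. A rough estimate of the per-iteration payload is easy to compute; note that the parameter count (the commonly cited figure for ResNet-50, roughly 25.6 million) and the fp16-gradient assumption for mixed-precision training are ours, not measurements from this evaluation:

```python
# Back-of-the-envelope gradient traffic per training iteration.
RESNET50_PARAMS = 25_557_032  # approximate ResNet-50 parameter count

def gradient_bytes(params=RESNET50_PARAMS, bytes_per_value=2):
    """Size of one full gradient exchange. The NVIDIA container trains
    in mixed precision, so gradients are assumed to be fp16 (2 bytes)."""
    return params * bytes_per_value

print(gradient_bytes() / 2**20)  # ~48.7 MiB of gradients per iteration
```

Roughly 50 MiB of gradients must move per iteration regardless of node count, which is why the per-link throughput in Figure 6 grows only modestly as nodes are added: ring-style all-reduce keeps per-node traffic nearly constant while the aggregate traffic scales with the cluster.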



(a) 1 node

(b) 2 nodes

(c) 4 nodes


(d) 8 nodes
Figure 6: The InfiniBand EDR throughput

Conclusions and Future Work

In this blog, we quantified the performance of Dell EMC ready solution v1.1 with the ResNet-50 v1.5 model. The results show that the scale-out solution achieves performance comparable to other vendors' scale-up solutions, and that the current solution delivers much higher training throughput than ready solution v1.0. The storage and network usage were also profiled: each time the number of nodes was doubled, the peak disk throughput increased by ~66% and the network throughput increased by ~100 MB/s. In future work, we will further evaluate the performance of the ready solution with other benchmarks in the MLPerf suite, such as object detection, translation and recommendation.




Article ID: SLN317397

Last Date Modified: 07/12/2019 03:45 PM

