Article written by Rakshith Vasudev & John Lockman, HPC AI Innovation Lab, October 2019
As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies in a given chest X-ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate training. In this article we describe best practices for achieving maximum performance in distributed, scale-out training of CheXNet using Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in an SXM2 configuration.
Hardware Configuration: | Software Configuration:
---|---
As new computing devices (such as GPUs and TPUs) make it possible to train neural networks at an increasingly fast rate, CPU processing is prone to becoming the bottleneck. The tf.data API provides users with building blocks to design input pipelines that effectively utilize the CPU, optimizing each step of the ETL (extract, transform, load) process.
To perform a training step, you must first extract and transform the training data and then feed it to a model running on an accelerator. However, in a naive synchronous implementation, while the CPU is preparing the data, the accelerator sits idle. Conversely, while the accelerator is training the model, the CPU sits idle. The training step time is thus the sum of the CPU pre-processing time and the accelerator training time.
Pipelining overlaps the preprocessing and model execution of a training step. While the accelerator is performing training step N, the CPU is preparing the data for step N+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the accelerator training time and the time it takes to extract and transform the data.
Without pipelining, the CPU and the GPU/TPU sit idle much of the time:
Fig 1: Sequential execution frequently leaves the GPU idle
With pipelining, idle time diminishes significantly:
Fig 2: Pipelining overlaps CPU and GPU utilization, maximizing GPU utilization
The tf.data API provides a software pipelining mechanism through the tf.data.Dataset.prefetch transformation, which can be used to decouple the time data is produced from the time it is consumed. Specifically, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested.
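For illustration, here is a minimal sketch of adding prefetch to a tf.data pipeline in the TensorFlow 1.x style used in this article. The file path, feature schema, image size, and batch size are placeholders, not the exact ones used for the chest X-ray TFRecords.

```python
import tensorflow as tf

def parse_record(serialized_example):
    """Parse one serialized TFRecord into an image tensor and a label (placeholder schema)."""
    features = tf.parse_single_example(
        serialized_example,
        features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })
    image = tf.image.decode_png(features["image"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features["label"]

dataset = tf.data.TFRecordDataset(["/path/to/train.tfrecord"])  # placeholder path
dataset = dataset.map(parse_record)
dataset = dataset.batch(64)
# prefetch decouples the time data is produced from the time it is consumed:
# a background thread fills an internal buffer with the next batch while the
# accelerator is busy with the current training step.
dataset = dataset.prefetch(buffer_size=1)
```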
More info here: https://www.tensorflow.org/guide/performance/datasets
When you follow the guidelines provided by TensorFlow, you can end up with a data pipeline that looks like this (old approach):
https://github.com/dellemc-hpc-ai/code-examples/blob/master/cheXNet-snippets/old_approach.py
In this approach, also referred to as the old approach, the tf.data pipeline processes the chest X-ray dataset as a sequence of TFRecords, roughly as sketched below.
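The sketch below is an illustrative reconstruction of such a naive pipeline, not the exact code in the linked old_approach.py; the file path is a placeholder and `parse_record` is the parsing helper from the earlier prefetch sketch.

```python
import tensorflow as tf

# Naive pipeline: every step runs sequentially on the CPU with no parallel
# calls and no prefetching, so the GPU waits while each batch is produced.
dataset = tf.data.TFRecordDataset(["/path/to/chexnet_train.tfrecord"])  # placeholder path
dataset = dataset.map(parse_record)          # single-threaded parsing and decoding
dataset = dataset.shuffle(buffer_size=1024)
dataset = dataset.batch(64)
iterator = dataset.make_one_shot_iterator()  # no prefetch: GPU stalls between batches
images, labels = iterator.get_next()
```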
However, this is not the most performant code. It causes stalls, and GPU utilization frequently drops to 0%. In short, it does not utilize the accelerators effectively.
To set up the tf.data pipeline correctly, we follow the approach taken by the TensorFlow official models, specifically the one for ResNet. The difference between the old approach and the new approach is how the data pipeline is set up before feeding the data to the model.
Here is what the new approach looks like:
https://github.com/dellemc-hpc-ai/code-examples/blob/master/cheXNet-snippets/new_approach.py
The TensorFlow official models approach, also referred to as the new approach, is roughly as sketched below.
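The following is a minimal sketch in the style of the TensorFlow official ResNet input pipeline; the exact code is in the linked new_approach.py. Paths, batch size, and thread counts are placeholders, and `parse_record` is the helper from the earlier sketch.

```python
import tensorflow as tf

# List and shuffle the TFRecord files (placeholder glob pattern).
filenames = tf.data.Dataset.list_files("/path/to/train_*.tfrecord")
filenames = filenames.shuffle(buffer_size=1024)

# Read several TFRecord files in parallel and interleave their records.
dataset = filenames.apply(
    tf.data.experimental.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=8))

dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.repeat()

# Parse and batch in parallel on the CPU.
dataset = dataset.apply(
    tf.data.experimental.map_and_batch(
        parse_record, batch_size=64, num_parallel_calls=8))

# Overlap CPU preprocessing with GPU compute.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
```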
Here is a comparison of the performance of both the old and new approaches, using the TF Official models.
Fig 3: With the new approach, close to linear scaling could be expected.
As can be seen, there is a significant performance hit with the old approach, and it does not scale at all. The GPUs frequently sit at little to no utilization, causing performance to flatline. Making computation more efficient on one GPU translates to making it more efficient on multiple GPUs across multiple nodes, provided communication is handled well.
Having the CPUs do the pre-processing in parallel and prefetch the processed data into memory, while the GPUs do the heavy lifting of matrix multiplication with fast communication, is another aspect that makes this new approach attractive for scaling to multiple nodes.
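As a rough sketch of how such a pipeline plugs into multi-node training with Horovod (the optimizer, learning rate, and omitted model/loss details below are placeholders, not the exact CheXNet training code):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each training process to a single GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with efficient allreduce.
opt = tf.train.AdamOptimizer(learning_rate=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```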
The environment that runs your jobs is as important as the jobs themselves: having the right libraries/modules matters because they impact training performance. Having the latest CUDA-related libraries can also help improve performance.
Follow this tutorial to install them if you don't have a working environment set up.
MPI is a communication library that helps Horovod distribute jobs. Different binding and mapping parameter options were explored, and the best parameters for the C4140 were obtained by mapping by socket. The recommended setting is as follows: `mpirun --map-by socket -np x python pyfile.py --pyoptions`
If more than one process is working on a single GPU, the processes may well fight for GPU resources, consume GPU memory, and effectively fit fewer images per batch, making GPU performance take a hit. TensorFlow 1.13.1 was explored, but it appeared to have a bug at the time: it launched more than one process per GPU.
In summary:
- Setting up the tf.data pipeline correctly is essential: pre-process in parallel on the CPUs and prefetch so the GPUs are never starved.
- The new approach, based on the TensorFlow official models, delivers close to linear scaling, whereas the old approach does not scale at all.
- Keep the software environment, including CUDA-related libraries and modules, up to date.
- Launch MPI/Horovod jobs on the C4140 with `--map-by socket`.
- Ensure only one training process runs per GPU.