Article Number: 000124388
Rakshith Vasudev and John Lockman – Dell HPC AI Innovation Lab
As many organizations start to adopt AI in their businesses, the desire to more quickly train models will only grow. This is when the potential of acceleration, specifically from GPUs, becomes apparent. GPUs offer the potential to train deep learning models more quickly, sometimes by orders of magnitude, compared to unaccelerated compute.
As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate the training process. In this article, we describe how we achieved distributed scale-out training of CheXNet using Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in SXM2 configuration.
The dataset contains about 112,000 frontal chest x-ray images, each of which can be labelled with one or more of the fourteen thoracic pathologies. The dataset is highly imbalanced: more than half of the images have no listed pathologies.
The goal of training CheXNet is not just training per se. We want to cut the training time from days to hours, which distributed training makes possible. We found that training on a single CPU can take a few days. For this project, we used Horovod to perform distributed training on multiple GPUs. Horovod uses MPI for communication.
Horovod uses the data parallelism approach. Here is a quick gist of how it works:
Horovod uses Nvidia’s NCCL to provide an optimized version of ring all-reduce for collective communication, and it relies on MPI to enable communication between workers.
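To make the data-parallel idea concrete, here is a minimal pure-Python sketch of a ring all-reduce that averages per-worker gradients. It is a single-process simulation for illustration only, not Horovod's or NCCL's implementation; the function name ring_allreduce_average and the toy gradient values are assumptions made for this example.

```python
def ring_allreduce_average(grads):
    """Simulate a ring all-reduce; grads[r] is worker r's gradient vector."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n
    data = [list(g) for g in grads]  # each worker's local buffer

    def span(c):
        # Index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (scatter-reduce): each worker passes one chunk to its right-hand
    # neighbour per step. After n-1 steps, worker r holds the complete sum
    # for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous
        sends = [((r - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, i in enumerate(span(c)):
                data[dst][i] += payload[k]

    # Phase 2 (all-gather): the completed chunks travel once around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, i in enumerate(span(c)):
                data[dst][i] = payload[k]

    # Divide by the worker count to turn the sum into an average
    return [[v / n for v in buf] for buf in data]


# Toy gradients for four workers (illustrative values only)
worker_grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
averaged = ring_allreduce_average(worker_grads)
print(averaged[0])  # every worker ends with [7.0, 8.0, 9.0, 10.0]
```

Each worker sends and receives only one chunk per step, so the data moved per worker stays roughly constant as workers are added, which is why the ring pattern is popular for gradient averaging.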
The CheXNet paper from Stanford proposed using the 121-layer DenseNet because dense connections improve the flow of information and gradients through the network, making the optimization of very deep networks tractable. The final fully connected layer is replaced with one that has a single output, after which a sigmoid non-linearity is applied. The weights are initialized from a model pretrained on ImageNet. The network is trained using the Adam optimizer with a batch size of 16. The learning rate starts at 0.001 and is decayed by a factor of 10 each time the validation loss plateaus after an epoch. The model with the lowest validation loss is picked.
The original chest x-ray images from NIH were converted into TFRecords. The input data for the CheXNet model is a sequence of sharded TFRecords, distributed evenly so that every TFRecord holds 256 images. This number is a multiple of the batch size, which in our case was 64. Thus each record yields a whole number of batches, so the data for every batch is read in full and does not have to be re-read.
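The sharding arithmetic can be sketched as follows. This is an illustration under stated assumptions: it uses the approximate figure of 112,000 images quoted earlier and treats the whole dataset as a single split, and the helper name shard_layout is hypothetical.

```python
import math

NUM_IMAGES = 112_000        # approximate dataset size quoted in the text
IMAGES_PER_RECORD = 256     # images per TFRecord shard (from the text)
LOCAL_BATCH_SIZE = 64       # per-GPU mini-batch size (from the text)


def shard_layout(num_images, images_per_record, batch_size):
    """Return (number of shards, whole batches per shard)."""
    if images_per_record % batch_size != 0:
        raise ValueError("shard size should be a multiple of the batch size")
    num_shards = math.ceil(num_images / images_per_record)
    batches_per_shard = images_per_record // batch_size
    return num_shards, batches_per_shard


num_shards, batches_per_shard = shard_layout(
    NUM_IMAGES, IMAGES_PER_RECORD, LOCAL_BATCH_SIZE
)
print(num_shards, batches_per_shard)  # 438 4
```

Because 256 is an exact multiple of 64, no batch straddles two shards, so a worker can consume a shard in four complete batches.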
The converted TFRecords were fed through a tf.data pipeline, which keeps the GPU busy with very little idle time and so makes training faster. The pipeline loads data from disk, applies optimized transformations, creates batches and sends them to the GPU. Without data pipelines, GPUs wait for the CPU to load the data, leading to performance issues and data starvation. This article describes the tf.data pipeline setup we used in more detail.
We used DenseNet as the pretrained base model with ImageNet weights. The Adam optimizer is wrapped with the Horovod distributed optimizer to support distributed training, and the local mini-batch size is 64. The output of the pretrained model is fed to a global average pooling layer, which averages over the spatial dimensions, and its output is fed to a fully connected dense layer with 14 neurons and a sigmoid activation function. The learning rate starts at 0.001 and is decayed by a factor of 10 each time the validation loss plateaus after an epoch. The model was trained for 10 epochs. The weight file with the highest average AUC value is picked.
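The learning-rate rule described above can be illustrated with a small pure-Python sketch. The text does not say which mechanism was used to implement the decay, so this is only one plausible reading: it assumes a patience of zero (the rate is cut as soon as an epoch fails to improve on the best validation loss), and the function name plateau_schedule and the sample losses are hypothetical.

```python
def plateau_schedule(val_losses, initial_lr=0.001, factor=0.1, min_delta=0.0):
    """Return the learning rate used in each epoch, given per-epoch validation losses."""
    lr = initial_lr
    best = float("inf")
    rates = []
    for loss in val_losses:
        rates.append(lr)            # rate in effect during this epoch
        if loss < best - min_delta:
            best = loss             # validation loss improved: keep the rate
        else:
            lr *= factor            # plateau: divide the rate by 10 for the next epoch
    return rates


# Hypothetical validation losses for five epochs
rates = plateau_schedule([0.20, 0.18, 0.18, 0.17, 0.175])
print(rates)
```

With these sample losses the rate stays at 0.001 for the first three epochs and drops to roughly 0.0001 from the fourth epoch onward, because the third epoch did not improve on the second.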
To add horovod, we perform the following modifications to the tf.keras model:
Initializing the MPI Environment
Initializing the MPI environment in Horovod only requires calling the init method:
import horovod.tensorflow.keras as hvd
hvd.init()
This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.
Broadcasting the neuron weights is done using a callback to the Keras Model.fit method. In fact, many of Horovod’s features are implemented as callbacks to Model.fit, so it’s worthwhile to define a callback list object for holding all the callbacks.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.
Wrapping the Optimizer Function
The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer that inherits from tf.keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.
from tensorflow.keras import optimizers
hvd_opt = hvd.DistributedOptimizer(optimizers.Adam(learning_rate=0.001))
The distributed optimizer will now use an all-reduce collective (MPI_Allreduce, or NCCL’s ring all-reduce on GPUs) to aggregate error information from training batches onto all workers, rather than collecting it only on the root worker. This allows the workers to independently update their models rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.
Between steps, error metrics need to be averaged across workers to calculate the global loss. Horovod provides another callback for this, called MetricAverageCallback.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0),
             hvd.callbacks.MetricAverageCallback()]
This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.
Writing Checkpoints from a Single Worker
When using distributed deep learning, it’s important that only one worker write checkpoint files to ensure that multiple workers writing to the same file does not produce a race condition, which could lead to checkpoint corruption.
Checkpoint writing in tf.keras is enabled by another callback to Model.fit. However, we only want to call this callback from one worker instead of all workers. By convention, we use worker 0 for this task, but technically we could use any worker for this task. The one good thing about worker 0 is that even if you decide to run your distributed deep learning job with only 1 worker, that worker will be worker 0.
callbacks = [ ... ]
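if hvd.rank() == 0:
    # Assumes `import tensorflow as tf`; the checkpoint filename is illustrative.
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))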
Outcome: Distributed deep learning training improves the training time and image throughput by roughly an order of magnitude.
The following chart shows how distributing the job can expedite the training process on GPUs. The seven tests shown measure the training speed of the tf.keras DenseNet-121 model on 1 x V100 SXM2 32GB GPU through 16 x V100 SXM2 32GB GPUs. Using 4 x C4140 nodes with 16 GPUs in total, distributed deep learning provided a 10.5x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 dataset from 70 minutes to 6.7 minutes.
Fig 1 : Throughput performance comparisons of CheXNet with distributed deep learning using Horovod.
The throughput scaling is nearly linear, and the time to train scales nearly linearly as well. There is a small performance hit when training goes beyond a single node; however, a fast fabric helps reduce the communication bottleneck.
Fig 2 : Time to train CheXNet using DenseNet121 with distributed training is reduced 10.5x with 16 GPUs.
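As a quick check on the numbers above, the speedup and per-GPU scaling efficiency of the end-to-end training time can be computed directly. This sketch uses only the rounded times quoted in the text (70 and 6.7 minutes), so the result differs slightly from the 10.5x reported; the helper name scaling_summary is hypothetical. Note that it describes time to train, which is a different measurement from the image throughput plotted in Fig 1.

```python
def scaling_summary(baseline_minutes, scaled_minutes, num_gpus):
    """Return (speedup over the baseline, speedup per GPU)."""
    speedup = baseline_minutes / scaled_minutes
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Rounded figures from the text: 1 GPU = 70 min, 16 GPUs = 6.7 min
speedup, efficiency = scaling_summary(70.0, 6.7, 16)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
# prints "speedup 10.4x, efficiency 65%"
```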
Since CheXNet is a multilabel, multiclass classification problem on an imbalanced dataset, the appropriate metric for assessing the model is AUROC. The following scatterplot shows the AUC values for all 14 classes predicted by the trained model.
Fig 3: AUC values of CheXNet using DenseNet121 with distributed training.
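For readers unfamiliar with the metric, here is a minimal pure-Python sketch of how a per-class AUROC can be computed for multilabel predictions. It uses the pairwise-ranking definition (the probability that a positive example scores higher than a negative one, with ties counted as half) and is not the evaluation code used for this article; the function names and the toy labels and scores are assumptions.

```python
def binary_auroc(labels, scores):
    """AUROC for one class: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def per_class_auroc(label_rows, score_rows):
    """Compute AUROC independently for each class (column) of a multilabel problem."""
    num_classes = len(label_rows[0])
    return [
        binary_auroc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(num_classes)
    ]


# Toy example: four images, two classes (illustrative values only)
labels = [[1, 0], [0, 0], [1, 1], [0, 1]]
scores = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.7, 0.5]]
aucs = per_class_auroc(labels, scores)
print(aucs)                   # [0.75, 1.0]
print(sum(aucs) / len(aucs))  # 0.875
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but too slow for a full test set; a sort-based implementation gives the same value.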
In this blog, we’ve seen how to accelerate the training process when developing neural network-based models with distributed deep learning. This blog showed the process of transforming a tf.keras model to take advantage of multiple nodes using the Horovod framework, and how a few simple code changes with some additional compute infrastructure can reduce the time needed to train a model from hours to minutes.
High Performance Computing Solution Resources, PowerEdge C4140
10 Apr 2021