Optimization Techniques for Training CheXNet on Dell C4140 with Nvidia V100 GPUs

Summary: Best practices to achieve maximum performance on distributed scale out training of CheXNet utilizing Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers.

This article applies to This article does not apply to

Symptoms

Article was written by Rakshith Vasudev & John Lockman - HPC AI Innovation Lab in October 2019

Cause

Resolution

As introduced previously, CheXNet is an AI radiologist assistant model that utilizes DenseNet to identify up to 14 pathologies from a given chest x ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121 with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems however we are interested in exploiting the parallelism of GPUs to accelerate the training process. In this article we describe best practices to achieve maximum performance on distributed scale out training of CheXNet utilizing Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in SXM2 configuration.

Hardware Configuration:	Software Configuration:
4 PowerEdge C4140 4 Nvidia V100 32GB SXM2 2 20 core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz 384 GB RAM, DDR4 2666MHz 1 Mellanox EDR HCA Lustre file system	Deep Learning Framework: tensorflow-gpu Framework version: 1.12.0 Horovod version: 0.16.4 MPI version: 4.0.0 with cuda and ucx support CUDA version: 10.1.105 CUDNN version: 7.6.0 NCCL version: 2.4.7 Python version: 3.6.8 OS and version: RHEL 7.4

The data pipeline is critical to getting the most performance from your accelerators:

What is tf data, why should you strive to use it?

As new computing devices (such as GPUs and TPUs) make it possible to train neural networks at an increasingly fast rate, the CPU processing is prone to becoming the bottleneck. The tf.data API provides users with building blocks to design input pipelines that effectively utilize the CPU, optimizing each step of the ETL process.

To perform a training step, you must first extract and transform the training data and then feed it to a model running on an accelerator. However, in a naive synchronous implementation, while the CPU is preparing the data, the accelerator is sitting idle. Conversely, while the accelerator is training the model, the CPU is sitting idle. The training step time is thus the sum of both CPU pre-processing time and the accelerator training time

Pipelining overlaps the preprocessing and model execution of a training step. While the accelerator is performing training step N, the CPU is preparing the data for step N+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract and transform the data.

Without pipelining, the CPU and the GPU/TPU sit idle much of the time:

SLN318898_en_US__1Sequantial execution
Fig 1: Sequential execution frequently leaves the GPU idle

With pipelining, idle time diminishes significantly:

SLN318898_en_US__2Pipelining overlaps
Fig 2: Pipelining overlaps CPU and GPU utilization, maximizing GPU utilization

The tf.data API provides a software pipelining mechanism through the tf.data.Dataset.prefetch transformation, which can be used to decouple the time data is produced from the time it is consumed. Specifically, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested.

More info here: https://www.tensorflow.org/guide/performance/datasets

When one follows guidelines provided by tensorflow, it is possible to get a data pipeline that looks like this (old Approach):
https://github.com/dellemc-hpc-ai/code-examples/blob/master/cheXNet-snippets/old_approach.py

In this approach also referred to as the old approach, tf data pipeline does the following (assuming the chest x ray dataset is a sequence of TFRecords):

Gets the absolute list of filenames.
Builds a dataset from the list of filenames using TFRecordDataset()
Create a new dataset that loads and formats images by preprocessing them.
Shard the dataset.
Shuffle the dataset when training.
Repeat the dataset.
Batch the dataset.
Prefetch the dataset for the batch_size.

However, this is not the most performant code. It causes stalls and frequent 0% gpu utilizations. Basically, it is not utilizing the accelerators effectively.

To correctly setup the tf data pipeline, we follow the approach taken by tensorflow official models specifically,the one for ResNet. The difference between the old approach and new approach is how the data pipeline is setup before feeding it to the model.

Here is how the new approach looks like:
https://github.com/dellemc-hpc-ai/code-examples/blob/master/cheXNet-snippets/new_approach.py

TensorFlow official models approach also referred to as new approach is as follows:

Gets the absolute list of filenames.
Builds a dataset from the list of filenames using from_tensor_slices()
Sharding is done ahead of time.
The dataset is shuffled during training.
The dataset is then parallelly interleaved, which is interleaving and processing multiple files (defined by cycle_length) to transform them to create TFRecord dataset.
The dataset is then prefetched. The buffer_size defines how many records are prefetched, which is usually the mini batch_size of the job.
The dataset is again shuffled. Details of the shuffle is controlled by buffer_size.
The dataset is repeated. It repeats the dataset until num_epochs to train.
The dataset is subjected to simultaneous map_and_batch() which parses the tf record files, which in turn preprocesses the image and batches them.
The preprocessed image is ready as a dataset and prefetched again.

Here is a comparison of the performance of both the old and new approaches, using the TF Official models.

SLN318898_en_US__3image(12053)
Fig 3: With the new approach, close to linear scaling could be expected.

As it can be seen, there is a significant performance hit and there is no scaling at all with the old approach. GPUs would frequently experience little to no utilization, causing performance to flatline. Making computation more efficient on one GPU translates to making computation more efficient on multiple GPUs on multiple nodes, if communication is handled well.
Having CPUs do the pre-processing parallelly, prefetching the processed data in memory & GPUs doing the heavy lifting of matrix multiplication with fast communication is another aspect that makes this new approach more attractive to scale for multinodes.

Other techniques to be looked at to achieve optimal performance:

Having the correct environment is important:

The environment that runs your jobs are as important as your jobs, because having the right libraries/modules are important as they will impact on the training performance. Also having the latest CUDA related libraries can help to improve performance.

Follow this tutorial to install, if you don’t have a working environment setup.

Using correct binding parameters for MPI:

MPI is a communication library that aids horovod to distribute jobs. Different binding and mapping parameter options were explored and the best parameters for the C4140 was mapping by socket. The recommended setting is as follows:
mpirun --map-by socket -np x python pyfile.py --pyoptions

Ensure one process acts on one GPU:

If there are more than one process that are working one gpu, they may very well be fighting for GPU resources, consuming GPU memory and effectively not fitting more images per batch, thus responsible for making GPU performance take a hit. Tensorflow 1.13.1 was explored, but it seemed like it had a bug at the time. It was launching more than one process per GPU.

In summary:

Having data pipeline correctly setup is critical to see performance gains.
Setting up the correct environment is a good contributor to better performance.
Using correct binding parameters for MPI helps in improving performance.
Profile and fix bottlenecks when GPUs are not fully utilized.

Affected Products

High Performance Computing Solution Resources, Poweredge C4140

Optimization Techniques for Training CheXNet on Dell C4140 with Nvidia V100 GPUs

Summary: Best practices to achieve maximum performance on distributed scale out training of CheXNet utilizing Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers.

Symptoms

Cause

Resolution

The data pipeline is critical to getting the most performance from your accelerators:

What is tf data, why should you strive to use it?

Other techniques to be looked at to achieve optimal performance:

Having the correct environment is important:

Using correct binding parameters for MPI:

Ensure one process acts on one GPU:

Affected Products

Article Properties

Find answers to your questions from other Dell users

Support Services

Article Properties

Find answers to your questions from other Dell users

Support Services

Welcome

Welcome to Dell

Optimization Techniques for Training CheXNet on Dell C4140 with Nvidia V100 GPUs

Summary: Best practices to achieve maximum performance on distributed scale out training of CheXNet utilizing Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers.

Detailed Article

Symptoms

Cause

Resolution

Affected Products

Symptoms

Cause

Resolution

The data pipeline is critical to getting the most performance from your accelerators:

What is tf data, why should you strive to use it?

Other techniques to be looked at to achieve optimal performance:

Having the correct environment is important:

Using correct binding parameters for MPI:

Ensure one process acts on one GPU:

Affected Products

Article Properties

Find answers to your questions from other Dell users

Support Services

Article Properties

Find answers to your questions from other Dell users

Support Services