
Tips and Tricks to Optimize workflow with TF and Horovod on GPUs

Summary: This article describes tips and tricks for optimizing a workflow that uses TensorFlow (TF) and Horovod across multiple GPUs.


Article Content






Article was written by Rakshith Vasudev & John Lockman - HPC AI Innovation Lab in October 2019


Horovod is a distributed deep learning framework that expedites training with deep learning frameworks such as TensorFlow, Keras, PyTorch, and MXNet. It was developed to make distributed deep learning projects easier to set up and faster to run with TensorFlow. Horovod parallelizes the training process and supports both data parallelism and model parallelism. When a neural network training job that uses Horovod is running, the following tips can be used to debug it and to improve its performance.



This article uses CheXNet as its reference example. CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image.

  1. Setting up the environment correctly can save a large amount of time when debugging performance issues. Make sure the GPU version of the deep learning framework is being used. In this example, tensorflow-gpu packaged by Anaconda is used.

  2. Using horovodrun or mpirun with binding parameters can yield performance gains. Ideally, the process controlling a GPU should be bound to the closest CPU socket. On the Dell EMC PowerEdge C4140, the optimal option is --map-by socket; no additional binding option needs to be specified.

    It will look similar to this: 

    mpirun --map-by socket -np x python - -pyoptions


  3. The job should be set up so that one MPI process works on one GPU. If there are more processes than GPUs, the processes compete for computing resources and the job cannot run with good performance. In the above example, x should equal the number of GPUs to be used.

  4. To set one process per GPU, use TensorFlow’s ConfigProto() and set its gpu_options.visible_device_list to the Horovod local rank of the process.
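A minimal sketch of this setting, assuming TensorFlow 1.x and Horovod's TensorFlow bindings. The helper name pin_process_to_gpu is hypothetical, and enabling allow_growth is an optional extra rather than something this article prescribes. A stand-in object is used so the sketch runs without TensorFlow installed; the comments show the intended real calls.

```python
from types import SimpleNamespace


def pin_process_to_gpu(config, local_rank):
    """Restrict this process to the one GPU whose index equals its local rank.

    `config` is expected to behave like a TensorFlow 1.x ConfigProto.
    """
    config.gpu_options.visible_device_list = str(local_rank)
    # Optional (assumption, not from the article): allocate GPU memory on demand.
    config.gpu_options.allow_growth = True
    return config


# With TensorFlow 1.x and Horovod installed, the intended use is:
#   import tensorflow as tf
#   import horovod.tensorflow as hvd
#   hvd.init()
#   config = pin_process_to_gpu(tf.ConfigProto(), hvd.local_rank())
#   session = tf.Session(config=config)

# Stand-in object so the sketch can be run without TensorFlow.
demo = pin_process_to_gpu(SimpleNamespace(gpu_options=SimpleNamespace()), local_rank=2)
print(demo.gpu_options.visible_device_list)
```

Because each MPI process sees only the GPU that matches its local rank, launching as many processes per node as there are GPUs gives one process per GPU.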



  5. To check the number of processes using each GPU and the GPU memory consumption, use ‘watch nvidia-smi’. The output also shows power consumption.

    Figure 1: Screenshot of nvidia-smi command output showing memory, power and GPU utilization.

  6. If the data pipeline is set up correctly and the mpirun parameters are correct, GPU utilization should consistently exceed 90% once model training begins. Occasional dips to 10-25% utilization are acceptable, but they should be infrequent.
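The checks in points 5 and 6 can also be scripted. The sketch below is one possible approach, assuming nvidia-smi's CSV query mode is available; the function names and the sample text (made-up numbers) are illustrative only and are not measurements from this article.

```python
import subprocess

FIELDS = "index,utilization.gpu,memory.used,memory.total,power.draw"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total, power = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(used),
            "mem_total_mib": float(total),
            "power_w": power,  # kept as text; some GPUs do not report a number
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once; requires an NVIDIA driver on the machine."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def below_threshold(stats, threshold_pct=90.0):
    """Return the indices of GPUs whose utilization is under the threshold."""
    return [s["index"] for s in stats if s["util_pct"] < threshold_pct]


# Illustrative sample text so the parser can be tried without a GPU.
SAMPLE = "0, 97, 15079, 16160, 251.30\n1, 12, 15079, 16160, 80.12\n"
sample_stats = parse_gpu_stats(SAMPLE)
print(below_threshold(sample_stats))
```

On a GPU node, calling query_gpu_stats() periodically during training and logging below_threshold(...) gives a rough record of how often utilization dips.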

  7. Set the batch size so that GPU memory is almost full without exceeding the available memory. It is also important to consider how learning rate scaling is handled. Learning rate scaling means that, as the number of GPUs increases, the learning rate is multiplied by a factor proportional to the number of GPUs, which helps the model converge effectively. In this way, the number of I/O operations is reduced by fitting the maximum possible number of images onto a GPU without compromising model convergence. Note, however, that learning rate scaling is not always the best way to improve model convergence in a distributed workload.
    To check whether learning rate scaling is needed:

    a)    Train the model in distributed mode with and without learning rate scaling.
    b)    If the model without learning rate scaling performs better than the model with it, learning rate scaling is not needed.

    When training for convergence in particular, fitting the highest possible number of images per batch is not a mandatory rule. There is usually a tradeoff between batch size and convergence (whether or not learning rate scaling is used), and the data scientist must decide on it based on the use case.
    Again, GPU memory consumption can be checked using ‘watch nvidia-smi’. In this case study, learning rate scaling was not used because training without it yielded a better model as measured by AUC; the local minibatch size was 64. When learning rate scaling is used, it is typical to include warm-up epochs, as described in this paper.
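As a rough illustration of learning rate scaling with warm-up, the sketch below multiplies a base learning rate by the number of GPUs and ramps up to that value linearly over the first few epochs. The linear ramp, the default of 5 warm-up epochs, the base rate of 0.001, and the function name are assumptions chosen for illustration; they are not the settings used in this case study.

```python
def scaled_learning_rate(base_lr, num_gpus, epoch, warmup_epochs=5):
    """Learning rate for a given epoch under linear scaling with linear warm-up.

    The target rate is base_lr * num_gpus. During warm-up the rate ramps
    linearly from base_lr (epoch 0) to the target (epoch == warmup_epochs).
    """
    target = base_lr * num_gpus
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


# Example: 8 GPUs, local minibatch of 64 images per GPU.
num_gpus = 8
global_batch = 64 * num_gpus  # images processed per step across all GPUs
schedule = [scaled_learning_rate(0.001, num_gpus, e) for e in range(7)]
print(global_batch, schedule)
```

In a Horovod job, num_gpus would normally come from hvd.size(). Comparing a run that uses this schedule with a run that keeps base_lr fixed is one way to carry out the check in steps a) and b).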

  8. Profile your jobs using Horovod Timeline and nvprof to identify any bottlenecks. More than likely, the bottleneck is due to one of the following:

    a)    The TensorFlow data pipeline (tf.data) is not set up well, so a lot of time is spent preparing the data while the accelerator sits idle. To fix this, the data pipeline must be corrected.
    Please read this article for information on how to set up a tf.data pipeline.

    b)    The communication might not be using the correct fabric. Ensure that InfiniBand is being used. To see the fabric usage, include -x NCCL_DEBUG=INFO when running mpirun, like so:
            mpirun -np 8 --map-by socket -x NCCL_DEBUG=INFO python $HOME/models/chexnet/ --batch_size=128 --epochs=15
            Alternatively, use horovodrun, which includes the -x binding.
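The mpirun invocations in this article follow one pattern, which can be assembled programmatically. The sketch below is only an illustration: the script name train.py and the timeline path are hypothetical, and it assumes that Horovod Timeline is enabled through the HOROVOD_TIMELINE environment variable.

```python
import shlex


def build_mpirun_command(num_gpus, script, script_args=(), env=None):
    """Build an mpirun command with one process per GPU, mapped by socket."""
    cmd = ["mpirun", "-np", str(num_gpus), "--map-by", "socket"]
    # Each -x flag exports one environment variable to the launched processes.
    for key, value in (env or {}).items():
        cmd += ["-x", "{}={}".format(key, value)]
    cmd += ["python", script]
    cmd += list(script_args)
    return cmd


cmd = build_mpirun_command(
    num_gpus=8,
    script="train.py",  # hypothetical training script
    script_args=["--batch_size=128", "--epochs=15"],
    env={
        "NCCL_DEBUG": "INFO",
        "HOROVOD_TIMELINE": "/tmp/timeline.json",  # hypothetical output path
    },
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping -np tied to the GPU count in one place makes it harder to launch more processes than GPUs by accident.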

  9. To implement distribution properly, GPUs need to communicate effectively with one another; otherwise, communication becomes a bottleneck. To check whether they are communicating optimally, use the following process:

    First, see how the GPUs are communicating by passing -x NCCL_DEBUG=INFO as shown in point 8b.

    If the GPUs are communicating optimally:
    1) Within a node, the output will look similar to this:
    gpu002:1299562:1299573 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
    2) Across nodes, the output will look similar to this:
    gpu028:149460:149495 [0] NCCL INFO Ring 01 : 16 -> 0 [send] via NET/IB/0
    gpu009:164181:164216 [0] NCCL INFO Ring 01 : 12 -> 8 [receive] via NET/IB/0
    gpu009:164181:164216 [0] NCCL INFO Ring 01 : 4 -> 8 [receive] via NET/IB/0
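Scanning a long NCCL log by eye is tedious, so a small script can count which transport each ring connection uses. This is a sketch that assumes the log lines have the same shape as the examples above; other NCCL versions may format these lines differently, and the category names are illustrative.

```python
import re

# Matches lines such as "... NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC".
RING_LINE = re.compile(r"NCCL INFO Ring (\d+) : (.+?) via (\S+)")


def classify_transport(transport):
    """Map an NCCL transport string to a coarse category."""
    if transport.startswith("P2P"):
        return "intra-node P2P"
    if transport.startswith("NET/IB"):
        return "inter-node InfiniBand"
    return "other"  # anything else deserves a closer look


def summarize(log_text):
    """Count ring connections per transport category."""
    counts = {}
    for line in log_text.splitlines():
        match = RING_LINE.search(line)
        if match:
            kind = classify_transport(match.group(3))
            counts[kind] = counts.get(kind, 0) + 1
    return counts


LOG = """\
gpu002:1299562:1299573 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
gpu028:149460:149495 [0] NCCL INFO Ring 01 : 16 -> 0 [send] via NET/IB/0
gpu009:164181:164216 [0] NCCL INFO Ring 01 : 12 -> 8 [receive] via NET/IB/0
gpu009:164181:164216 [0] NCCL INFO Ring 01 : 4 -> 8 [receive] via NET/IB/0
"""
summary = summarize(LOG)
print(summary)
```

A non-zero count in the "other" category suggests that some connections are not using P2P within a node or InfiniBand between nodes.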

Distributing a deep learning job can be challenging, especially when adding nodes or GPUs does not translate into a corresponding gain in performance. To get the most from your investment in accelerators, make sure that the following best practices are implemented:
  • use the correct binding options,
  • check that multiple processes are not wasting GPU memory,
  • use a modern pipelining approach,
  • profile to confirm that the GPUs are used at least 80% of the time the job is running,
  • use the latest CUDA-related libraries.

Article Properties

Affected Product

High Performance Computing Solution Resources, PowerEdge C4140

Last Published Date

17 Sep 2021



Article Type