Bare Metal vs Kubernetes: Distributed Training with TensorFlow



This article was written by Rakshith Vasudev and John Lockman of the HPC AI Innovation Lab in October 2019.



Table of Contents

  1. Introduction
    1. Bare Metal
    2. Kubernetes
  2. Software Versions
  3. Real World Use Case: CheXNet
  4. Hardware Specifications
  5. Performance
  6. Summary

Introduction

In this article, we evaluate scaling performance when training CheXNet on Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers using two approaches common in modern data centers: the traditional HPC "bare metal" approach, with an environment built with Anaconda, and a containerized approach, with Nvidia GPU Cloud (NGC) containers running in an on-premises Kubernetes environment.

Bare Metal
A bare metal system is a traditional HPC cluster where software stacks are installed directly on the local hard disk or a shared network mount. A system administrator manages the software environments, and users are restricted to building software in a shared /home filesystem. User code is batch scheduled by the Slurm workload manager.

Kubernetes
Our Kubernetes (K8s) system uses Nvidia’s NGC containers to provide all required software prerequisites, environment configurations, and so on. The system administrator installs only the base operating system, drivers, and K8s. These Docker-based containers can be downloaded from NGC at run time or stored in a local registry. K8s handles workload management, resource availability, launching distributed jobs, and scaling on demand.

Software Versions

Component          NGC Container (nvcr.io/nvidia/tensorflow:19.06-py3)   Conda Environment
Framework          TensorFlow 1.13.1                                     TensorFlow 1.12.0
Horovod            0.15.1                                                0.16.1
MPI                OpenMPI 3.1.3                                         OpenMPI 4.0.0
CUDA               10.2                                                  10.1
CUDA Driver        430.26                                                418.40.04
NCCL               2.4.7                                                 2.4.7
cuDNN              7.6.0                                                 7.6.0
Python             3.5.2                                                 3.6.8
Operating System   Ubuntu 16.04.6                                        RHEL 7.4
GCC                5.4.0                                                 7.2.0

Table 1: Software versions


Real World Use Case: CheXNet

As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image. Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate the training process. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in the SXM2 configuration.


Hardware Specifications

Component          Bare Metal System                    Kubernetes System
Platform           PowerEdge C4140                      PowerEdge C4140
CPU                2 x Intel® Xeon® Gold 6148 @2.4GHz   2 x Intel® Xeon® Gold 6148 @2.4GHz
Memory             384 GB DDR4 @ 2666MHz                384 GB DDR4 @ 2666MHz
Storage            Lustre                               NFS
GPU                V100-SXM2 32GB                       V100-SXM2 32GB
Operating System   RHEL 7.4 x86_64                      CentOS 7.6
Linux Kernel       3.10.0-693.x86_64                    3.10.0-957.21.3.el7.x86_64
Network            Mellanox EDR InfiniBand              Mellanox EDR InfiniBand (IP over IB)

Table 2: Hardware specifications

Performance

Image throughput, in images per second, was measured when training CheXNet using 1, 2, 3, 4, and 8 GPUs across two C4140 nodes on both systems described in Table 2. The specifications of the run, including the model architecture and input data, are detailed in this article. Figure 1 compares the measured performance on the Kubernetes system and the bare metal system.
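As a rough illustration of the metric, the sketch below computes images per second from a timed training window and the scaling efficiency relative to a single GPU. The function names and all input numbers are hypothetical placeholders chosen for easy arithmetic. They are not the authors' benchmark harness and not measurements from this article.

```python
def images_per_second(global_batch_size, steps, elapsed_seconds):
    """Aggregate throughput: images processed across all GPUs per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return global_batch_size * steps / elapsed_seconds


def scaling_efficiency(throughput_n, throughput_1, num_gpus):
    """Fraction of ideal linear speedup achieved on num_gpus GPUs."""
    if throughput_1 <= 0 or num_gpus <= 0:
        raise ValueError("throughput_1 and num_gpus must be positive")
    return throughput_n / (throughput_1 * num_gpus)


# Hypothetical inputs for illustration only (not results from this article).
# Assumes a per-GPU batch of 64 images, so 8 GPUs see 64 * 8 images per step.
single_gpu = images_per_second(global_batch_size=64, steps=100, elapsed_seconds=20.0)
eight_gpu = images_per_second(global_batch_size=64 * 8, steps=100, elapsed_seconds=25.0)
efficiency = scaling_efficiency(eight_gpu, single_gpu, num_gpus=8)

print(f"1 GPU : {single_gpu:.1f} images/sec")
print(f"8 GPUs: {eight_gpu:.1f} images/sec")
print(f"Scaling efficiency at 8 GPUs: {efficiency:.0%}")
```

An efficiency of 1.0 would mean perfectly linear scaling. Values below 1.0 reflect communication and input pipeline overheads, which is where the two systems in Table 2 can be expected to differ.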


Figure 1: Running CheXNet training on K8s vs Bare Metal


Summary

The bare metal system demonstrates an 8% increase in performance over the Kubernetes system as we scale out to 8 GPUs. However, differences in system architecture, beyond the container versus bare metal question alone, could account for this small gap. The bare metal system can take advantage of the full bandwidth and low latency of the raw InfiniBand connection and does not incur the overhead of software-defined networking layers such as Flannel. The K8s system also uses IP over InfiniBand, which can reduce available bandwidth.
These numbers may vary depending on the workload and the communication patterns of the applications being run. In an image classification problem such as this one, GPUs exchange data at a high rate, which makes the workload sensitive to interconnect performance. Ultimately, the choice between the two approaches depends on the needs of the workload. Although our Kubernetes-based system carries a small performance penalty, about 8% in this case, it relieves users and administrators of setting up libraries, configurations, environments, and other dependencies. This approach empowers data scientists to be more productive and to focus on solving core business problems such as data wrangling and model building.
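When quoting a gap like this, it helps to state which system is the baseline, because "X% faster" and "X% slower" are not the same number. The sketch below shows both calculations. The helper names and throughput values are hypothetical, picked only so that the faster system leads by exactly 8%; they are not the measured results behind Figure 1.

```python
def percent_faster(fast, slow):
    """Gain of the faster system, relative to the slower one."""
    return (fast - slow) * 100.0 / slow


def percent_slower(fast, slow):
    """Penalty of the slower system, relative to the faster one."""
    return (fast - slow) * 100.0 / fast


# Hypothetical throughputs in images/sec (illustration only).
k8s_throughput = 1000.0
bare_metal_throughput = 1080.0

gain = percent_faster(bare_metal_throughput, k8s_throughput)
penalty = percent_slower(bare_metal_throughput, k8s_throughput)

print(f"Bare metal is {gain:.1f}% faster than K8s")
print(f"K8s is {penalty:.1f}% slower than bare metal")
```

With these placeholder inputs the gain is 8% while the penalty is a little under 7.5%, so the two figures are close at this scale but diverge as the gap grows.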






Article ID: SLN318899

Last Date Modified: 10/14/2019 09:52 PM

