Article was written by Rakshith Vasudev & John Lockman - HPC AI Innovation Lab in October 2019
Table of Contents
Introduction
  1. Bare Metal
  2. Kubernetes
Software Versions
Real World Use Case: CheXNet
Hardware Specifications
Performance
Summary

Introduction

In this article, we evaluate scaling performance when training CheXNet on Nvidia V100 SXM2 GPUs in Dell EMC C4140 servers using two approaches common in modern data centers: the traditional HPC "bare metal" system with an environment built by Anaconda, and a containerized system with Nvidia GPU Cloud (NGC) containers running in an on-prem Kubernetes environment.

Bare Metal

A bare metal system is a traditional HPC cluster where software stacks are installed directly on the local hard disk or on a shared network mount. A system administrator manages the software environments. Users are restricted to building software in a shared /home filesystem. User code is batch scheduled by the Slurm workload manager.

Kubernetes

Our Kubernetes (K8s) system uses Nvidia's NGC containers to provide all required software prerequisites, environment configurations, and so on. The system administrator installs only the base operating system, drivers, and K8s. These Docker-based containers can be downloaded from NGC during the run or stored in a local registry. K8s handles workload management, resource availability, launching distributed jobs, and scaling on demand.

Software Versions

| Component        | NGC Container (nvcr.io/nvidia/tensorflow:19.06-py3) | Conda Env         |
|------------------|-----------------------------------------------------|-------------------|
| Framework        | TensorFlow 1.13.1                                   | TensorFlow 1.12.0 |
| Horovod          | 0.15.1                                              | 0.16.1            |
| MPI              | OpenMPI 3.1.3                                       | OpenMPI 4.0.0     |
| CUDA             | 10.2                                                | 10.1              |
| CUDA Driver      | 430.26                                              | 418.40.04         |
| NCCL             | 2.4.7                                               | 2.4.7             |
| CUDNN            | 7.6.0                                               | 7.6.0             |
| Python           | 3.5.2                                               | 3.6.8             |
| Operating System | Ubuntu 16.04.6                                      | RHEL 7.4          |
| GCC              | 5.4.0                                               | 7.2.0             |

Table 1

Real World Use Case: CheXNet

As introduced previously, CheXNet is an AI radiologist assistant model that uses DenseNet to identify up to 14 pathologies from a given chest x-ray image.
Several approaches were explored to scale out the training of a model that could perform as well as or better than the original CheXNet-121, with ResNet-50 demonstrating promise in both scalability and increased training accuracy (positive AUROC). The authors demonstrated scalability on CPU systems; however, we are interested in exploiting the parallelism of GPUs to accelerate the training process. The Dell EMC PowerEdge C4140 provides both density and performance with four Nvidia V100 GPUs in the SXM2 configuration.

Hardware Specifications

| Component        | Bare Metal System                  | Kubernetes System                      |
|------------------|------------------------------------|----------------------------------------|
| Platform         | PowerEdge C4140                    | PowerEdge C4140                        |
| CPU              | 2 x Intel® Xeon® Gold 6148 @2.4GHz | 2 x Intel® Xeon® Gold 6148 @2.4GHz     |
| Memory           | 384 GB DDR4 @ 2666MHz              | 384 GB DDR4 @ 2666MHz                  |
| Storage          | Lustre                             | NFS                                    |
| GPU              | V100-SXM2 32GB                     | V100-SXM2 32GB                         |
| Operating System | RHEL 7.4 x86_64                    | CentOS 7.6                             |
| Linux Kernel     | 3.10.0-693.x86_64                  | 3.10.0-957.21.3.el7.x86_64             |
| Network          | Mellanox EDR InfiniBand            | Mellanox EDR InfiniBand (IP over IB)   |

Table 2

Performance

Image throughput when training CheXNet, in images per second, was measured using 1, 2, 3, 4, and 8 GPUs across 2 C4140 nodes on both systems described in Table 2. The specifications of the run, including the model architecture and input data, are detailed in this article. Figure 1 compares the measured performance of the Kubernetes system and the bare metal system.

Figure 1: Running CheXNet training on K8s vs Bare Metal

Summary

The bare metal system delivers roughly 8% higher throughput than the Kubernetes system when we scale out to 8 GPUs. However, this slight difference cannot be attributed to containers versus bare metal alone, because the two system architectures also differ in design. The bare metal system can take advantage of the full bandwidth and low latency of the raw InfiniBand connection and does not have to deal with the overhead created by software-defined networks such as Flannel.
The K8s system also uses IP over InfiniBand, which can reduce available bandwidth. These numbers may vary depending on the workload and on the communication patterns of the applications being run. In an image classification problem such as this one, GPUs communicate with each other frequently, so a large volume of data is exchanged between them.

Which approach to use ultimately depends on the needs of the workload. Although our Kubernetes-based system has a small performance penalty, ~8% in this case, it relieves users and administrators from setting up libraries, configurations, environments, and other dependencies. This approach empowers data scientists to be more productive and to focus on solving core business problems such as data wrangling and model building.
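
The summary quotes the gap in two ways: as a gain for bare metal and as a penalty for Kubernetes. These use different baselines, so they are not numerically identical. The sketch below shows the arithmetic. The function names are hypothetical, and the throughput values are illustrative placeholders, not the measured results from Figure 1.

```python
def relative_gain(fast: float, slow: float) -> float:
    """Fractional throughput advantage of the faster system over the slower one."""
    return fast / slow - 1.0


def penalty_from_gain(gain: float) -> float:
    """Fraction of the faster system's throughput that the slower system gives up."""
    return gain / (1.0 + gain)


def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Measured throughput on n GPUs divided by ideal linear scaling from 1 GPU."""
    return throughput_n / (n_gpus * throughput_1)


if __name__ == "__main__":
    # Placeholder images/sec values chosen only to reproduce an 8% gap.
    bare_metal_8gpu = 1080.0
    k8s_8gpu = 1000.0

    gain = relative_gain(bare_metal_8gpu, k8s_8gpu)
    print(f"bare metal gain over K8s: {gain:.1%}")
    print(f"K8s penalty vs bare metal: {penalty_from_gain(gain):.1%}")
```

Under these assumptions, an 8% bare metal gain corresponds to a Kubernetes penalty of about 7.4%, which is consistent with the rounded "~8%" figure. The `scaling_efficiency` helper could be applied to the per-GPU-count throughputs from Figure 1 to compare how close each system comes to linear scaling.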