
How to Speed Up Deep Learning Model Training in the Automotive Sector

Enabling Lane Detection at Scale with NetApp, Run:AI, and Microsoft Azure

Today’s automotive leaders are investing heavily in data-driven software applications to advance the most important innovations in autonomous and connected vehicles, mobility, and manufacturing. These new applications require an orchestration solution and a shared file system for their massive data sets in order to run distributed training of deep learning models on Graphics Processing Units (GPUs). Training AI models in the automotive industry involves enormous numbers of images: each 2D color image is represented as a 3D matrix (height × width × RGB channels) and analyzed at the pixel and color level to detect objects such as pedestrians, other cars, and traffic lights.
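To make the data layout concrete, here is a minimal sketch in plain Python, with a toy 2×2 image standing in for real camera frames: a 2D color image becomes a 3D matrix of shape height × width × 3, where the last axis holds the RGB values inspected per pixel.

```python
# Toy example: a 2x2 color image as a 3D matrix (height x width x RGB).
# Real pipelines batch thousands of such frames, but the layout is the same.
image = [
    [[255, 0, 0], [0, 255, 0]],   # row 0: a red pixel, a green pixel
    [[0, 0, 255], [40, 40, 40]],  # row 1: a blue pixel, a dark gray pixel
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])
print(f"shape: {height} x {width} x {channels}")  # shape: 2 x 2 x 3

# Pixel-level access: the RGB triple at row 1, column 0.
r, g, b = image[1][0]
print(r, g, b)  # 0 0 255
```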

GPUs need to be kept at high utilization in order to reduce training times, allow for fast experimentation, and minimize the cost of usage. In addition, a high-performance, easy-to-use file system that prevents GPUs from waiting for data (known as GPU starvation) is imperative for accelerating model training in the cloud and optimizing cost.
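The idea of keeping GPUs fed can be sketched with a simple prefetching loop (plain Python threads and a bounded queue standing in for a real data loader; the sleep times are made up for illustration): a background thread loads the next batch while the "GPU" works on the current one, so compute never stalls waiting on I/O.

```python
import queue
import threading
import time

def load_batches(q, n_batches):
    """Producer: simulates reading and decoding batches from storage."""
    for i in range(n_batches):
        time.sleep(0.01)  # pretend I/O latency
        q.put(i)
    q.put(None)  # sentinel: no more data

def train(q):
    """Consumer: simulates GPU compute. It never blocks on storage
    as long as the producer keeps the queue non-empty."""
    processed = []
    while True:
        batch = q.get()
        if batch is None:
            break
        time.sleep(0.01)  # pretend GPU step
        processed.append(batch)
    return processed

prefetch_queue = queue.Queue(maxsize=4)  # bounded buffer of ready batches
loader = threading.Thread(target=load_batches, args=(prefetch_queue, 8))
loader.start()
done = train(prefetch_queue)
loader.join()
print(f"trained on {len(done)} batches")  # trained on 8 batches
```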

Run:AI, Microsoft, and NetApp have teamed up to address a lane detection use case by building a distributed training deep learning solution at scale that runs entirely in the Azure cloud. This enables customers and their data scientists to fully embrace Azure’s scaling capabilities and cost benefits for automotive use cases.

Here’s what we did and the tools we used:

  • Azure NetApp Files provided high-performance, low-latency, scalable storage, leveraging snapshots, cloning, and replication.
  • Azure Kubernetes Service (AKS) was used to simplify deploying and orchestrating a managed Kubernetes cluster in Azure.
  • Azure Compute SKUs with GPUs. These GPU-optimized VM sizes are specialized virtual machines available with single or multiple GPUs.
  • Run:AI enabled pooling of GPUs into two logical environments: one for build workloads and one for training workloads. A scheduler manages the compute requests that come from data scientists, enabling elastic scaling from fractions of a GPU to multiple GPUs and multiple nodes of GPUs. The Run:AI platform is built on top of Kubernetes, enabling simple integration with existing IT and data science workflows.
  • Trident integrates natively with AKS and its Persistent Volume framework and was used to seamlessly provision and manage volumes backed by Azure NetApp Files.
  • Finally, ML versioning was done by leveraging Azure NetApp Files Snapshot technology combined with Run:AI, preserving data lineage and allowing data science and data engineering teams to collaborate and share data with their colleagues.
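For instance, the persistent volume a data scientist requests through Trident is an ordinary Kubernetes PersistentVolumeClaim. The sketch below builds one as a Python dict; the storage class name `azure-netapp-files`, the claim name, and the volume size are hypothetical placeholders for illustration, not values from this deployment.

```python
import json

def make_pvc(name, storage_class, size):
    """Build a Kubernetes PersistentVolumeClaim manifest as a dict.
    Trident watches for claims like this and provisions a matching
    Azure NetApp Files volume behind the scenes."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],  # shared across training pods
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

# Hypothetical values for illustration only.
pvc = make_pvc("lane-detection-data", "azure-netapp-files", "4Ti")
print(json.dumps(pvc, indent=2))
```

Serialized to YAML or JSON and applied with `kubectl apply -f`, a claim like this is bound to a dynamically provisioned volume; the actual storage class name depends on how the Trident backend is configured.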

What we found:

By working with Run:AI, Azure, and NetApp, we enabled distributed computation in the cloud: a high-performing distributed training system with tens of GPUs communicating simultaneously in a mesh-like architecture. To optimize cost, we were able to keep them fully occupied at ~95–100% utilization.
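The gradient exchange behind such a system is typically a collective operation such as all-reduce, in which every worker ends up holding the sum of all workers' gradients. Here is a minimal sketch in plain Python, with lists standing in for per-GPU gradient tensors (real systems use optimized communication libraries and ring or tree algorithms to spread traffic across the mesh):

```python
def all_reduce_sum(per_worker_grads):
    """Naive all-reduce: every worker receives the element-wise sum
    of all workers' gradient vectors. Production implementations
    pipeline this across the GPU mesh instead of gathering centrally."""
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [list(summed) for _ in per_worker_grads]

# Four simulated GPUs, each holding its local gradient vector.
grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
reduced = all_reduce_sum(grads)
print(reduced[0])  # [16.0, 20.0] on every worker
```

After the all-reduce, each worker divides by the worker count to get the averaged gradient and applies the same update, keeping all model replicas in sync.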

We were able to saturate GPU utilization and keep GPU cycles as short as possible (GPUs are among the highest-cost components in the architecture). Azure NetApp Files provides performance tiers that guarantee sustained throughput at sub-millisecond latency. We started our distributed training job on a small GPU cluster and later added GPUs to the cluster on demand without interrupting training, using Azure NetApp Files’ dynamic service level change capability together with Run:AI software to keep these GPUs optimally utilized.

Different data science and data engineering (DS/DE) teams were able to use the same dataset for different projects: one team worked on lane detection while another worked on a separate object detection task consuming the same dataset. Researchers and engineers were able to allocate volumes on demand.

We had full visibility into the AI infrastructure: using Run:AI’s platform, we could see all pooled GPUs at the job, project, cluster, and node levels.

Looking to get started?

In this use case, lane detection for autonomous vehicles, we were able to use NetApp, Run:AI, and Azure to create a single, unified experience for accelerating model training in the cloud, reducing costs while improving training times and simplifying processes for data scientists and engineers. Details are available in this technical report and apply to model training across industries and verticals.
