AI/ML Infra Meetup | Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere
August 30, 2024
By Bin Fan

In the rapidly evolving landscape of AI and machine learning, platform teams face critical challenges in managing data at scale. Performance bottlenecks, cost inefficiencies, and operational complexity all weigh on the infrastructure behind large-scale model training and serving.

In this talk, Bin Fan will discuss how I/O stalls lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments, demonstrating how this approach can significantly enhance GPU efficiency.
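
To make the idea concrete, here is a minimal sketch of what the training-side data path might look like in such a setup, assuming the Alluxio cluster is exposed to the training nodes through a POSIX (FUSE) mount; the mount path, dataset class, and loader settings below are illustrative assumptions, not details from the talk.

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

# Hypothetical FUSE mount point where Alluxio exposes training data as
# ordinary files; the actual mount path depends on your deployment.
DATA_ROOT = "/mnt/alluxio/training-data"

class MountedImageDataset(Dataset):
    """Reads images through the Alluxio mount as if they were local files."""

    def __init__(self, root, transform):
        self.paths = [
            os.path.join(root, name)
            for name in sorted(os.listdir(root))
            if name.lower().endswith((".jpg", ".jpeg", ".png"))
        ]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Multiple workers, prefetching, and pinned memory keep the GPU fed while
# Alluxio serves (and caches) the underlying cloud-storage reads.
loader = DataLoader(
    MountedImageDataset(DATA_ROOT, transform),
    batch_size=64,
    num_workers=8,
    pin_memory=True,
    prefetch_factor=4,
)
```

Because the training code only sees local-looking file paths, the same job can run against different cloud storage backends without changes; the caching layer is what moves with the GPUs.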

What you will learn:

  • How to identify GPU utilization and I/O-related performance bottlenecks in model training (see the timing sketch after this list)
  • How to leverage GPUs anywhere to maximize resource utilization
  • Best practices for monitoring and optimizing GPU usage across training and serving pipelines
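
One simple way to surface the I/O stalls mentioned above is to time how long each training step waits on the data loader versus how long the GPU spends computing. The loop below is an illustrative sketch under the assumption of a CUDA device, a loader that yields batched tensors, and a model that accepts them; it is not code from the presentation.

```python
import time
import torch

def profile_io_vs_compute(loader, model, device="cuda", max_steps=100):
    """Rough per-step breakdown of data-loading wait vs. GPU compute time.

    If the wait time dominates, the GPUs are being starved by the input
    pipeline rather than limited by the model itself.
    """
    model = model.to(device).train()
    wait, compute, steps = 0.0, 0.0, 0
    it = iter(loader)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)              # blocks while loader workers fetch data
        except StopIteration:
            break
        t1 = time.perf_counter()

        out = model(batch.to(device, non_blocking=True))
        out.float().mean().backward()     # stand-in loss, enough to exercise the GPU
        torch.cuda.synchronize()          # make GPU work visible to host timers
        t2 = time.perf_counter()

        wait += t1 - t0
        compute += t2 - t1
        steps += 1

    if steps:
        print(f"avg data wait per step:   {wait / steps:.4f}s")
        print(f"avg GPU compute per step: {compute / steps:.4f}s")
        print(f"input-pipeline share:     {wait / (wait + compute):.0%}")
```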

Video:

Presentation Slides:
