AI/ML Infra Meetup | Maximizing GPU Efficiency: Optimizing Model Training with GPUs Anywhere

August 30, 2024

Bin Fan

In the rapidly evolving landscape of AI and machine learning, infrastructure teams face critical challenges in managing large-scale data. Performance bottlenecks, cost inefficiencies, and management complexity all weigh on the AI platform teams that support large-scale model training and serving.

In this talk, Bin Fan will discuss how I/O stalls lead to suboptimal GPU utilization during model training. He will present a reference architecture for running PyTorch jobs with Alluxio in cloud environments and demonstrate how this approach can significantly improve GPU efficiency.

What you will learn:

How to identify GPU utilization and I/O-related performance bottlenecks in model training
How to leverage GPUs anywhere to maximize resource utilization
Best practices for monitoring and optimizing GPU usage across training and serving pipelines
