With data lakes expanding from on-prem to the cloud and the growing adoption of new object stores, data platform teams are challenged with providing consistent, high-throughput access to distributed data sources for analytics and AI/ML applications. In today’s hybrid-cloud and multi-cloud era, data-intensive applications such as Presto, Spark, Hive, and TensorFlow suffer increasingly sluggish response times and added complexity as data and compute grow further apart.
Join Alluxio’s distributed systems experts as they explore today’s data access challenges and open source data orchestration solutions for modernizing your data platform.
In this tech talk, you’ll learn:
- How data access and throughput challenges are hindering large-scale analytics and AI/ML applications
- How a data orchestration layer can simplify distributed data access and improve performance (a sketch follows this list)
- Real-world production use cases and example journeys for architecting a modern data platform
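To make the second point concrete, here is a minimal sketch of how a compute engine can read data through a data orchestration layer rather than directly from the underlying store. This is a hypothetical PySpark snippet, not code from the talk: it assumes an Alluxio cluster whose master is reachable at alluxio-master:19998 and that the Alluxio client jar is on Spark's classpath; the bucket and paths are illustrative.

```python
# Hypothetical PySpark example: reading through a data orchestration
# layer (Alluxio) instead of hitting the remote object store directly.
# Assumes the Alluxio client jar is on the Spark classpath and an
# Alluxio master at alluxio-master:19998 (illustrative hostname).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orchestrated-read").getOrCreate()

# Without orchestration: every job pays the latency/egress cost of the
# remote store.
# df = spark.read.parquet("s3a://my-bucket/warehouse/events/")

# With orchestration: the same data is addressed through the Alluxio
# namespace, and hot data is served from cache colocated with compute.
df = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events/")
df.groupBy("event_type").count().show()
```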
Videos
TorchTitan is a proof-of-concept for large-scale LLM training using native PyTorch. The repository showcases PyTorch's latest distributed training features in a clean, minimal codebase.
In this talk, Tianyu will share TorchTitan’s design and optimizations for the Llama 3.1 family of LLMs, spanning 8 billion to 405 billion parameters, and showcase its performance, composability, and scalability.
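TorchTitan's actual code lives in the pytorch/torchtitan repository; the snippet below is not from the talk, only a minimal sketch of the kind of PyTorch-native composable APIs it builds on, here combining a DeviceMesh with FSDP parameter sharding. The toy model and 1-D mesh shape are illustrative assumptions.

```python
# Minimal sketch (not TorchTitan itself) of PyTorch-native distributed
# building blocks of the kind TorchTitan composes: a DeviceMesh plus
# FSDP sharding. Launch under torchrun with one process per GPU.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A 1-D mesh over all ranks; TorchTitan layers further parallelism
# dimensions (tensor/pipeline parallel) on top of meshes like this.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

model = nn.Sequential(  # stand-in for a Llama-style transformer
    nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)
).cuda()

# Shard parameters, gradients, and optimizer state across the mesh.
model = FSDP(model, device_mesh=mesh)

opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
opt.step()
dist.destroy_process_group()
```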
As large-scale machine learning becomes increasingly GPU-centric, modern high-performance hardware like NVMe storage and RDMA networks (InfiniBand or specialized NICs) are becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization. In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
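The talk presents experimental results rather than code; as a hedged illustration of the access pattern involved, the sketch below shows a PyTorch Dataset reading training samples from a POSIX path where such a Kubernetes-native cache layer might be mounted into the pod (e.g., via a FUSE-style CSI volume). The mount path and file layout are assumptions for illustration.

```python
# Illustrative sketch only: a PyTorch Dataset that reads samples from a
# POSIX mount point where a Kubernetes-native cache layer (backed by
# local NVMe and filled over a high-speed RDMA network) might be exposed
# to the pod. The path /mnt/cache and the .pt file layout are assumptions.
import os
import torch
from torch.utils.data import Dataset, DataLoader

class CachedTensorDataset(Dataset):
    def __init__(self, root="/mnt/cache/train"):
        # Each sample is assumed to be a serialized tensor file; with the
        # cache layer in place, repeat reads hit local NVMe instead of
        # the remote object store.
        self.paths = sorted(
            os.path.join(root, f)
            for f in os.listdir(root)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])

loader = DataLoader(
    CachedTensorDataset(),
    batch_size=32,
    num_workers=8,    # parallel readers help saturate NVMe bandwidth
    pin_memory=True,  # faster host-to-GPU copies, keeping GPUs fed
)
```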