Efficiently Scale LLM Training Across a Large GPU Cluster with Alpa and Ray

- Introduction: This article discusses the machine learning systems challenges posed by large language models (LLMs) and presents an approach that addresses them by combining Alpa and Ray.
- Challenges of LLMs: The parameters of an LLM are too large to fit in the memory of a single device or even a single host, so the model must be partitioned across many devices for efficient training and inference (see the back-of-the-envelope memory estimate after this list).
- Introduction to Alpa: Alpa is a compiler that automatically discovers and executes the best parallelism strategy for an LLM by combining intra-operator and inter-operator parallelism (a usage sketch follows this list).
- Ray Primitives: Ray is a distributed computing framework that simplifies the scaling and management of cluster resources. Ray tasks are stateless remote functions that can be dispatched to any machine in the cluster, and Ray actors are stateful remote classes (a minimal task example appears after this list).
- Advanced Abstractions: Alpa uses Ray actors to build higher-level device management abstractions such as DeviceMesh and the GPU buffer. The Ray collective communication library enables efficient and flexible tensor movement between meshes (a simplified actor sketch follows this list).
- Pipeline Parallelism Runtime Orchestration: Because computation and communication patterns are static in JAX and Alpa, the pipeline-parallel runtime can be orchestrated ahead of time as precompiled instruction lists (an illustrative static schedule is sketched below). Alpa on Ray is a performant and scalable framework for training LLMs, even at a scale of 175 billion parameters.
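
To make the memory challenge concrete, the short calculation below estimates the footprint of a 175-billion-parameter model. The 80 GB GPU size, the fp16 weight size, and the mixed-precision Adam multiplier are common rules of thumb assumed for illustration, not figures from the article.

```python
# Back-of-the-envelope memory estimate for a 175B-parameter model.
# All constants below are illustrative assumptions.
PARAMS = 175e9
WEIGHT_BYTES = 2   # fp16 weights
TRAIN_BYTES = 16   # fp16 weights + fp16 grads + fp32 Adam states (common rule of thumb)
GPU_MEM_GB = 80    # e.g., one 80 GB GPU (assumed)

weights_gb = PARAMS * WEIGHT_BYTES / 1e9   # ~350 GB just for the weights
training_gb = PARAMS * TRAIN_BYTES / 1e9   # ~2,800 GB for the full training state

print(f"weights: {weights_gb:,.0f} GB, training state: {training_gb:,.0f} GB")
print(f"GPUs needed just to hold the training state: {training_gb / GPU_MEM_GB:,.0f}+")
```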
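
The sketch below shows how Alpa's decorator-based API is typically used on top of JAX with a Ray cluster. The toy model, loss function, and data are invented for illustration, and exact arguments may vary between Alpa versions.

```python
# A minimal sketch of Alpa on Ray, assuming alpa, jax, and a running Ray
# cluster are available. The model and data here are toy placeholders.
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # let Alpa manage devices through the Ray cluster

@alpa.parallelize  # Alpa searches for intra-/inter-operator parallelism automatically
def train_step(params, batch):
    def loss_fn(p):
        pred = batch["x"] @ p["w"]
        return jnp.mean((pred - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; Alpa decides how parameters and compute are sharded
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((1024, 1024))}
batch = {"x": jnp.ones((8, 1024)), "y": jnp.ones((8, 1024))}
params = train_step(params, batch)
```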
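
As a concrete example of the task primitive, the snippet below uses Ray's standard remote-function API; it is a generic Ray example rather than code from the article.

```python
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote
def square(x):
    # A stateless task: Ray schedules it on any node with free resources
    return x * x

futures = [square.remote(i) for i in range(4)]  # returns futures immediately
print(ray.get(futures))                         # [0, 1, 4, 9]
```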
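
To illustrate how Ray actors can host per-device state, here is a simplified, hypothetical mesh-worker actor. The class and method names are invented for illustration and are not Alpa's actual DeviceMesh or GPU-buffer API.

```python
# A simplified, hypothetical host-level worker in the spirit of Alpa's
# DeviceMesh/GPU-buffer abstractions. Names are illustrative, not Alpa's API.
import uuid
import numpy as np
import ray

@ray.remote  # in practice this would also reserve GPUs, e.g. num_gpus=...
class MeshHostWorker:
    """One actor per host; owns the buffers that live on that host."""
    def __init__(self):
        self.buffers = {}  # buffer id -> array

    def put_buffer(self, buf_id, array):
        self.buffers[buf_id] = array

    def get_buffer(self, buf_id):
        return self.buffers[buf_id]

# The driver passes lightweight buffer ids around; tensors stay on the worker.
ray.init()
worker = MeshHostWorker.remote()
buf_id = uuid.uuid4().hex
ray.get(worker.put_buffer.remote(buf_id, np.ones((1024, 1024), dtype=np.float16)))
print(ray.get(worker.get_buffer.remote(buf_id)).shape)  # (1024, 1024)
```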
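
Because the pipeline's compute and communication pattern is known ahead of time, the whole schedule can be generated before execution. The function below sketches a GPipe-style forward schedule as an illustration of that idea; it is not Alpa's runtime code.

```python
# Illustrative static pipeline schedule: every (stage, micro-batch) pairing is
# known before execution, so each stage can be driven by a precompiled
# instruction list. This mirrors the idea, not Alpa's implementation.
def forward_schedule(num_stages, num_microbatches):
    """Per clock tick, return the (stage, microbatch) pairs running forward."""
    ticks = []
    for tick in range(num_stages + num_microbatches - 1):
        work = []
        for stage in range(num_stages):
            mb = tick - stage
            if 0 <= mb < num_microbatches:
                work.append((stage, mb))
        ticks.append(work)
    return ticks

# 3 pipeline stages, 4 micro-batches: stage s handles micro-batch (tick - s)
for tick, work in enumerate(forward_schedule(3, 4)):
    print(f"tick {tick}: {work}")
```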