Efficiently Scale LLM Training Across a Large GPU Cluster with Alpa and Ray

- Introduction: This article discusses the machine learning systems challenges posed by large language models (LLMs) and presents an approach that addresses them by combining Alpa and Ray.
- Challenges of LLMs: The parameters of an LLM are too large to fit in the memory of a single device or even a single host, so the model must be partitioned across many devices for efficient training and inference (see the back-of-the-envelope memory estimate after this list).
- Introduction to Alpa: Alpa is a compiler that automatically discovers and executes the best parallelism strategy for an LLM by combining intra-operator and inter-operator parallelism (a usage sketch follows this list).
- Ray Primitives: Ray is a distributed computing framework that simplifies the scaling and management of cluster resources. Ray tasks are stateless remote functions that can be dispatched to any machine in the cluster, and Ray actors are stateful remote classes (a minimal task example appears after this list).
- Advanced Abstractions: Alpa uses Ray actors to build higher-level device management abstractions such as DeviceMesh and the GPU buffer. The Ray collective communication library enables efficient and flexible tensor movement between meshes (a simplified actor sketch follows this list).
- Pipeline Parallelism Runtime Orchestration: Because computation and communication patterns are static in JAX and Alpa, the pipeline-parallel runtime can be orchestrated ahead of time as precompiled instruction lists (an illustrative static schedule is sketched below). Alpa on Ray is a performant and scalable framework for training LLMs, even at a scale of 175 billion parameters.
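
To make the memory challenge concrete, the short calculation below estimates the footprint of a 175-billion-parameter model. The 80 GB GPU size, the fp16 weight size, and the mixed-precision Adam multiplier are common rules of thumb assumed for illustration, not figures from the article.

```python
# Back-of-the-envelope memory estimate for a 175B-parameter model.
# All constants below are illustrative assumptions.
PARAMS = 175e9
WEIGHT_BYTES = 2   # fp16 weights
TRAIN_BYTES = 16   # fp16 weights + fp16 grads + fp32 Adam states (common rule of thumb)
GPU_MEM_GB = 80    # e.g., one 80 GB GPU (assumed)

weights_gb = PARAMS * WEIGHT_BYTES / 1e9   # ~350 GB just for the weights
training_gb = PARAMS * TRAIN_BYTES / 1e9   # ~2,800 GB for the full training state

print(f"weights: {weights_gb:,.0f} GB, training state: {training_gb:,.0f} GB")
print(f"GPUs needed just to hold the training state: {training_gb / GPU_MEM_GB:,.0f}+")
```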
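
The sketch below shows how Alpa's decorator-based API is typically used on top of JAX with a Ray cluster. The toy model, loss function, and data are invented for illustration, and exact arguments may vary between Alpa versions.

```python
# A minimal sketch of Alpa on Ray, assuming alpa, jax, and a running Ray
# cluster are available. The model and data here are toy placeholders.
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # let Alpa manage devices through the Ray cluster

@alpa.parallelize  # Alpa searches for intra-/inter-operator parallelism automatically
def train_step(params, batch):
    def loss_fn(p):
        pred = batch["x"] @ p["w"]
        return jnp.mean((pred - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; Alpa decides how parameters and compute are sharded
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((1024, 1024))}
batch = {"x": jnp.ones((8, 1024)), "y": jnp.ones((8, 1024))}
params = train_step(params, batch)
```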
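
As a concrete example of the task primitive, the snippet below uses Ray's standard remote-function API; it is a generic Ray example rather than code from the article.

```python
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote
def square(x):
    # A stateless task: Ray schedules it on any node with free resources
    return x * x

futures = [square.remote(i) for i in range(4)]  # returns futures immediately
print(ray.get(futures))                         # [0, 1, 4, 9]
```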
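
To illustrate how Ray actors can host per-device state, here is a simplified, hypothetical mesh-worker actor. The class and method names are invented for illustration and are not Alpa's actual DeviceMesh or GPU-buffer API.

```python
# A simplified, hypothetical host-level worker in the spirit of Alpa's
# DeviceMesh/GPU-buffer abstractions. Names are illustrative, not Alpa's API.
import uuid
import numpy as np
import ray

@ray.remote  # in practice this would also reserve GPUs, e.g. num_gpus=...
class MeshHostWorker:
    """One actor per host; owns the buffers that live on that host."""
    def __init__(self):
        self.buffers = {}  # buffer id -> array

    def put_buffer(self, buf_id, array):
        self.buffers[buf_id] = array

    def get_buffer(self, buf_id):
        return self.buffers[buf_id]

# The driver passes lightweight buffer ids around; tensors stay on the worker.
ray.init()
worker = MeshHostWorker.remote()
buf_id = uuid.uuid4().hex
ray.get(worker.put_buffer.remote(buf_id, np.ones((1024, 1024), dtype=np.float16)))
print(ray.get(worker.get_buffer.remote(buf_id)).shape)  # (1024, 1024)
```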
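
Because the pipeline's compute and communication pattern is known ahead of time, the whole schedule can be generated before execution. The function below sketches a GPipe-style forward schedule as an illustration of that idea; it is not Alpa's runtime code.

```python
# Illustrative static pipeline schedule: every (stage, micro-batch) pairing is
# known before execution, so each stage can be driven by a precompiled
# instruction list. This mirrors the idea, not Alpa's implementation.
def forward_schedule(num_stages, num_microbatches):
    """Per clock tick, return the (stage, microbatch) pairs running forward."""
    ticks = []
    for tick in range(num_stages + num_microbatches - 1):
        work = []
        for stage in range(num_stages):
            mb = tick - stage
            if 0 <= mb < num_microbatches:
                work.append((stage, mb))
        ticks.append(work)
    return ticks

# 3 pipeline stages, 4 micro-batches: stage s handles micro-batch (tick - s)
for tick, work in enumerate(forward_schedule(3, 4)):
    print(f"tick {tick}: {work}")
```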