Arcadia: An end-to-end AI system performance simulator

Arcadia is a unified system developed by Meta that simulates the performance of AI training clusters. It models the compute, memory, and network behavior of these clusters, giving researchers and engineers the insight they need to make data-driven decisions in AI cluster design.
Our operational workflows
The availability of the infrastructure underlying AI training clusters can greatly affect training time; a single component failure, for example, can wipe out training progress. A common source of truth about cluster state is therefore essential for keeping operational workflows efficient.
The Arcadia system
The Arcadia system is designed to create a unified simulation framework that accurately models the performance of compute, memory, and network components within large-scale AI training clusters. It takes into account parameters such as network topology, workload distribution, job scheduling, and hardware specifications.
Inputs
The Arcadia system takes various parameters as input, including long-range plans of AI systems and models, network topology and routing protocols, data center floor plans, AI workload distributions, and hardware specifications.
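To make the shape of these inputs concrete, here is a minimal sketch of how such parameters might be grouped. Arcadia is an internal system, so every name below (the dataclasses, their fields, and the example values) is a hypothetical illustration, not Arcadia's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical input schema for a cluster-performance simulator.
# Field names and values are illustrative assumptions, not Arcadia's.

@dataclass
class HardwareSpec:
    gpu_tflops: float   # peak compute per accelerator
    hbm_gb: int         # accelerator memory capacity
    nic_gbps: float     # per-host network bandwidth

@dataclass
class SimulationInputs:
    topology: str                   # e.g. "fat-tree"
    routing: str                    # e.g. "ecmp"
    hardware: HardwareSpec
    # Workload distribution: job type -> share of cluster capacity.
    workload_mix: dict = field(default_factory=dict)

inputs = SimulationInputs(
    topology="fat-tree",
    routing="ecmp",
    hardware=HardwareSpec(gpu_tflops=989.0, hbm_gb=80, nic_gbps=400.0),
    workload_mix={"llm_pretrain": 0.6, "ranking": 0.4},
)
```

Bundling the parameters this way lets a single input object describe one simulation scenario, which makes it easy to sweep over alternative topologies or hardware generations.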
Core
At the core of the Arcadia system is an orchestrator that coordinates the simulation of job scheduling, compute and memory, and network behavior at different levels. This allows stakeholders to analyze the impact of different factors and make informed decisions to optimize system performance.
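The coordination pattern described above resembles a discrete-event simulation: an orchestrator advances a simulated clock and dispatches events to component models. The toy sketch below shows that pattern only; the class names, the two phase models, and their cost formulas are assumptions for illustration and do not come from Arcadia.

```python
import heapq

# Toy discrete-event orchestrator: a priority queue of (time, seq, callback)
# events, popped in time order. The seq counter breaks ties so callbacks
# never need to be compared.
class Orchestrator:
    def __init__(self):
        self.clock = 0.0
        self._events = []
        self._seq = 0

    def schedule(self, delay, callback):
        heapq.heappush(self._events, (self.clock + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            self.clock, _, cb = heapq.heappop(self._events)
            cb(self)

# Simplistic component models (hypothetical): each estimates the duration
# of one phase of a training step from first-principles bandwidth math.
def compute_phase_s(flops, tflops_per_s):
    return flops / (tflops_per_s * 1e12)

def network_phase_s(bytes_moved, gbps):
    return bytes_moved * 8 / (gbps * 1e9)

finish_times = []

def run_job(orch):
    # One training step: a compute phase followed by a gradient exchange.
    step = compute_phase_s(5e15, 1000) + network_phase_s(1e9, 400)
    orch.schedule(step, lambda o: finish_times.append(o.clock))

orch = Orchestrator()
orch.schedule(0.0, run_job)
orch.run()
# finish_times[0] is 5.0 s of compute + 0.02 s of network = 5.02 s
```

A real orchestrator would run far richer models (scheduler queues, memory pressure, packet-level or flow-level network behavior), but the event-queue skeleton is what lets those models interact on one shared timeline.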
Arcadia's benefits
Arcadia provides operational insight and simulation flexibility, enabling the optimization of AI clusters. Its use cases include cluster utilization and fragmentation analysis, measuring the impact of network and hardware choices on job performance, AI job profile analysis, and assessing the reliability, availability, and efficiency of training clusters.
Next steps for Arcadia
While Arcadia already provides valuable insights, there are plans to expand its capabilities, including frameworks for optimizing training cluster maintenance and AI job scheduling and configuration. Additionally, we are investigating a framework for evaluating different topology and routing designs to optimize compute, memory, and network performance for specific clusters.