Meta Engineering

A RoCE network for distributed AI training at scale

Table of Contents

  1. Introduction
  2. Design of AI Zone
  3. Challenges in Traffic Distribution
  4. Mitigation Strategies
  5. Performance Improvement with E-ECMP and QP Scaling
  6. Hierarchical Collectives and ECMP Scalability
  7. Transport-Level Congestion Control and Performance Stability

1. Introduction

Our paper, "RDMA over Ethernet for Distributed AI Training at Meta Scale," discusses the design, implementation, and operation of a RoCEv2 network for distributed AI training. RDMA over Converged Ethernet version 2 (RoCEv2) serves as the inter-node communication transport for our AI network, supporting large-scale AI models such as Llama 3.1 405B.

2. Design of AI Zone

We implemented a two-stage Clos topology, called an AI Zone, to interconnect a large number of GPUs in a non-blocking manner. Each training rack connects to both the frontend (FE) and backend (BE) networks of the data center; the BE fabric uses RoCEv2 for training traffic.
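The sizing arithmetic behind a two-stage (leaf-spine) Clos can be sketched as follows. This is a minimal illustration, not the actual AI Zone dimensions: the class name, the port counts, the 400G link speed, and the assumption of exactly one link from every leaf to every spine are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStageClos:
    """Hypothetical leaf-spine sizing model; all numbers are illustrative."""
    leaf_downlinks: int        # leaf ports facing GPUs
    leaf_uplinks: int          # leaf ports facing spines
    spine_ports: int           # spine ports facing leaves
    link_gbps: float = 400.0   # assumed uniform link speed

    def num_spines(self) -> int:
        # Assumption: one link from each leaf to each spine.
        return self.leaf_uplinks

    def max_leaves(self) -> int:
        # Each leaf consumes one port on every spine.
        return self.spine_ports

    def max_endpoints(self) -> int:
        return self.max_leaves() * self.leaf_downlinks

    def oversubscription(self) -> float:
        # Downlink capacity divided by uplink capacity per leaf.
        # 1.0 means no oversubscription; below 1.0 means spare uplink capacity.
        return self.leaf_downlinks / self.leaf_uplinks


if __name__ == "__main__":
    zone = TwoStageClos(leaf_downlinks=16, leaf_uplinks=16, spine_ports=32)
    print(zone.num_spines(), zone.max_leaves(), zone.max_endpoints(), zone.oversubscription())
```

A ratio of 1.0 corresponds to a leaf whose uplink capacity matches its downlink capacity; doubling the uplinks halves the ratio.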

3. Challenges in Traffic Distribution

Fragmented job placement led to uneven traffic distribution and congestion on rack training switch (RTSW) uplinks, degrading training performance by more than 30%. Failures of uplinks or cluster training switches (CTSWs) forced flows to be reassigned to the remaining paths, further hurting training performance.
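To see why a small number of large flows distributes unevenly, consider hash-based path selection. The sketch below is an illustration only: it uses SHA-256 as a stand-in for a switch ASIC's hash function, and the addresses, source ports, and flow rates are made up. The only real-world constant is UDP destination port 4791, which RoCEv2 uses.

```python
import hashlib
from collections import Counter


def ecmp_pick(five_tuple, num_uplinks):
    """Map a flow's 5-tuple to an uplink index (illustrative hash, not a real ASIC's)."""
    key = "|".join(map(str, five_tuple)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks


def uplink_loads(flows, num_uplinks):
    """Sum the offered load (Gbps) that lands on each uplink."""
    loads = Counter()
    for five_tuple, gbps in flows:
        loads[ecmp_pick(five_tuple, num_uplinks)] += gbps
    return [loads.get(i, 0) for i in range(num_uplinks)]


if __name__ == "__main__":
    # Eight hypothetical 400 Gbps flows sharing eight uplinks.
    flows = [
        ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP"), 400)
        for i in range(8)
    ]
    loads = uplink_loads(flows, num_uplinks=8)
    print("per-uplink load (Gbps):", loads)
    print("heaviest uplink (Gbps):", max(loads))
```

With as many flows as uplinks, a perfectly even assignment is unlikely under hashing, so some uplinks typically carry two or more flows while others sit idle.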

4. Mitigation Strategies

To address these challenges, we provisioned the RTSW uplinks with 2x the bandwidth of the downlinks. We also adopted Enhanced ECMP (E-ECMP) with queue pair (QP) scaling, which alleviated flow collisions and improved AllReduce collective performance by up to 40%.

5. Performance Improvement with E-ECMP and QP Scaling

By splitting each message across multiple queue pairs (QPs), we increased the number of network flows available for hashing, which improved ECMP load balancing and scalability. However, ECMP hashing is probabilistic, so flow collisions can still occur.
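The effect of QP scaling on load balance can be sketched with a small simulation. It assumes that the switch hash includes a per-QP field, so each QP of a connection can land on a different link; the hash function, connection counts, and rates are hypothetical and chosen only for illustration.

```python
import hashlib


def hash_to_link(fields, num_links):
    """Illustrative stand-in for a switch hash over flow fields."""
    key = "|".join(map(str, fields)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_links


def link_loads(num_pairs, qps_per_pair, num_links, gbps_per_pair=400.0):
    """Spread each GPU pair's traffic evenly over its QPs, then hash every QP to a link."""
    loads = [0.0] * num_links
    for pair in range(num_pairs):
        for qp in range(qps_per_pair):
            # Assumption: the QP identifier is part of the hash input.
            link = hash_to_link((pair, qp), num_links)
            loads[link] += gbps_per_pair / qps_per_pair
    return loads


if __name__ == "__main__":
    for qps in (1, 4, 16):
        loads = link_loads(num_pairs=8, qps_per_pair=qps, num_links=8)
        print(f"{qps:>2} QPs per pair: heaviest link carries {max(loads):.0f} Gbps")
```

More QPs per connection give the hash more, smaller flows to place, so the heaviest link tends to move toward the average load. Because placement is still random, an even spread is not guaranteed for any particular set of flows.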

6. Hierarchical Collectives and ECMP Scalability

Hierarchical collectives such as AllReduce benefited from QP scaling. For 400G deployments, however, default DCQCN settings with ECN thresholds doubled relative to 200G networks degraded performance.
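For readers unfamiliar with hierarchical collectives, a two-level AllReduce can be sketched as below. This is a single-process simulation under simplifying assumptions (a sum reduction, equal-sized nodes, one leader per node), not the implementation used by any collective library; the function name and arguments are hypothetical.

```python
def hierarchical_allreduce(values, ranks_per_node):
    """Simulate a two-level sum AllReduce.

    values: one equal-length list of numbers per rank, ordered by rank.
    Returns the reduced vector as seen by every rank.
    """
    if len(values) % ranks_per_node != 0:
        raise ValueError("rank count must be a multiple of ranks_per_node")

    # Group ranks by node.
    nodes = [
        values[i:i + ranks_per_node]
        for i in range(0, len(values), ranks_per_node)
    ]

    # Stage 1: reduce inside each node (stays off the scale-out network).
    node_sums = [[sum(col) for col in zip(*node)] for node in nodes]

    # Stage 2: AllReduce among one leader per node (crosses the network).
    total = [sum(col) for col in zip(*node_sums)]

    # Stage 3: each leader broadcasts the result inside its node.
    return [list(total) for _ in values]


if __name__ == "__main__":
    per_rank = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(hierarchical_allreduce(per_rank, ranks_per_node=2))
```

Only stage 2 generates inter-node traffic, so the number of network flows is tied to the number of nodes rather than the number of ranks, which is one reason flow entropy matters for this pattern.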

7. Transport-Level Congestion Control and Performance Stability

Training collectives ran stably, without persistent congestion, with Priority Flow Control (PFC) as the sole flow control mechanism. We did not deploy transport-level congestion control, as performance remained stable over a year of operation.

