Demystifying AI Inference Deployments for Trillion Parameter Large Language Models

Table of Contents

  1. Introduction
  2. Maximizing User Interactivity
  3. Exploring the Inference Space for Trillion-parameter MoE Models
    • Data Parallelism
    • Tensor Parallelism
    • Pipeline Parallelism
    • Expert Parallelism

Introduction

NVIDIA Cloud Partners support enterprises in their AI deployments, offering the latest generation of NVIDIA GPUs so that services such as the NVIDIA AI platform can be adopted quickly. However, deployments tuned for high throughput can suffer from low user interactivity, slowing the rate at which users receive readable responses from large language models (LLMs).

Maximizing User Interactivity

To enhance user interactivity, smaller batches of user requests are fed to each GPU, allocating more GPU resources to every request. This enables more parallel compute per request and faster token generation, but it can leave GPU resources underutilized. Balancing GPU throughput against user interactivity is therefore crucial when deploying trillion-parameter LLMs that cannot fit on a single GPU.
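This tradeoff can be made concrete with a toy model. The sketch below is not taken from the article; it assumes a hypothetical decode step whose latency has a fixed cost plus a small per-request cost, which is roughly how a memory-bandwidth-bound generation step behaves, and all numbers are placeholders.

```python
# Illustrative sketch (not NVIDIA's model): how batch size trades aggregate
# throughput against per-user interactivity for a decode step whose latency
# is assumed to be a fixed cost plus a small per-request cost.

def decode_step_latency_ms(batch_size, base_ms=20.0, per_request_ms=0.5):
    # Hypothetical cost model: a fixed cost per step (e.g., streaming the
    # weights) plus a small marginal cost for each request in the batch.
    return base_ms + per_request_ms * batch_size

for batch_size in (1, 8, 64, 256):
    step_ms = decode_step_latency_ms(batch_size)
    per_user_tok_s = 1000.0 / step_ms            # tokens/sec seen by one user
    total_tok_s = per_user_tok_s * batch_size    # aggregate GPU throughput
    print(f"batch={batch_size:4d}  per-user={per_user_tok_s:6.1f} tok/s  "
          f"total={total_tok_s:8.1f} tok/s")
```

Under these assumed costs, growing the batch multiplies aggregate tokens per second while each individual user sees slower token generation, which is exactly the throughput-versus-interactivity tension described above.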

Exploring the Inference Space for Trillion-parameter MoE Models

For a model such as GPT 1.8T MoE, with 16 experts and a fixed budget of 64 GPUs, different parallelism methods can be employed to distribute the inference work. Each method affects GPU throughput and user interactivity differently:
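To give a rough sense of how large this design space is, the sketch below enumerates candidate ways to split a 64-GPU budget across tensor (TP), expert (EP), pipeline (PP), and data (DP) parallelism. The constraints used here (the parallelism degrees are powers of two that multiply to 64, and expert parallelism is bounded by the 16 experts) are assumptions for illustration; a real search would also filter by per-GPU memory and interconnect limits.

```python
# Hypothetical sketch: enumerate mappings of a fixed 64-GPU budget onto
# tensor (TP), expert (EP), pipeline (PP), and data (DP) parallelism.
from itertools import product

GPUS, NUM_EXPERTS = 64, 16
configs = [
    dict(tp=tp, ep=ep, pp=pp, dp=dp)
    for tp, ep, pp, dp in product([1, 2, 4, 8, 16, 32, 64], repeat=4)
    if tp * ep * pp * dp == GPUS and ep <= NUM_EXPERTS
]
print(len(configs), "candidate mappings, e.g.", configs[:3])
```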

Data Parallelism

  • Hosts multiple copies of the model on different GPUs so that user requests are processed independently.
  • Each GPU runs its own copy of the model independently, so this method has little effect on GPU throughput or user interactivity on its own.
  • Because a trillion-parameter model's weights do not fit on a single GPU, data parallelism is usually combined with other parallelism methods (a minimal sketch of the pattern follows this list).
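The sketch below illustrates the data-parallel pattern, assuming a toy Replica class in place of a real model loader: each "GPU" holds a full copy of the model, and requests are dispatched round-robin, so throughput scales with the number of copies while the handling of any single request is unchanged.

```python
# Minimal data-parallelism sketch. `Replica` is a stand-in for a per-GPU
# model copy; in a real deployment each replica would load the full weights
# onto its own GPU and serve requests independently.
from itertools import cycle

class Replica:
    def __init__(self, gpu_id):
        self.gpu_id = gpu_id          # in practice: load the model on this GPU

    def generate(self, prompt):
        return f"[gpu{self.gpu_id}] response to {prompt!r}"

replicas = cycle([Replica(i) for i in range(4)])  # e.g., 4 GPUs, 4 model copies

def serve(prompt):
    # Round-robin dispatch: more replicas means more total throughput, while
    # each individual request is still handled by a single, independent copy.
    return next(replicas).generate(prompt)

print(serve("hello"))
print(serve("world"))
```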

Tensor Parallelism

  • Splits each model layer across multiple GPUs, which jointly process the shared user requests.
  • Partial results are recombined over a GPU-to-GPU network, offering 73 possible configurations with varied throughput and user-interactivity tradeoffs.
  • Lightens the compute load on each GPU, but the high memory-bandwidth and communication demands can leave GPU compute resources underutilized (see the sketch after this list).
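The following NumPy sketch illustrates the tensor-parallel idea on a single linear layer, with plain arrays standing in for GPUs: the weight matrix is split column-wise, each shard computes a partial output, and the partials are concatenated, which is the role the GPU-to-GPU network plays in a real deployment.

```python
# Toy tensor-parallelism sketch for one linear layer (arrays stand in for GPUs).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))         # batch of activations
W = rng.standard_normal((1024, 4096))      # full layer weight

shards = np.split(W, 4, axis=1)            # TP=4: each "GPU" holds 1/4 of W
partials = [x @ w_shard for w_shard in shards]  # computed in parallel per GPU
y_tp = np.concatenate(partials, axis=1)    # recombined over the GPU-to-GPU link

assert np.allclose(y_tp, x @ W)            # matches the single-GPU result
```

In practice, frameworks perform this recombination with collective operations such as all-gather or all-reduce, and the cost of that communication is part of why tensor parallelism can leave compute resources underutilized.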

These parallelism methods allow for the efficient deployment of trillion-parameter LLMs by optimizing GPU resource allocation and balancing throughput with user interactivity.