
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference

The key takeaways from the post, summarized in Markdown format, are as follows.

- NVLink and NVSwitch for Supercharging Large Language Model Inference:
  - Multi-GPU compute is essential for meeting real-time latency requirements for serving large language models (LLMs).
  - NVIDIA NVLink and NVSwitch enable faster communication between GPUs, enhancing multi-GPU scaling and overall inference throughput.
  - Tensor parallelism splits each model layer across multiple GPUs, providing the high throughput needed for real-time responses (see the sketch after this list).
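
To make the tensor-parallelism point concrete, below is a minimal sketch of a row-parallel linear layer in PyTorch, assuming an NCCL backend and a single-node `torchrun` launch; the script name, shapes, and sizes are illustrative, not taken from the post. The `all_reduce` that sums the partial outputs is precisely the GPU-to-GPU traffic that NVLink and NVSwitch accelerate.

```python
# tp_sketch.py -- minimal row-parallel linear layer (illustrative, not
# from the NVIDIA post). Run on a single node with multiple GPUs, e.g.:
#   torchrun --nproc_per_node=2 tp_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()  # tensor-parallel degree, e.g. TP=2
    torch.cuda.set_device(rank)    # assumes one process per local GPU

    d_model, d_ff = 4096, 16384
    shard = d_ff // world          # each GPU holds a slice of the weight

    x = torch.randn(1, shard, device="cuda")        # local activation slice
    w = torch.randn(shard, d_model, device="cuda")  # local weight shard

    # Each GPU computes a partial output; summing the partials with an
    # all-reduce is the inter-GPU communication NVLink/NVSwitch speed up.
    partial = x @ w
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"TP={world}, output shape {tuple(partial.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```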

- NVSwitch's Role in Fast Multi-GPU LLM Inference:
  - NVIDIA Hopper architecture GPUs communicate at up to 900 GB/s over NVLink, and NVSwitch lets every GPU in a server communicate with every other GPU at this full bandwidth simultaneously.
  - NVSwitch boosts inference throughput, especially for models that benefit from greater GPU-to-GPU communication traffic.
  - Real-time inference throughput with NVSwitch on NVIDIA H200 GPUs at tensor parallelism TP=2 is up to 1.5x higher than on otherwise identical configurations without NVSwitch (a back-of-envelope traffic estimate follows this list).
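
As a rough illustration of why that bandwidth matters, the following back-of-envelope estimate computes the all-reduce traffic per generated token; the layer count, hidden size, and ring all-reduce cost model are assumptions for a hypothetical 70B-class model, not figures from the post.

```python
# Back-of-envelope: all-reduce traffic per generated token under tensor
# parallelism. All figures are illustrative assumptions, not from the post:
# a 70B-class model with 80 layers, hidden size 8192, FP16 activations,
# two all-reduces per layer (after attention and after the MLP), and a
# ring all-reduce that moves ~2*(N-1)/N bytes per byte of payload.
hidden = 8192
layers = 80
bytes_per_elem = 2           # FP16
allreduces_per_layer = 2
tp = 2                       # tensor-parallel degree

payload = hidden * bytes_per_elem * allreduces_per_layer * layers
ring_factor = 2 * (tp - 1) / tp
traffic = payload * ring_factor      # bytes per token over the interconnect

nvlink_bw = 900e9                    # Hopper NVLink bandwidth, bytes/s
print(f"all-reduce traffic per token: {traffic / 1e6:.2f} MB")
print(f"lower-bound comm time per token: {traffic / nvlink_bw * 1e6:.1f} us")
```

This communication repeats for every generated token of every request, which is why reducing contention with NVSwitch translates directly into serving throughput.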

- Continued NVLink Innovation for Trillion-Parameter Model Inference:
  - The NVIDIA GB200 NVL72 system connects 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs into a single NVLink domain using fifth-generation NVLink.
  - This lets all 72 GPUs act as a single unit, delivering up to 30x faster real-time trillion-parameter LLM inference than the prior Hopper generation (see the memory estimate below).
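
For a sense of scale, here is a quick weight-memory estimate for a trillion-parameter model; the parameter count (1.8T) and FP4 storage are hypothetical assumptions for illustration, not figures from the post, and the estimate ignores KV cache and activations.

```python
# Back-of-envelope: weight memory for a trillion-parameter model spread
# across one GB200 NVL72 NVLink domain. Assumptions (illustrative): 1.8T
# parameters stored in FP4 (0.5 bytes each); KV cache, activations, and
# framework overhead would add substantially on top of this.
params = 1.8e12
bytes_per_param = 0.5    # FP4 weights
gpus = 72                # GPUs in one NVL72 NVLink domain

weights_tb = params * bytes_per_param / 1e12
print(f"weights alone: {weights_tb:.2f} TB")
print(f"per-GPU share across {gpus} GPUs: {weights_tb * 1e3 / gpus:.1f} GB")
```

Since no single GPU can hold these weights, the model must be sharded; fifth-generation NVLink keeps the resulting cross-GPU traffic fast enough for all 72 GPUs to behave as one accelerator.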