
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference

The key takeaways from the post, summarized in Markdown format, are as follows.

- NVLink and NVSwitch for Supercharging Large Language Model Inference:
  - Multi-GPU compute is essential for meeting real-time latency requirements for serving large language models (LLMs).
  - NVIDIA NVLink and NVSwitch enable faster communication between GPUs, enhancing multi-GPU scaling and overall inference throughput.
  - Tensor parallelism splits each model layer across multiple GPUs, providing the high throughput needed for real-time responses (see the sketch after this list).
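
To make the tensor-parallelism point concrete, below is a minimal sketch of a row-parallel linear layer in PyTorch, assuming an NCCL backend and a single-node `torchrun` launch; the script name, shapes, and sizes are illustrative, not taken from the post. The `all_reduce` that sums the partial outputs is precisely the GPU-to-GPU traffic that NVLink and NVSwitch accelerate.

```python
# tp_sketch.py -- minimal row-parallel linear layer (illustrative, not
# from the NVIDIA post). Run on a single node with multiple GPUs, e.g.:
#   torchrun --nproc_per_node=2 tp_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()  # tensor-parallel degree, e.g. TP=2
    torch.cuda.set_device(rank)    # assumes one process per local GPU

    d_model, d_ff = 4096, 16384
    shard = d_ff // world          # each GPU holds a slice of the weight

    x = torch.randn(1, shard, device="cuda")        # local activation slice
    w = torch.randn(shard, d_model, device="cuda")  # local weight shard

    # Each GPU computes a partial output; summing the partials with an
    # all-reduce is the inter-GPU communication NVLink/NVSwitch speed up.
    partial = x @ w
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"TP={world}, output shape {tuple(partial.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```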

- NVSwitch's Role in Fast Multi-GPU LLM Inference:
  - NVIDIA Hopper architecture GPUs communicate at up to 900 GB/s over NVLink, and NVSwitch lets every GPU in a server communicate with every other GPU at this full bandwidth simultaneously.
  - NVSwitch boosts inference throughput, especially for models that benefit from greater GPU-to-GPU communication traffic.
  - Real-time inference throughput with NVSwitch on NVIDIA H200 GPUs at tensor parallelism TP=2 is up to 1.5x higher than on otherwise identical configurations without NVSwitch (a back-of-envelope traffic estimate follows this list).
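
As a rough illustration of why that bandwidth matters, the following back-of-envelope estimate computes the all-reduce traffic per generated token; the layer count, hidden size, and ring all-reduce cost model are assumptions for a hypothetical 70B-class model, not figures from the post.

```python
# Back-of-envelope: all-reduce traffic per generated token under tensor
# parallelism. All figures are illustrative assumptions, not from the post:
# a 70B-class model with 80 layers, hidden size 8192, FP16 activations,
# two all-reduces per layer (after attention and after the MLP), and a
# ring all-reduce that moves ~2*(N-1)/N bytes per byte of payload.
hidden = 8192
layers = 80
bytes_per_elem = 2           # FP16
allreduces_per_layer = 2
tp = 2                       # tensor-parallel degree

payload = hidden * bytes_per_elem * allreduces_per_layer * layers
ring_factor = 2 * (tp - 1) / tp
traffic = payload * ring_factor      # bytes per token over the interconnect

nvlink_bw = 900e9                    # Hopper NVLink bandwidth, bytes/s
print(f"all-reduce traffic per token: {traffic / 1e6:.2f} MB")
print(f"lower-bound comm time per token: {traffic / nvlink_bw * 1e6:.1f} us")
```

This communication repeats for every generated token of every request, which is why reducing contention with NVSwitch translates directly into serving throughput.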

- Continued NVLink Innovation for Trillion-Parameter Model Inference:
  - The NVIDIA GB200 NVL72 system connects 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs into a single NVLink domain using fifth-generation NVLink.
  - This lets all 72 GPUs act as a single unit, delivering up to 30x faster real-time trillion-parameter LLM inference than the prior Hopper generation (see the memory estimate below).
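
For a sense of scale, here is a quick weight-memory estimate for a trillion-parameter model; the parameter count (1.8T) and FP4 storage are hypothetical assumptions for illustration, not figures from the post, and the estimate ignores KV cache and activations.

```python
# Back-of-envelope: weight memory for a trillion-parameter model spread
# across one GB200 NVL72 NVLink domain. Assumptions (illustrative): 1.8T
# parameters stored in FP4 (0.5 bytes each); KV cache, activations, and
# framework overhead would add substantially on top of this.
params = 1.8e12
bytes_per_param = 0.5    # FP4 weights
gpus = 72                # GPUs in one NVL72 NVLink domain

weights_tb = params * bytes_per_param / 1e12
print(f"weights alone: {weights_tb:.2f} TB")
print(f"per-GPU share across {gpus} GPUs: {weights_tb * 1e3 / gpus:.1f} GB")
```

Since no single GPU can hold these weights, the model must be sharded; fifth-generation NVLink keeps the resulting cross-GPU traffic fast enough for all 72 GPUs to behave as one accelerator.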