Announcing NVIDIA DGX GH200: The First 100 Terabyte GPU Memory System

NVIDIA DGX GH200: The First 100 Terabyte GPU Memory System

  • NVIDIA introduces its newest GPU system, the DGX GH200, the first supercomputer to break the 100-terabyte barrier for memory accessible to GPUs over NVLink.
  • The DGX GH200 system is designed to empower scientists in need of an advanced platform that can solve extraordinary challenges.
  • The system is built with the NVIDIA Grace Hopper Superchip and the NVLink Switch System, which combine to form a giant, data-center-sized GPU.
  • The architecture supports up to 256 GPUs in a DGX GH200 system, providing nearly 500x more memory to the GPU shared memory programming model over NVLink than a single NVIDIA DGX A100 320 GB system (see the Unified Memory sketch after this list).
  • NVIDIA Base Command provides an OS optimized for AI workloads, a cluster manager, and libraries that accelerate compute, storage, and network infrastructure.
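
At the programming level, the "GPU shared memory programming model over NVLink" referenced above is exposed through CUDA Unified Memory. The following is a minimal sketch under that assumption, not code from the announcement: a single managed allocation can exceed one GPU's local HBM, and the CUDA runtime places and migrates pages across the NVLink-attached memory. The 64 GiB buffer size is a placeholder; adjust it to your system.

```cuda
// Minimal Unified Memory sketch (assumed illustration, not NVIDIA's code).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void init(float *data, size_t n, float v) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = v;
}

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    // Hypothetical size: pick something larger than a single GPU's HBM on
    // your system to exercise oversubscription (64 GiB of floats here).
    const size_t n = 1ull << 34;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    const int threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    init<<<blocks, threads>>>(data, n, 1.0f);   // first touch on the GPU
    scale<<<blocks, threads>>>(data, n, 2.0f);  // same pointer, no copies
    cudaDeviceSynchronize();

    printf("data[0] = %f (expected 2.0)\n", data[0]);  // page migrates to CPU
    cudaFree(data);
    return 0;
}
```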

NVLink Technology and Unified Memory Programming Model

  • NVLink technology, introduced in 2016, works together with the CUDA Unified Memory programming model (introduced with CUDA 6) to increase the memory available to GPU-accelerated workloads.
  • The core of every DGX system is a GPU complex on a baseboard interconnected with NVLink, allowing each GPU to access any other GPU's memory at NVLink speed (see the peer-access sketch below).
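
The GPU-to-GPU access described above is exposed through CUDA peer-to-peer access. Below is a minimal sketch, assuming a node with at least two NVLink-connected GPUs; the device indices 0 and 1 are placeholders. Once peer access is enabled, a kernel running on GPU 0 can dereference a pointer allocated on GPU 1, with the loads and stores traveling over NVLink.

```cuda
// Minimal peer-access sketch (assumed illustration for a two-GPU node).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *buf, size_t n, float v) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = v;
}

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) { printf("GPU 0 cannot access GPU 1\n"); return 1; }

    const size_t n = 1 << 20;
    float *remote = nullptr;

    cudaSetDevice(1);                        // allocate on GPU 1
    cudaMalloc(&remote, n * sizeof(float));

    cudaSetDevice(0);                        // run on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0
    fill<<<(unsigned)((n + 255) / 256), 256>>>(remote, n, 3.0f);  // writes cross NVLink
    cudaDeviceSynchronize();

    float host = 0.0f;
    cudaMemcpy(&host, remote, sizeof(float), cudaMemcpyDefault);
    printf("remote[0] = %f\n", host);

    cudaSetDevice(1);
    cudaFree(remote);
    return 0;
}
```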

NVIDIA Grace Hopper Superchip and NVLink Switch System

  • NVIDIA Grace Hopper Superchip combines the Grace and Hopper architectures using NVIDIA NVLink-C2C to deliver a CPU + GPU coherent memory model.
  • The Grace CPU and Hopper GPU are interconnected with NVLink-C2C, which provides 7x more bandwidth than PCIe Gen5 at one-fifth the power.
  • NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system.
  • Every GPU in DGX GH200 can access the memory of other GPUs and the extended GPU memory of all NVIDIA Grace CPUs at 900 GB/s (a bandwidth-measurement sketch follows this list).
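
The 900 GB/s figure is NVIDIA's stated NVLink bandwidth. As a rough check of what a particular GPU pair actually delivers, the sketch below times repeated peer-to-peer copies with CUDA events; it assumes at least two visible GPUs and simply reports whatever bandwidth the local topology achieves.

```cuda
// Rough peer-to-peer bandwidth check (assumed illustration; no warmup,
// so treat the reported number as indicative only).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;         // 1 GiB per transfer
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // allow direct GPU-to-GPU traffic

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (10.0 * bytes) / (ms / 1000.0) / 1e9;
    printf("GPU0 -> GPU1: %.1f GB/s\n", gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(src); cudaSetDevice(1); cudaFree(dst);
    return 0;
}
```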

Full-Stack NVIDIA Solution

  • DGX GH200 comes with NVIDIA Base Command, which includes an OS optimized for AI workloads, a cluster manager, and libraries that accelerate compute, storage, and network infrastructure.
  • BlueField-3 DPUs can transform any enterprise computing environment into a secure and accelerated virtual private cloud, enabling organizations to run application workloads in secure, multi-tenant environments.

Supercharging AI and HPC Workloads

  • Paired with BlueField-3 DPUs, the DGX H100 remains the most performance-efficient training solution for mainstream enterprise workloads.
  • The DGX GH200 is better suited to advanced AI and HPC models that require massive memory for GPU shared memory programming.
  • NVIDIA is working to make DGX GH200 available at the end of this year.