Announcing NVIDIA DGX GH200: The First 100 Terabyte GPU Memory System

NVIDIA DGX GH200: The First 100 Terabyte GPU Memory System
- NVIDIA introduces its newest GPU system, the DGX GH200, the first supercomputer to break the 100-terabyte barrier for memory accessible to GPUs over NVLink.
- The DGX GH200 system is designed to empower scientists in need of an advanced platform that can solve extraordinary challenges.
- The system is built from NVIDIA Grace Hopper Superchips and the NVLink Switch System, which combine to form a single, data center-sized GPU.
- The architecture supports up to 256 GPUs in a DGX GH200 system, giving the GPU shared memory programming model nearly 500x more memory over NVLink than a single NVIDIA DGX A100 320 GB system.
- NVIDIA Base Command provides an OS optimized for AI workloads, a cluster manager, and libraries that accelerate compute, storage, and network infrastructure.
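The memory figures above can be sanity-checked with a short back-of-the-envelope calculation. This is only a sketch: the per-superchip capacities (96 GB of GPU memory and 480 GB of CPU memory) are assumptions that this summary does not state, and the variable names are illustrative.

```python
# Rough check of the aggregate NVLink-addressable memory in a full system.
# Per-superchip capacities below are ASSUMPTIONS, not figures from this summary.
NUM_SUPERCHIPS = 256        # from the text: up to 256 GPUs per DGX GH200
GPU_GB_PER_SUPERCHIP = 96   # assumed Hopper GPU memory per superchip
CPU_GB_PER_SUPERCHIP = 480  # assumed Grace CPU memory per superchip
DGX_A100_GB = 320           # from the text: DGX A100 320 GB system


def aggregate_memory_gb(num_superchips, gpu_gb, cpu_gb):
    """Total GPU plus CPU memory across all superchips, in GB."""
    return num_superchips * (gpu_gb + cpu_gb)


total_gb = aggregate_memory_gb(
    NUM_SUPERCHIPS, GPU_GB_PER_SUPERCHIP, CPU_GB_PER_SUPERCHIP
)
total_tb = total_gb / 1024  # treating 1 TB as 1024 GB
ratio = total_gb / DGX_A100_GB

print(f"{total_gb} GB = {total_tb:.0f} TB, about {ratio:.0f}x a DGX A100 320 GB")
```

Under these assumed capacities the total comes to 144 TB, which clears the 100-terabyte mark and is roughly 460 times the 320 GB of a DGX A100, in line with the "nearly 500x" claim.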
NVLink Technology and Unified Memory Programming Model
- NVLink technology was introduced in 2016. Together with the Unified Memory programming model introduced in CUDA 6, it was designed to increase the memory available to GPU-accelerated workloads.
- The core of every DGX system is a GPU complex on a baseboard interconnected with NVLink, allowing each GPU to access the others’ memory at NVLink speed.
NVIDIA Grace Hopper Superchip and NVLink Switch System
- NVIDIA Grace Hopper Superchip combines the Grace and Hopper architectures using NVIDIA NVLink-C2C to deliver a CPU + GPU coherent memory model.
- NVIDIA Grace CPU and Hopper GPU are interconnected with NVLink-C2C, providing 7x more bandwidth than PCIe Gen5 at one-fifth the power.
- NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH200 system.
- Every GPU in a DGX GH200 can access the memory of the other GPUs and the extended GPU memory of all NVIDIA Grace CPUs at 900 GB/s.
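To put the bandwidth figures above in perspective, the following sketch compares idealized transfer times. It assumes that the PCIe Gen5 baseline is a x16 link at roughly 128 GB/s bidirectional and that NVLink-C2C runs at the same 900 GB/s quoted for the fabric; neither assumption is stated in this summary, and the 1000 GB working set is purely hypothetical. Latency and protocol overhead are ignored.

```python
# Idealized transfer-time comparison; ignores latency and protocol overhead.
NVLINK_GB_PER_S = 900         # from the text: 900 GB/s memory access
PCIE_GEN5_X16_GB_PER_S = 128  # ASSUMPTION: x16 link, both directions combined


def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb at a sustained bandwidth, in seconds."""
    return size_gb / bandwidth_gb_per_s


bandwidth_ratio = NVLINK_GB_PER_S / PCIE_GEN5_X16_GB_PER_S

working_set_gb = 1000  # hypothetical working set, for illustration only
t_nvlink = transfer_seconds(working_set_gb, NVLINK_GB_PER_S)
t_pcie = transfer_seconds(working_set_gb, PCIE_GEN5_X16_GB_PER_S)

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"NVLink: {t_nvlink:.2f} s, PCIe Gen5 x16: {t_pcie:.2f} s")
```

With these assumed numbers the ratio works out to about 7x, consistent with the bandwidth advantage cited for NVLink-C2C over PCIe Gen5.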
Full-Stack NVIDIA Solution
- DGX GH200 comes with NVIDIA Base Command, which includes an OS optimized for AI workloads, cluster manager, and libraries that accelerate compute, storage, and network infrastructure.
- BlueField-3 DPUs can transform any enterprise computing environment into a secure and accelerated virtual private cloud, enabling organizations to run application workloads in secure, multi-tenant environments.
Supercharging AI and HPC Workloads
- The power of BlueField-3 DPUs makes the DGX H100 the most performance-efficient training solution for enterprise workloads.
- The DGX GH200 is better suited to more advanced AI and HPC models that require massive memory for GPU shared memory programming.
- NVIDIA is working to make DGX GH200 available at the end of this year.