NVIDIA Technical Blog

Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0

Table of Contents

  1. New Features and Interface Support in NVSHMEM 3.0
  2. CPU-assisted InfiniBand GPU Direct Async
  3. Performance Improvements and Bug Fixes
  4. Summary

New Features and Interface Support in NVSHMEM 3.0

  • Introduction of multi-node, multi-interconnect support
  • Host-device ABI backward compatibility
  • CPU-assisted InfiniBand GPU Direct Async (IBGDA)
  • Platform support for multiple racks of NVIDIA GB200 NVL72 systems connected through RDMA networks
  • Enhanced capabilities for NVLink communication within the same NVLink clique spanning multiple nodes
  • Backward compatibility across NVSHMEM minor versions
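
To make the minor-version compatibility item concrete, here is a minimal sketch of the kind of check such a guarantee implies. It is not NVSHMEM code. The rule it encodes (same major version, and an installed minor version at least as new as the one the application was built against) is an assumption about how the guarantee is meant to work, and the names `parse_version` and `is_compatible` are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR' or 'MAJOR.MINOR.PATCH'; the patch level is ignored."""
    major, minor, *_ = text.split(".")
    return Version(int(major), int(minor))


def is_compatible(built_against: str, installed: str) -> bool:
    """Assumed rule (not taken from NVSHMEM itself): an application built
    against X.a can run on an installed library X.b when b >= a."""
    built = parse_version(built_against)
    have = parse_version(installed)
    return built.major == have.major and have.minor >= built.minor
```

Under this assumed rule, `is_compatible("3.0", "3.1")` is true, while an application built against a newer minor version than the installed one, or against a different major version, is rejected.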

CPU-assisted InfiniBand GPU Direct Async

  • Introduction of CPU-assisted IBGDA as an intermediate mode between proxy-based networking and traditional IBGDA
  • Splitting of control-plane responsibilities between the GPU and the CPU
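
The split can be pictured with a toy model. The sketch below assumes one particular division of labor: the GPU side builds and enqueues work requests, and the CPU side rings the NIC doorbell for whatever has been posted. That division is an assumption made for illustration, and every name here (`WorkRequest`, `ToyNic`, `ToyCpuAssistedQueue`) is hypothetical. Nothing in it calls NVSHMEM or touches real hardware.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class WorkRequest:
    op: str        # e.g. "put" or "get"
    dest_pe: int   # target processing element
    nbytes: int    # payload size in bytes


@dataclass
class ToyNic:
    """Stand-in for a NIC: counts doorbell rings and records submitted requests."""
    doorbell_rings: int = 0
    sent: List[WorkRequest] = field(default_factory=list)

    def ring_doorbell(self, batch: List[WorkRequest]) -> None:
        self.doorbell_rings += 1
        self.sent.extend(batch)


class ToyCpuAssistedQueue:
    """Toy model: the 'GPU' posts work requests, the 'CPU' rings the doorbell."""

    def __init__(self, nic: ToyNic) -> None:
        self.nic = nic
        self.pending: Deque[WorkRequest] = deque()

    def gpu_post(self, wr: WorkRequest) -> None:
        # GPU side (assumed role): generate the work request and enqueue it.
        self.pending.append(wr)

    def cpu_progress(self) -> int:
        # CPU side (assumed role): one doorbell ring covers all posted requests.
        if not self.pending:
            return 0
        batch = list(self.pending)
        self.pending.clear()
        self.nic.ring_doorbell(batch)
        return len(batch)
```

In this model, posting several requests with `gpu_post` and then calling `cpu_progress` once submits them all with a single doorbell ring. A fully proxy-based design would instead have the CPU build the requests as well, and traditional IBGDA would have the GPU ring the doorbell itself.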

Performance Improvements and Bug Fixes

  • Various performance enhancements and bug fixes across different components and scenarios

Summary

NVIDIA NVSHMEM 3.0 introduces support for multi-node, multi-interconnect setups, host-device ABI backward compatibility, CPU-assisted InfiniBand GPU Direct Async, and an object-oriented programming framework for symmetric heap management. Together, these changes improve application portability and compatibility across new platforms and provide more efficient, scalable communication for NVIDIA GPU clusters.