Enhancing Application Portability and Compatibility across New Platforms Using NVIDIA Magnum IO NVSHMEM 3.0

Table of Contents
- New Features and Interface Support in NVSHMEM 3.0
- CPU-assisted InfiniBand GPU Direct Async
- Performance Improvements and Bug Fixes
- Summary
New Features and Interface Support in NVSHMEM 3.0
- Multi-node, multi-interconnect support
- Host-device ABI backward compatibility
- CPU-assisted InfiniBand GPU Direct Async (IBGDA)
- Platform support for multiple racks of NVIDIA GB200 NVL72 systems connected through RDMA networks
- Enhanced capabilities for NVLink communication within the same NVLink clique spanning multiple nodes
- Backward compatibility across NVSHMEM minor versions
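The minor-version compatibility item above can be illustrated with a small check. This is only a sketch: the source says that compatibility holds across NVSHMEM minor versions but does not spell out the exact rule, so the rule used here (same major version, installed minor version at least as new as the one the application was linked against) is an assumption. The names `Version` and `is_abi_compatible` are hypothetical and are not part of the NVSHMEM API.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical (major, minor) pair; not an NVSHMEM type."""
    major: int
    minor: int


def is_abi_compatible(linked: Version, installed: Version) -> bool:
    """Return True if a binary linked against `linked` may run on `installed`.

    Assumed rule (not confirmed by the source): the major versions must
    match, and the installed minor version must be the same or newer.
    """
    return linked.major == installed.major and installed.minor >= linked.minor


if __name__ == "__main__":
    app = Version(3, 0)  # version the application was linked against
    for lib in (Version(3, 0), Version(3, 1), Version(2, 11), Version(4, 0)):
        print(f"linked {app.major}.{app.minor}, "
              f"installed {lib.major}.{lib.minor}: "
              f"{is_abi_compatible(app, lib)}")
```

Under this assumed rule, an application built against 3.0 would keep working after the installed library moves to a later 3.x release, while a change of major version would still require a rebuild.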
CPU-assisted InfiniBand GPU Direct Async
- Introduction of CPU-assisted IBGDA as an intermediate mode between proxy-based networking and traditional IBGDA
- Control-plane responsibilities split between the GPU and the CPU
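The split described above can be pictured as a producer and a consumer. The toy model below is a sketch, not NVSHMEM code: it assumes the GPU side builds work requests and the CPU side rings the NIC doorbell for each submitted request, a division the outline itself does not detail. A Python thread stands in for the CPU handler, a queue stands in for the submission path, and `WorkRequest`, `gpu_post_requests`, `cpu_doorbell_handler`, and `run_model` are hypothetical names.

```python
import queue
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkRequest:
    """Hypothetical stand-in for a work request built on the GPU."""
    wr_id: int
    nbytes: int


def gpu_post_requests(work_queue: "queue.Queue[Optional[WorkRequest]]",
                      num_requests: int) -> None:
    """GPU role (assumed): generate work requests and submit them."""
    for i in range(num_requests):
        work_queue.put(WorkRequest(wr_id=i, nbytes=4096))
    work_queue.put(None)  # sentinel: no more requests


def cpu_doorbell_handler(work_queue: "queue.Queue[Optional[WorkRequest]]",
                         rung: List[int]) -> None:
    """CPU role (assumed): ring the doorbell for each submitted request."""
    while True:
        wr = work_queue.get()
        if wr is None:
            break
        rung.append(wr.wr_id)  # stand-in for a NIC doorbell write


def run_model(num_requests: int = 4) -> List[int]:
    """Run the producer and consumer, return IDs in doorbell order."""
    work_queue: "queue.Queue[Optional[WorkRequest]]" = queue.Queue()
    rung: List[int] = []
    cpu = threading.Thread(target=cpu_doorbell_handler,
                           args=(work_queue, rung))
    cpu.start()
    gpu_post_requests(work_queue, num_requests)
    cpu.join()
    return rung


if __name__ == "__main__":
    print(run_model(4))
```

The model only captures the division of labor. It says nothing about latency or throughput, and real hardware queues, memory ordering, and doorbell registers are deliberately left out.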
Performance Improvements and Bug Fixes
- Various performance enhancements and bug fixes across different components and scenarios
Summary
NVIDIA NVSHMEM 3.0 introduces support for multi-node, multi-interconnect setups, host-device ABI backward compatibility, CPU-assisted InfiniBand GPU Direct Async, and an object-oriented programming framework for symmetric heap management. Together, these features are intended to improve application portability and compatibility across new platforms and to provide more efficient, scalable communication for NVIDIA GPU clusters.