NVIDIA Introduces NVSHMEM 3.0 with Enhanced GPU Communication Features

Jessie A Ellis Sep 07, 2024 08:39

NVIDIA’s NVSHMEM 3.0 adds multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async (IBGDA) to improve GPU-to-GPU communication.

NVIDIA has announced the release of NVSHMEM 3.0, the latest version of its parallel programming interface designed to facilitate efficient and scalable communication for NVIDIA GPU clusters. This update, part of NVIDIA Magnum IO and based on OpenSHMEM, aims to enhance application portability and compatibility across various platforms, according to the NVIDIA Technical Blog.

New Features and Interface Support

NVSHMEM 3.0 introduces several new features, including multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async (IBGDA).

Multi-Node, Multi-Interconnect Support

The new version supports connectivity between multiple GPUs within a node over P2P interconnects, such as NVIDIA NVLink/PCIe, and across nodes using RDMA interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE). This enhancement includes platform support for multiple racks of NVIDIA GB200 NVL72 systems connected through RDMA networks.
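The split described above can be pictured with a small Python sketch. This is not NVSHMEM code and does not reflect how the library selects transports internally; the `GpuLocation` type and `transport_class` function are hypothetical names introduced only to illustrate the node-level rule stated in the article (same node means P2P, different nodes means RDMA). Real topologies can be more nuanced than this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLocation:
    """Hypothetical description of where a GPU sits in the cluster."""
    node: str        # host name of the node that contains the GPU
    gpu_index: int   # index of the GPU inside that node


def transport_class(src: GpuLocation, dst: GpuLocation) -> str:
    """Return the interconnect class the article associates with a GPU pair.

    Simplified assumption: only node boundaries matter.
    """
    if src.node == dst.node:
        # Same node: peer-to-peer interconnects such as NVLink or PCIe.
        return "P2P (NVLink/PCIe)"
    # Different nodes: RDMA interconnects such as InfiniBand or RoCE.
    return "RDMA (InfiniBand/RoCE)"


if __name__ == "__main__":
    a = GpuLocation(node="node0", gpu_index=0)
    b = GpuLocation(node="node0", gpu_index=1)
    c = GpuLocation(node="node1", gpu_index=0)
    print(transport_class(a, b))
    print(transport_class(a, c))
```

In this toy model, two GPUs on `node0` map to the P2P class, while a pair spanning `node0` and `node1` maps to the RDMA class.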

Host-Device ABI Backward Compatibility

NVSHMEM 3.0 introduces ABI backward compatibility across minor versions: an application linked against an older minor version of NVSHMEM can run on a system where a newer minor version is installed. This makes updates smoother and reduces the need to recompile applications for each new release.
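As a rough illustration of that compatibility rule, the Python sketch below checks whether an application would need a rebuild. It is not part of NVSHMEM; the function names are hypothetical, the version strings in the example are made up, and the rule encoded here (same major version, installed minor version at least as new as the linked one) is an assumption drawn from the description above rather than from the library's documentation.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '3.0' or '3.1.7'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_run_without_rebuild(linked: str, installed: str) -> bool:
    """Assumed rule: same major version, installed minor >= linked minor."""
    linked_major, linked_minor = parse_version(linked)
    installed_major, installed_minor = parse_version(installed)
    return installed_major == linked_major and installed_minor >= linked_minor


if __name__ == "__main__":
    # Illustrative version numbers only.
    print(can_run_without_rebuild("3.0", "3.2"))  # older app, newer library
    print(can_run_without_rebuild("3.2", "3.0"))  # newer app, older library
```

Under this assumption, an application linked against 3.0 runs on a system with a later 3.x release, while the reverse direction, or a change of major version, would still call for a rebuild.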

CPU-Assisted InfiniBand GPU Direct Async

The latest release also supports CPU-assisted IBGDA, which divides control plane responsibilities between the GPU and CPU. This approach helps improve IBGDA adoption on non-coherent platforms and relaxes administrative-level configuration constraints in large-scale clusters.

Non-Interface Support and Minor Enhancements

NVSHMEM 3.0 also includes minor enhancements and changes that do not affect the programming interface, such as:

Object-Oriented Programming Framework for Symmetric Heap

This version introduces an object-oriented programming (OOP) framework to manage different kinds of symmetric heaps, including static and dynamic device memory. The OOP framework simplifies the extension to advanced features and improves data encapsulation.

Performance Improvements and Bug Fixes

NVSHMEM 3.0 brings performance improvements and bug fixes in several areas, including IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operations (AMOs), and team management.

Summary

The release of NVSHMEM 3.0 marks a significant upgrade to NVIDIA’s parallel programming interface. Its key features, multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted IBGDA, aim to enhance GPU communication and application portability. With ABI backward compatibility, administrators can move to newer minor versions of NVSHMEM without forcing developers to recompile existing applications, which eases upgrades in large-scale GPU clusters.
