Building Scalable and Fault-Tolerant NCCL Applications

The NVIDIA Collective Communications Library (NCCL) provides communication APIs for low-latency and high-bandwidth collectives, enabling AI workloads to scale…

The NVIDIA Collective Communications Library (NCCL) provides communication APIs for low-latency and high-bandwidth collectives, enabling AI workloads to scale from just a few GPUs on a single host to thousands of GPUs in a data center. This post discusses NCCL features that support run-time rescaling for cost optimization, as well as minimizing service downtime from faults by dynamically removing…

Source

Building Scalable and Fault-Tolerant NCCL Applications

About

Leave a Reply Cancel reply

Building Scalable and Fault-Tolerant NCCL Applications

Leave a Reply Cancel reply

Related Posts