Efficiently training large deep learning models requires scaling training across a number of GPUs. When training at scale, synchronizing parameters across GPUs introduces significant overheads. To improve synchronization performance, recently available hardware like NVIDIA-DGX1 introduce support for high bandwidth NVLinks across GPUs and software libraries like NCCL implement collective communication primitives like broadcast, all-reduce. However NCCL uses ring-based protocols which do not always use all the available links. To achieve better link utilization, we propose Blink, a family of protocols that use a broadcast-based data transfer mechanism. We describe an AllReduce protocol for the DGX-1 machine and present initial benchmark results that show that Blink can achieve a 2x speedup when compared to NCCL 2.
Published On: February 15, 2018
Presented At/In: SysML 2018