Efficiently training large deep learning models requires scaling training across many GPUs. At scale, synchronizing the model across GPUs introduces significant overhead. To tackle this problem, researchers from Nvidia, Facebook, Uber, and Google borrowed the idea of collective communication from the HPC domain to develop fast model synchronization schemes (e.g., NCCL from Nvidia, Horovod from Uber, Gloo from Facebook). However, these schemes are still far from optimal. To achieve near-optimal model synchronization performance, we propose Blink, a fast and generic collective communication library for distributed machine learning. Blink generalizes across topology heterogeneity, link heterogeneity (e.g., PCIe/NVLink/NVSwitch, InfiniBand/Ethernet), and hardware heterogeneity (e.g., CPU, GPU). Compared with the state-of-the-art scheme (NCCL 2.4.2, released in Jul. 2019), Blink can achieve a 2-8x speedup for model synchronization in distributed ML.
- Tencent (Chinese) https://cloud.tencent.com/developer/news/463793
- Sina (Chinese) https://finance.sina.cn/stock/relnews/us/2019-11-23/detail-iihnzhfz1172846.d.html
- Baidu (Chinese) https://baijiahao.baidu.com/s?id=1650980525510615651
- NVIDIA GTC China 2019 https://on-demand.gputechconf.com/gtc-cn/2019/pdf/CN9161/presentation.pdf
- MLSys 2020 (March 3, 2020)
- ByteDance (April 1, 2020)