Model parallelism is a standard paradigm for decoupling a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single floating-point number per iteration. However, traditional fault-tolerance schemes, when applied to class parallelism, require storing the entire model on the hard disk. Thus, these schemes are not suitable for soft and frequent system noise such as stragglers (temporarily slow worker machines). In this paper, we propose an erasure-coding-based redundant computing technique called robust class parallelism to improve the error resilience of model parallelism. We show that by introducing slight overhead in the computation at each machine, we can obtain robustness to soft system noise…
Authors: Yaoqing Yang, Jichan Chung, Guanhua Wang, Vipul Gupta, Adarsh Karnati, Kenan Jiang, Ion Stoica, Joseph Gonzalez, Kannan Ramchandran
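The abstract above describes the coding idea only at a high level; the following is a minimal, self-contained sketch of one way erasure-coded redundancy can be layered on class-parallel inference, assuming a simple sum-parity code over per-worker score vectors. The worker count, class split, and the specific code are illustrative assumptions, not the paper's actual scheme.

```python
# Minimal sketch (not the paper's exact scheme): erasure-coded redundancy over
# class-parallel workers, using one sum-parity worker so that any single
# straggler's scores can be recovered. All names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

NUM_WORKERS = 4          # each worker scores a disjoint group of classes
CLASSES_PER_WORKER = 10  # hypothetical class split

# Stand-in for each worker's sub-net output on one input image:
# a vector of class scores for its own class group.
worker_scores = [rng.normal(size=CLASSES_PER_WORKER) for _ in range(NUM_WORKERS)]

# Parity worker: computes the elementwise sum of all groups' scores
# (a systematic (K+1, K) erasure code over the score vectors).
parity = np.sum(worker_scores, axis=0)

# Suppose worker 2 is a straggler and its result never arrives in time.
straggler = 2
received = {i: s for i, s in enumerate(worker_scores) if i != straggler}

# Decode: the missing group's scores are the parity minus the received ones.
recovered = parity - sum(received.values())
assert np.allclose(recovered, worker_scores[straggler])

# Assemble the full score vector and predict as usual.
full_scores = np.concatenate(
    [received[i] if i != straggler else recovered for i in range(NUM_WORKERS)]
)
print("predicted class:", int(np.argmax(full_scores)))
```

The point of the sketch is the trade-off the abstract describes: a small amount of redundant computation (here, one extra parity worker) buys tolerance to a slow or missing worker without writing any state to disk.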
Convolutional Neural Networks (ConvNets) enable computers to excel at vision learning tasks such as image classification and object detection. Recently, fast inference on live data has become increasingly important. From a systems perspective, this means faster inference on each single incoming data item (e.g., one image). The two mainstream distributed model-serving methods – data parallelism and model parallelism – are not desirable here: data parallelism cannot further split a single input item, and model parallelism introduces huge communication overhead. To achieve low-latency inference on live data, we propose sensAI, a novel and generic approach that decouples a CNN model into disconnected subnets, each responsible for predicting certain class(es). We call this new model distribution class parallelism…
Authors: Guanhua Wang, Zhuang Liu, Siyuan Zhuang, Brandon Hsieh, Joseph Gonzalez, Ion Stoica
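As a companion to the abstract, here is a minimal sketch of class-parallel inference under assumed details: each class gets its own tiny one-vs-all subnet, every subnet scores the same image independently (conceptually on its own device), and each contributes a single confidence value that is combined by argmax. The subnet architecture, class count, and input size are placeholders, not sensAI's actual pruned models.

```python
# Minimal sketch (assumptions, not sensAI's code): class-parallel inference,
# where each per-class subnet scores the same image and returns one
# confidence value; the final label is the argmax over subnets.
import torch
import torch.nn as nn

NUM_CLASSES = 10  # hypothetical

class OneVsAllSubnet(nn.Module):
    """Tiny stand-in for a pruned, class-specific subnet."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.head = nn.Linear(8, 1)  # one confidence score for "its" class

    def forward(self, x):
        z = self.features(x).flatten(1)
        return self.head(z).squeeze(1)  # shape: (batch,)

# One subnet per class; in a real deployment each would live on its own GPU.
subnets = [OneVsAllSubnet() for _ in range(NUM_CLASSES)]

image = torch.randn(1, 3, 32, 32)  # a single live input

with torch.no_grad():
    # Each subnet contributes exactly one float per image.
    scores = torch.stack([net(image) for net in subnets], dim=1)

print("predicted class:", scores.argmax(dim=1).item())
```

Because the subnets are disconnected, the only values that need to be gathered per image are the per-class confidences, which is what keeps the serving-time communication minimal.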
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that, compared to the state of the art (NCCL), Blink can achieve up to 8x faster model synchronization and reduce end-to-end training time for image classification tasks by up to 40%.
Authors: Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica
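A rough illustration of the spanning-tree packing idea, under an assumed topology and assumed link rates rather than Blink's real primitives: the buffer is split across several spanning trees in proportion to each tree's bottleneck rate, and each chunk is broadcast along its own tree so the trees finish at roughly the same time.

```python
# Minimal sketch (illustrative only, not Blink's implementation): broadcast
# by "packing" several spanning trees, splitting the buffer across trees in
# proportion to each tree's bottleneck link rate so all trees finish together.
# The trees, rates, and 4-GPU topology below are made-up examples.

def split_by_rate(buffer, rates):
    """Partition `buffer` into contiguous chunks proportional to `rates`."""
    total = sum(rates)
    chunks, start = [], 0
    for i, r in enumerate(rates):
        end = len(buffer) if i == len(rates) - 1 else start + round(len(buffer) * r / total)
        chunks.append(buffer[start:end])
        start = end
    return chunks

def broadcast_over_tree(tree, root, chunk, recv):
    """Push `chunk` from `root` down a spanning tree given as parent -> children."""
    recv[root].extend(chunk)
    for child in tree.get(root, []):
        broadcast_over_tree(tree, child, chunk, recv)

# Two hypothetical spanning trees over 4 GPUs, rooted at GPU 0, with different
# bottleneck rates (e.g., one routed over NVLink, one over PCIe).
trees = [
    ({0: [1, 2], 2: [3]}, 3.0),      # faster tree
    ({0: [3], 3: [1], 1: [2]}, 1.0), # slower tree
]

data = list(range(100))  # the buffer to broadcast from GPU 0
chunks = split_by_rate(data, [rate for _, rate in trees])

received = {gpu: [] for gpu in range(4)}
for (tree, _), chunk in zip(trees, chunks):
    broadcast_over_tree(tree, 0, chunk, received)

# Every GPU ends up with the full buffer (chunk order may differ).
assert all(sorted(received[g]) == data for g in range(4))
print({g: len(received[g]) for g in received})
```

The sketch only captures the packing intuition: by sending different chunks over multiple trees at once, the broadcast can use links that a single ring or tree would leave idle.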