Model parallelism is a standard paradigm for decoupling a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single floating-point number per iteration. However, traditional fault-tolerance schemes, when applied to class parallelism, require storing the entire model on the hard disk. Thus, these schemes are not suitable for soft and frequent system noise such as stragglers (temporarily slow worker machines). In this paper, we propose an erasure-coding-based redundant computing technique called robust class parallelism to improve the error resilience of model parallelism. We show that by introducing a slight computation overhead at each machine, we can obtain robustness to soft system noise while maintaining the low communication overhead of class parallelism. More importantly, we show that on standard classification tasks, robust class parallelism maintains state-of-the-art performance.
Published On: November 1, 2020
Presented At/In: 54th Asilomar Conference on Signals, Systems and Computers (Asilomar 2020)
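
The idea of tolerating a straggler through erasure-coded redundant computation can be illustrated with a toy sketch. This is not the scheme from the paper: it assumes each worker is a purely linear scorer that sends back one floating-point number, and it adds a single hypothetical parity worker whose weights are the sum of the others. The function names (`add_parity_worker`, `worker_scores`, `recover_scores`) are invented for this illustration.

```python
import numpy as np

def add_parity_worker(weights):
    """Append one redundant worker whose weights are the sum of all others.

    Assumption for this sketch: workers are linear, so the parity worker's
    score equals the sum of the original workers' scores.
    """
    return np.vstack([weights, weights.sum(axis=0)])

def worker_scores(weights, x):
    """Each row of `weights` is one worker; each returns a single scalar."""
    return weights @ x

def recover_scores(received, n_classes):
    """Rebuild all class scores when at most one worker has not responded.

    `received` maps worker index -> scalar; index `n_classes` is the parity worker.
    """
    missing = [i for i in range(n_classes) if i not in received]
    if len(missing) > 1 or (missing and n_classes not in received):
        raise ValueError("a single parity worker tolerates only one straggler")
    scores = np.array([received.get(i, np.nan) for i in range(n_classes)])
    if missing:
        # Missing score = parity score minus the sum of the scores that arrived.
        scores[missing[0]] = received[n_classes] - np.nansum(scores)
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(4, 8))   # 4 class workers, 8 input features
    x = rng.normal(size=8)
    all_scores = worker_scores(add_parity_worker(weights), x)
    # Simulate worker 2 straggling: its scalar never arrives.
    received = {i: s for i, s in enumerate(all_scores) if i != 2}
    print(recover_scores(received, n_classes=4))
```

In this toy setting the extra cost is one additional worker and one additional scalar per input, while the prediction is unchanged when any single worker is slow. Real sub-networks are nonlinear, so a practical scheme needs more than summing weights.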