Dissertation Talk with Guanhua Wang: Disruptive Research on Distributed Machine Learning Systems; 11:00 AM, Friday, April 15
April 15, 2022
Title: Disruptive Research on Distributed Machine Learning Systems
Speaker: Guanhua Wang
Advisor: Ion Stoica
Date: Friday, April 15, 2022
Time: 11:00am – 12:00pm PT
Location (Zoom): https://berkeley.zoom.us/j/98835045978?pwd=ZXNrdW1oaDE5NXhjYTROS1ZvU3lvQT09
Abstract: Deep Neural Networks (DNNs) enable computers to excel across many different applications, such as image classification, speech recognition, and robotics control. To accelerate DNN training and serving, parallel computing is widely adopted. However, system efficiency becomes a major bottleneck when scaling out.
In this talk, I will make three arguments towards better system efficiency in distributed DNN training and serving.
First, Ring All-Reduce for model synchronization is not optimal, but Blink is. By packing spanning trees rather than forming rings, Blink achieves higher flexibility in arbitrary networking environments and provides near-optimal network throughput. Blink has been filed as a US patent and is in use at Microsoft. It has attracted significant industry attention, including from Facebook/Meta (the PyTorch distributed team) and ByteDance (parent company of TikTok), and was also featured at Nvidia GTC China 2019 and in news coverage from Baidu and Tencent.
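To make the spanning-tree idea concrete, here is a toy illustration (not Blink's actual multi-tree packing protocol; all names and the topology are made up for this sketch): gradients are reduced up a spanning tree to the root, and the global sum is then broadcast back down, so every node finishes with the same value.

```python
# Toy tree-based all-reduce: reduce up to the root, broadcast back down.
# Illustrative only -- real Blink packs multiple spanning trees to use
# all available link bandwidth concurrently.

def tree_allreduce(values, children, root=0):
    """values: dict node -> local value; children: dict node -> child list."""
    def reduce_up(node):
        # Sum this node's value with the reduced values of its subtree.
        total = values[node]
        for c in children.get(node, []):
            total += reduce_up(c)
        return total

    total = reduce_up(root)

    def broadcast_down(node):
        # Push the global sum from the root to every node.
        values[node] = total
        for c in children.get(node, []):
            broadcast_down(c)

    broadcast_down(root)
    return values

# 4 GPUs arranged as a spanning tree: 0 -> {1, 2}, 1 -> {3}
vals = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
tree = {0: [1, 2], 1: [3]}
tree_allreduce(vals, tree)
print(vals)  # every node ends with the global sum 10.0
```

Unlike a ring, which fixes one communication pattern regardless of topology, a tree can be chosen (and several trees packed together) to match whatever links the hardware actually provides.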
Second, communication can be eliminated via sensAI’s class parallelism. sensAI decouples a multi-task model into disconnected subnets, each of which is responsible for the decision making of a single task. sensAI’s low-latency, real-time model serving has attracted several venture capital firms in the Bay Area.
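A minimal sketch of the class-parallelism idea (the function names and scorers here are hypothetical, not sensAI's API): a k-class model is decoupled into k one-vs-all subnets that can be evaluated in parallel with no communication between them, and the final prediction is the class whose subnet reports the highest confidence.

```python
# Hypothetical class-parallelism sketch: each subnet scores one class
# independently, so the subnets need no inter-device communication.

def class_parallel_predict(subnets, x):
    """subnets: dict class_label -> scoring function; returns predicted label."""
    # Each score is computed independently -- in a real deployment these
    # calls would run concurrently on separate devices.
    scores = {label: net(x) for label, net in subnets.items()}
    return max(scores, key=scores.get)

# Toy one-vs-all scorers for classes "cat" and "dog"
subnets = {
    "cat": lambda x: 1.0 if x < 0.5 else 0.0,
    "dog": lambda x: 0.0 if x < 0.5 else 1.0,
}
print(class_parallel_predict(subnets, 0.9))  # "dog"
```

Because each subnet is disconnected from the others, no gradient or activation synchronization is needed across devices, which is what removes the communication cost.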
Third, Wavelet is more efficient than gang scheduling. By intentionally adding task-launching latency, Wavelet interleaves the peak memory usage of different waves of training tasks on the accelerators, thereby improving both computation and on-device memory utilization. Multiple companies, including Facebook/Meta and Apple, have shown interest in the Wavelet project.
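The intuition can be shown with a toy memory model (illustrative numbers, not Wavelet's actual scheduler): a training task's per-step memory footprint rises through the forward pass and falls through the backward pass, so launching a second wave of tasks with a deliberate offset interleaves the peaks instead of stacking them.

```python
# Toy model of interleaved task launch: two waves of training tasks
# share an accelerator. The memory profile below is made up for the
# sketch -- footprint peaks mid-iteration (forward pass) and falls off
# (backward pass).

profile = [4, 8, 4, 2]  # per-sub-step memory (GB) of one task's iteration

def combined_peak(offset, steps=8):
    """Peak combined memory of two waves, the second launched `offset` sub-steps late."""
    peak = 0
    for t in range(steps):
        wave0 = profile[t % len(profile)]
        wave1 = profile[(t - offset) % len(profile)]
        peak = max(peak, wave0 + wave1)
    return peak

print(combined_peak(offset=0))  # gang-scheduled: peaks coincide -> 16 GB
print(combined_peak(offset=2))  # staggered launch: peaks interleave -> 10 GB
```

With the same two tasks, simply shifting the second wave's launch lowers the combined peak, which is the headroom Wavelet exploits to pack more work onto each accelerator.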