Disrupting Distributed ML feat. Guanhua Wang | Stanford MLSys Seminar Episode 25

  • Added on 25. 07. 2024
  • Episode 25 of the Stanford MLSys Seminar Series!
    Disruptive Research on Distributed ML Systems
    Speaker: Guanhua Wang
    Abstract:
    Deep Neural Networks (DNNs) enable computers to excel at many applications such as image classification, speech recognition, and robotics control. To accelerate DNN training and serving, parallel computing is widely adopted, but system efficiency becomes a major issue when scaling out. In this talk, I will make three arguments toward better system efficiency in distributed DNN training and serving (toy sketches of each idea follow below the description).
    First, Ring All-Reduce is not the optimal scheme for model synchronization, but Blink is. By packing spanning trees rather than forming rings, Blink adapts flexibly to arbitrary network topologies and achieves near-optimal network throughput. Blink is filed as a US patent and is used by Microsoft. It has drawn significant industry attention, including from Facebook (distributed PyTorch team) and ByteDance (parent company of the TikTok app), and was featured at NVIDIA GTC China 2019 and in coverage from Baidu and Tencent.
    Second, communication can be eliminated entirely via sensAI's class parallelism. sensAI decouples a multi-task model into disconnected subnets, each responsible for the decisions of a single task. Its low-latency, real-time model serving has attracted interest from several Bay Area venture capital firms.
    Third, Wavelet is more efficient than gang scheduling. By intentionally delaying task launches, Wavelet interleaves the peak memory usage of different waves of training tasks on the accelerators, improving both compute and on-device memory utilization. Multiple companies, including Facebook and Apple, have shown interest in the Wavelet project.
    Speaker bio:
    Guanhua Wang is a final-year CS PhD candidate in the RISELab at UC Berkeley, advised by Prof. Ion Stoica. His research lies primarily at the intersection of ML and systems, including fast collective communication schemes for model synchronization, efficient parallel model training, and real-time model serving.
    --
    0:00 Starting soon
    4:23 Presentation
    36:50 Discussion
    The Stanford MLSys Seminar is hosted by Dan Fu, Karan Goel, Fiodar Kazhamiaka, Piero Molino, Chris Ré, and Matei Zaharia.
    Twitter:
    / realdanfu
    / krandiash
    / w4nderlus7
    --
    Check out our website for the schedule: mlsys.stanford.edu
    Join our mailing list to get weekly updates: groups.google.com/forum/#!for...
    #machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford #berkeley
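    --
    A toy sketch of the spanning-tree primitive Blink packs: one reduce-then-broadcast over a single spanning tree. This is not Blink's implementation; the topology, node count, and function names are illustrative, and the real system overlaps many such trees to use every link.

    import numpy as np

    def tree_allreduce(chunks, parent):
        """All-reduce one gradient shard per node over a spanning tree.

        chunks : dict node -> np.ndarray (local gradient shard)
        parent : dict node -> parent node (the root maps to None)
        """
        # Derive children lists and find the root from the parent pointers.
        children = {n: [] for n in chunks}
        root = None
        for n, p in parent.items():
            if p is None:
                root = n
            else:
                children[p].append(n)

        # Reduce phase: each node folds its children's partial sums into its own.
        def reduce_up(n):
            total = chunks[n].copy()
            for c in children[n]:
                total += reduce_up(c)
            return total

        global_sum = reduce_up(root)

        # Broadcast phase: every node ends up holding the global sum.
        return {n: global_sum.copy() for n in chunks}

    # Four GPUs on a chain 0-1-2-3, expressed as a spanning tree rooted at 0.
    grads = {n: np.full(4, float(n)) for n in range(4)}
    parent = {0: None, 1: 0, 2: 1, 3: 2}
    print(tree_allreduce(grads, parent))  # every node holds [6. 6. 6. 6.]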
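    A toy sketch of the class-parallelism serving pattern sensAI enables: one small, disconnected subnet per class, each of which could sit on its own device with no inter-device communication; the prediction is just an argmax over per-class confidences. The logistic scorers below are placeholders, not sensAI's actual pruning procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    NUM_CLASSES, DIM = 3, 8

    # One independent binary scorer ("subnet") per class.
    subnets = [{"w": rng.normal(size=DIM), "b": 0.0} for _ in range(NUM_CLASSES)]

    def score_one_class(subnet, x):
        # Runs in isolation; in a real deployment this would live on its own GPU.
        return 1.0 / (1.0 + np.exp(-(subnet["w"] @ x + subnet["b"])))

    def predict(x):
        # No cross-subnet traffic: gather per-class confidences, then argmax.
        confidences = [score_one_class(s, x) for s in subnets]
        return int(np.argmax(confidences)), confidences

    print(predict(rng.normal(size=DIM)))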
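    A toy timeline for the Wavelet argument: gang scheduling launches every training task at the same step, so their memory peaks coincide, while staggering launches into waves interleaves the peaks and lowers the device-memory high-water mark. The memory profile and offsets are made-up numbers, not measurements from the project.

    # Per-step memory (GB) of one task: forward builds activations, backward frees them.
    PROFILE = [2, 6, 10, 6, 2]

    def peak_memory(launch_offsets, steps=12):
        """Highest total memory over a short timeline, each task repeating PROFILE."""
        peak = 0
        for t in range(steps):
            used = sum(PROFILE[(t - off) % len(PROFILE)]
                       for off in launch_offsets if t >= off)
            peak = max(peak, used)
        return peak

    print("gang scheduling:", peak_memory([0, 0]))  # peaks stack: 20 GB
    print("wavelet-style  :", peak_memory([0, 2]))  # peaks interleave: 12 GB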

Comments • 1

  • @tingsun5547 · 10 months ago

    Great talk. Thanks, Dr. Wang!