Hardware-aware Algorithms for Sequence Modeling - Tri Dao | Stanford MLSys #87

  • Added: 24 Jul 2024
  • Episode 87 of the Stanford MLSys Seminar Series!
    Hardware-aware Algorithms for Sequence Modeling
    Speaker: Tri Dao
    Abstract:
    Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length.
    In the first half, we describe attention approximation algorithms using sparsity and low-rank structures, as well as algorithms (e.g., FlashAttention) that achieve fast and memory-efficient exact attention. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory), one can speed up attention by 4-8x, enabling 4-16x longer context in Transformers and yielding higher-quality models (a minimal sketch of the tiling idea appears after this description). We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time.
    In the second half, we describe recent progress on subquadratic-time architectures such as RNNs, gated convolution, and structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture (Mamba) without attention or even MLP blocks. Mamba matches or exceeds the performance of strong modern Transformers on language modeling.
    Bio:
    Tri Dao is an incoming Assistant Professor at Princeton University and is currently chief scientist of Together AI. He completed his PhD in Computer Science at Stanford, co-advised by Christopher Ré and Stefano Ermon. He works at the intersection of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work has received the ICML 2022 Outstanding paper runner-up award.
    --
    Stanford MLSys Seminar hosts: Avanika Narayan, Benjamin Spector, Michael Zhang
    Twitter: @avanika15, @bfspector, @mzhangio
    --
    Check out our website for the schedule: mlsys.stanford.edu
    Join our mailing list to get weekly updates: groups.google.com/forum/#!for...
    #machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford
  • Science & Technology
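
The IO-aware exact attention mentioned in the abstract comes down to tiling: the softmax is computed block by block with a running max and denominator, so the full sequence-length-squared score matrix never has to be written to slow GPU memory. Below is a minimal NumPy sketch of that online-softmax tiling idea, for illustration only; it is not the FlashAttention CUDA kernel, and the block size and shapes are arbitrary choices for this sketch.

```python
# Minimal sketch of tiled (online-softmax) attention, the core idea behind
# IO-aware exact attention. Illustration only, not the FlashAttention kernel.
import numpy as np

def attention_reference(Q, K, V):
    """Standard attention: materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=32):
    """Processes key/value blocks one at a time, keeping only a running
    row max, softmax denominator, and output accumulator per query block."""
    d = Q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    for i in range(0, Q.shape[0], block):
        q = Q[i:i + block]
        m = np.full(q.shape[0], -np.inf)   # running row max
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros((q.shape[0], d))    # unnormalized output accumulator
        for j in range(0, K.shape[0], block):
            k, v = K[j:j + block], V[j:j + block]
            s = q @ k.T * scale
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            correction = np.exp(m - m_new)   # rescale old stats to the new max
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(128, 64)) for _ in range(3))
    assert np.allclose(attention_reference(Q, K, V), attention_tiled(Q, K, V))
    print("tiled online-softmax attention matches the reference")
```

In the actual kernel, the point of this structure is that each pair of tiles fits in fast on-chip SRAM while the running statistics are updated, which is what cuts the reads and writes to slower HBM that the abstract refers to.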

Comments • 5

  • @jjh5474
    @jjh5474 6 months ago +1

    Thank you for sharing this insightful video. In the introduction of Mamba, it says "parallelizable training". Can you explain how parallel training is possible in an autoregressive model?

    • @robertjflynn4206
      @robertjflynn4206 6 months ago

      Teacher forcing

    • @icriou
      @icriou 5 months ago

      Follow this video and you will get a hands-on understanding of why an AR model can be trained in parallel: czcams.com/video/kCc8FmEb1nY/video.html

    • @matthewnorton2315
      @matthewnorton2315 5 months ago

      I think you might be looking for the "selective scan" part of Mamba. In section 3.3.2 of the paper arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf, they say "To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023)". In short, they use a well-known parallel-algorithm trick to calculate a prefix sum. See en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithms and you'll notice the similarity. Hope this helps! (A minimal sketch of the scan idea is included at the end of the comments below.)

  • @ostrov11
    @ostrov11 20 days ago

    ... some revelations from an ML junior
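
On the parallel-training question in the thread above: the recurrence inside a (selective) state space layer has the form h[t] = a[t] * h[t-1] + b[t], i.e. a composition of affine maps, and composition of affine maps is associative, so every h[t] can be produced by an inclusive prefix scan rather than a sequential loop. Here is a minimal NumPy sketch of that idea; it is an illustration only, not the Mamba CUDA kernel, and the recursive scan below is depth-efficient rather than the work-efficient Blelloch variant the paper cites. All function names are made up for this sketch.

```python
# Sketch: a linear recurrence h[t] = a[t]*h[t-1] + b[t] computed two ways,
# sequentially and via an associative prefix scan over (a, b) pairs.
import numpy as np

def sequential_recurrence(a, b):
    """Reference: plain left-to-right recurrence (what happens at inference)."""
    h, prev = np.zeros_like(b), 0.0
    for t in range(len(b)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def combine(x, y):
    """Associative combine: applying map x then map y is one affine map."""
    a1, b1 = x
    a2, b2 = y
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(items):
    """Inclusive scan by recursive pairwise combination (O(log T) depth).
    Each level's combines are independent, so they could run in parallel."""
    n = len(items)
    if n == 1:
        return items
    # combine adjacent pairs, scan the half-length list, then patch the gaps
    halved = [combine(items[i], items[i + 1]) for i in range(0, n - n % 2, 2)]
    scanned = parallel_scan(halved)
    out = []
    for i in range(n):
        if i == 0:
            out.append(items[0])
        elif i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], items[i]))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 16
    a = rng.uniform(0.5, 1.0, size=T)   # input-dependent gates (the "selective" part)
    b = rng.normal(size=T)              # inputs
    h_scan = np.array([ofs for _, ofs in parallel_scan(list(zip(a, b)))])
    assert np.allclose(sequential_recurrence(a, b), h_scan)
    print("sequential recurrence and parallel scan agree")
```

This also connects to the "teacher forcing" reply: during training every a[t] and b[t] is known from the full input sequence up front, so the whole scan can be evaluated in parallel, even though generation at inference time still steps through the recurrence one token at a time.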