PostLN, PreLN and ResiDual Transformers

  • Added 13. 09. 2024
  • PostLN Transformers suffer from unbalanced gradients, leading to unstable training due to vanishing or exploding gradients. Using a learning-rate warm-up stage is considered a practical solution, but it also introduces additional hyperparameters to tune, making Transformer training more difficult.
    In this video, we will look at some alternatives to the PostLN Transformer, including the PreLN Transformer and ResiDual, a Transformer with dual residual connections (a minimal code sketch contrasting these layouts follows the references below).
    References:
    1. "On Layer Normalization in the Transformer Architecture", Xiong et al., (2020)
    2. "Understanding the Difficulty of Training Transformers", Liu et al., (2020)
    3. "ResiDual: Transformer with Dual Residual
    Connections", Xie et al., (2023)
    4. "Learning Deep Transformer Models for Machine Translation", Wang et al., (2019)

Comments • 2

  • @buh357 · 4 months ago

    Thank you for covering all these details, I am a big fan of the channel.

    • @PyMLstudio · 4 months ago · +1

      Thanks for your comment, I am glad you like the channel 👍🏻