PostLN, PreLN and ResiDual Transformers
- added 13. 09. 2024
- PostLN Transformers suffer from unbalanced gradients across layers, which can make training unstable through vanishing or exploding gradients. A learning-rate warmup stage is considered a practical remedy, but it introduces additional hyperparameters to tune, making Transformer training more difficult.
In this video, we will look at some alternatives to the PostLN Transformer, including the PreLN Transformer and ResiDual, a Transformer with dual residual connections.
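The three variants differ only in where layer normalization sits relative to the residual connection. Below is a minimal numpy sketch of that ordering; the function names and the simplified ResiDual update (two streams, one normalized as in PostLN, one left raw as in PreLN, per my reading of Xie et al.) are illustrative assumptions, not code from any of the referenced papers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # PostLN (original Transformer): normalize AFTER adding the residual.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # PreLN: normalize the sublayer input; the residual path stays unnormalized,
    # which keeps an identity path for gradients through the whole stack.
    return x + sublayer(layer_norm(x))

def residual_dual_block(x_post, x_pre, sublayer):
    # ResiDual (simplified sketch): maintain two residual streams.
    y = sublayer(x_post)              # sublayer reads the normalized stream
    x_post = layer_norm(x_post + y)   # PostLN-style normalized stream
    x_pre = x_pre + y                 # PreLN-style raw residual stream
    return x_post, x_pre
```

A single sublayer here can be any function of the hidden states (attention or feed-forward); with a stack of such blocks, the PreLN/raw stream preserves the identity residual path while the PostLN stream keeps the output distribution normalized.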
References:
1. "On Layer Normalization in the Transformer Architecture", Xiong et al., (2020)
2. "Understanding the Difficulty of Training Transformers", Liu et al., (2020)
3. "ResiDual: Transformer with Dual Residual
Connections", Xie et al., (2023)
4. "Learning Deep Transformer Models for Machine Translation", Wang et al., (2019)
Thank you for covering all these details, I am a big fan of the channel.
Thanks for your comment, I am glad you like the channel 👍🏻