3Blue1Brown
Attention in transformers, visually explained | Chapter 6, Deep Learning
Demystifying attention, the key mechanism inside transformers and LLMs.
Instead of sponsored ad reads, these lessons are funded directly by viewers: 3b1b.co/support
Special thanks to these supporters: www.3blue1brown.com/lessons/attention#thanks
An equally valuable form of support is to simply share the videos.
Demystifying self-attention, multiple heads, and cross-attention.
The first pass for the translated subtitles here is machine-generated, and therefore notably imperfect. To contribute edits or fixes, visit translate.3blue1brown.com/
And yes, at 22:00 (and elsewhere), "breaks" is a typo.
------------------
Here are a few other relevant resources
Build a GPT from scratch, by Andrej Karpathy
czcams.com/video/kCc8FmEb1nY/video.html
If you want a conceptual understanding of language models from the ground up, @vcubingx just started a short series of videos on the topic:
czcams.com/video/1il-s4mgNdI/video.html?si=XaVxj6bsdy3VkgEX
If you're interested in the herculean task of interpreting what these large networks might actually be doing, the Transformer Circuits posts by Anthropic are great. In particular, it was only after reading one of these that I started thinking of the combination of the value and output matrices as being a combined low-rank map from the embedding space to itself, which, at least in my mind, made things much clearer than other sources.
transformer-circuits.pub/2021/framework/index.html
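For what it's worth, that low-rank view is easy to play with in code; below is a minimal NumPy sketch (dimensions are illustrative, not those of any real model):

```python
import numpy as np

d_embed, d_head = 64, 8                        # illustrative sizes
rng = np.random.default_rng(0)
W_value = rng.normal(size=(d_head, d_embed))   # "value down": embedding -> head space
W_output = rng.normal(size=(d_embed, d_head))  # "value up": head space -> embedding

# Their composition is a single map from the embedding space to itself,
# whose rank can be at most d_head: a low-rank map.
W_ov = W_output @ W_value
print(W_ov.shape)                   # (64, 64)
print(np.linalg.matrix_rank(W_ov))  # 8
```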
Site with exercises related to ML programming and GPTs
www.gptandchill.ai/codingproblems
History of language models by Brit Cruise, @ArtOfTheProblem
czcams.com/video/OFS90-FX6pg/video.html
An early paper on how directions in embedding spaces have meaning:
arxiv.org/pdf/1301.3781.pdf
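To make "directions have meaning" concrete, here is a sketch of the classic analogy arithmetic, assuming gensim and its downloadable "word2vec-google-news-300" vectors are available:

```python
# Classic word2vec analogy: the king -> queen direction roughly
# matches the man -> woman direction in the embedding space.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # large download on first use
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to land near ('queen', ...)
```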
------------------
Timestamps:
0:00 - Recap on embeddings
1:39 - Motivating examples
4:29 - The attention pattern
11:08 - Masking
12:42 - Context size
13:10 - Values
15:44 - Counting parameters
18:21 - Cross-attention
19:19 - Multiple heads
22:16 - The output matrix
23:19 - Going deeper
24:54 - Ending
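If you'd rather see the attention-pattern, masking, and values steps from the timestamps above in code form, here is a minimal single-head NumPy sketch (names and sizes are illustrative, not the video's notation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k, W_v):
    """E: (seq_len, d_embed) token embeddings; one head, causal masking."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])     # the attention pattern (4:29)
    mask = np.triu(np.ones_like(scores), k=1)  # masking (11:08): no peeking ahead
    scores = np.where(mask == 1, -np.inf, scores)
    pattern = softmax(scores, axis=-1)         # each row sums to 1
    return pattern @ V                         # weighted sum of values (13:10)

rng = np.random.default_rng(0)
seq_len, d_embed, d_head = 5, 16, 4
E = rng.normal(size=(seq_len, d_embed))
W_q, W_k, W_v = (rng.normal(size=(d_embed, d_head)) for _ in range(3))
print(self_attention(E, W_q, W_k, W_v).shape)  # (5, 4)
```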
------------------
These animations are largely made using a custom Python library, manim. See the FAQ comments here:
3b1b.co/faq#manim
github.com/3b1b/manim
github.com/ManimCommunity/manim/
All code for specific videos is visible here:
github.com/3b1b/videos/
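For the curious, a minimal scene in the community edition of manim looks something like this (render with `manim -pql scene.py SquareToCircle`; the scene is just an example, not from any particular video):

```python
from manim import *

class SquareToCircle(Scene):
    def construct(self):
        square = Square()                       # draw a square...
        self.play(Create(square))
        self.play(Transform(square, Circle()))  # ...then morph it into a circle
        self.wait()
```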
The music is by Vincent Rubinetti.
www.vincentrubinetti.com
vincerubinetti.bandcamp.com/album/the-music-of-3blue1brown
open.spotify.com/album/1dVyjwS8FBqXhRunaG5W5u
------------------
3blue1brown is a channel about animating math, in all senses of the word animate. If you're reading the bottom of a video description, I'm guessing you're more interested than the average viewer in the lessons here. It would mean a lot to me if you chose to stay up to date on new ones, either by subscribing here on YouTube or by following on whichever platform below you check most regularly.
Mailing list: 3blue1brown.substack.com
Twitter: 3blue1brown
Instagram: 3blue1brown
Reddit: www.reddit.com/r/3blue1brown
Facebook: 3blue1brown
Patreon: patreon.com/3blue1brown
Website: www.3blue1brown.com
Views: 840,279

Videos

But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
Views: 2.2M · 1 month ago
Unpacking how large language models work under the hood Early view of the next chapter for patrons: 3b1b.co/early-attention Special thanks to these supporters: 3b1b.co/lessons/gpt#thanks To contribute edits to the subtitles, visit translate.3blue1brown.com/ Other recommended resources on the topic. Richard Turner's introduction is one of the best starting places: arxiv.org/pdf/2304.10557.pdf Co...

4 questions about the refractive index | Optics puzzles 4
Views: 644K · 4 months ago

But why would light "slow down"? | Optics puzzles 3
Views: 1.2M · 5 months ago

25 Math explainers you may enjoy | SoME3 results
Views: 542K · 6 months ago

Explaining the barber pole effect from origins of light | Optics puzzles 2
Views: 674K · 8 months ago

Polarized light in sugar water | Optics puzzles 1
Views: 998K · 8 months ago

A pretty reason why Gaussian + Gaussian = Gaussian
Views: 751K · 9 months ago

This pattern breaks, but for a good reason | Moser's circle problem
Views: 1.9M · 10 months ago

How They Fool Ya (live) | Math parody of Hallelujah
Views: 940K · 10 months ago

Convolutions | Why X+Y in probability is a beautiful mess
Views: 625K · 10 months ago

Why π is in the normal distribution (beyond integral tricks)
Views: 1.5M · 1 year ago

But what is the Central Limit Theorem?
Views: 3.3M · 1 year ago

But what is a convolution?
Views: 2.5M · 1 year ago

Researchers thought this was a bug (Borwein integrals)
Views: 3.3M · 1 year ago

What makes a great math explanation? | SoME2 results
Views: 735K · 1 year ago

How to lie using visual proofs
Views: 3.1M · 1 year ago

Olympiad level counting (Generating functions)
Views: 1.9M · 1 year ago

Oh, wait, actually the best Wordle opener is not “crane”…
Views: 6M · 2 years ago

Solving Wordle using information theory
Views: 10M · 2 years ago

A tale of two problem solvers (Average cube shadows)
Views: 2.7M · 2 years ago

2021 Summer of Math Exposition results
Views: 776K · 2 years ago

Beyond the Mandelbrot set, an intro to holomorphic dynamics
Views: 1.4M · 2 years ago

From Newton’s method to Newton’s fractal (which Newton knew nothing about)
Views: 2.8M · 2 years ago

The Summer of Math Exposition
Views: 721K · 2 years ago

A quick trick for computing eigenvalues | Chapter 15, Essence of linear algebra
Views: 978K · 2 years ago

How (and why) to raise e to the power of a matrix | DE6
Views: 2.7M · 3 years ago

The medical test paradox, and redesigning Bayes' rule
Views: 1.2M · 3 years ago

Hamming codes part 2: The one-line implementation
Views: 834K · 3 years ago

But what are Hamming codes? The origin of error correction
Views: 2.3M · 3 years ago

Comments

  • @chezlizzle · 2 hours ago

    Highly recommend for any classical mechanics enthusiasts. Great video.

  • @NemripNGC · 3 hours ago

    7130th comment

  • @randomadvice2487 · 3 hours ago

    Grant is the Satoshi of AI, but not... he's present.

  • @moviechilltime123 · 3 hours ago

    Sorry if I'm being ignorant, but what exactly are the "charges"? I have a learning disability, so sometimes I miss things even after watching several times.

  • @boruut2909 · 3 hours ago

    1, 2, 4, 8, 16 is a number sequence typically used in IQ tests. I wonder what the correct extrapolated next number is. It can actually be anything.

  • @raidtheferry · 4 hours ago

    Hey 3b1b, what sort of interactive math software do you use to create these amazing animations? They're awesome! You've been one of my favorite YT channels for years now, and I've always wondered how it's done, because I can't imagine you or someone else is doing them all by hand in the Adobe suite... thx.

  • @hWat-Ever · 4 hours ago

    In base 2, π is 11.001; your 16 kg weight is 10000 and there are 1100 bounces, and your 64 kg weight is 1000000 and there are 11001 bounces. In base 4, π is 3.02; your 16 kg weight is 100 and there are 30 bounces. In base 8, π is 3.11; your 64 kg weight is 100 and there are 31 bounces.
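
Those bounce counts are easy to check by simulating the collisions directly; here is a minimal Python sketch of the 1-D elastic blocks-and-wall setup (masses and units are illustrative):

```python
def count_collisions(m_small, m_big):
    """Big block slides left into the small one, which sits next to a wall;
    count every block-block and block-wall collision (all perfectly elastic)."""
    v_small, v_big = 0.0, -1.0
    count = 0
    while True:
        if v_big < v_small:   # blocks collide: 1-D elastic collision formulas
            total = m_small + m_big
            v_small, v_big = (
                ((m_small - m_big) * v_small + 2 * m_big * v_big) / total,
                ((m_big - m_small) * v_big + 2 * m_small * v_small) / total,
            )
            count += 1
        elif v_small < 0:     # small block bounces off the wall
            v_small = -v_small
            count += 1
        else:                 # both moving right, big block at least as fast: done
            return count

print(count_collisions(1, 16))  # 12 -> 1100 in base 2, 30 in base 4
print(count_collisions(1, 64))  # 25 -> 11001 in base 2, 31 in base 8
```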

  • @arnaldoleon1 · 5 hours ago

    I got nothing done at work today, as I spent it all watching your videos.

  • @chromosundrift · 5 hours ago

    Huge long-term fan, but this series is my favourite.

  • @arnaldoleon1 · 6 hours ago

    This is absolutely brilliant. Thank you so much.

  • @paramrajsingh1539 · 6 hours ago

    e and π have a cameo almost everywhere.

  • @HAL-qu2ix · 6 hours ago

    Thank you for explaining this better than anyone else has been able to. I think I finally get it. I really appreciate your content 🙌🏻

  • @alextsun7314 · 6 hours ago

    I don't usually comment on videos, but this is one of the best videos I've seen on transformers: extremely detailed but very easy to understand!

  • @AlejandroVales · 6 hours ago

    This is actually similar to how some IQ tests work... just trying to see how used you are to creating association patterns out of the data they put out, like: finger is to hand what leaf is to … (twig / tree / forest)

  • @damianzieba5133 · 7 hours ago

    That's... just insane.

  • @falion609 · 7 hours ago

    I REMEMBER THIS

  • @jeremyhansen9197 · 8 hours ago

    If discrete means probability and continuous means probability density, what are we to say about the possibility of a probability density being Gaussian?

  • @omgdorkness · 8 hours ago

    I need you to softmax my logits, baby.

  • @jercki72 · 8 hours ago

    Ahaa, now for sure people have found the full video link button.

  • @siddharthannandhakumar6187 · 8 hours ago

    I think it still holds true even when three lines meet at a point, given that the area formed by them is 0.

  • @andresmunchgallardo1383 · 8 hours ago

    “You, the 3-d lander” makes me feel like he's a 4-d entity teaching me his version of toddler math.

  • @excaliburhead · 9 hours ago

    I still don’t get it 🤷‍♂️

  • @gregorymathews1998 · 9 hours ago

    Love this channel, but the flash-bang at the end blew my pupils out.

  • @tizmemc · 9 hours ago

    I was mentally and physically abused by my father as a child, and I'm currently living in financially decrepit conditions that will probably continue for the foreseeable future, and I still can't believe that out of everything that has ever happened to me, this is what I bet my life on and lose.

  • @HarpaAI · 9 hours ago

    🎯 Key Takeaways for quick navigation:

    00:00 🔍 Understanding the Attention Mechanism in Transformers
    - Introduction to the attention mechanism and its significance in large language models.
    - Overview of the goal of transformer models to predict the next word in a piece of text.
    - Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.

    02:11 🧠 Contextual Meaning Refinement in Transformers
    - Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.
    - Examples showcasing the updating of word embeddings based on context.
    - Importance of attention blocks in enriching word embeddings with contextual information.

    05:37 ⚙️ Matrix Operations and Weighted Sums in Attention
    - Explanation of matrix-vector products and tunable weights in matrix operations.
    - Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.
    - Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.

    21:31 🧠 Multi-Headed Attention Mechanism in Transformers
    - Explanation of how each attention head has distinct value matrices for producing value vectors.
    - Introduction to the process of summing proposed changes from different heads to refine embeddings at each position.
    - Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.

    22:34 🛠️ Technical Details in Implementing Value Matrices
    - Description of how, in practice, the per-head value matrices are implemented together with a single output matrix.
    - Clarification of technical nuances in how value matrices are structured in practice.
    - Note on the distinction between value-down and value-up matrices as commonly seen in papers and implementations.

    24:03 💡 Embedding Nuances and Capacity for Higher-Level Encoding
    - Discussion of how embeddings become more nuanced as data flows through multiple attention blocks and layers.
    - Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors.
    - Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.

    Made with HARPA AI

  • @SergeyYudintsev · 9 hours ago

    It’s insane. After watching a video with Numberphile, I actually did the exact same thing: proved that it doesn’t work for 3, moved on to 4, and stopped at the coloring, because I’m dumb and couldn’t figure out the coloring.

  • @ondrejbrichnac1813 · 10 hours ago

    Why does it hurt my balls when it does the "🦆" sound?

  • @AquaTeenHungerForce_4_Life · 10 hours ago

    I’m amazed that a man in the 1800s understood this and was able to explain it all before computers and quantum mechanics. I also get a kick out of the naysayers like Kelvin. 😊

  • @piotrmazgaj · 10 hours ago

    <*_> This is my seal. I have watched the entire video, understood it, and I can explain it in my own words, thus I have gained knowledge. This is my seal. <_*>

  • @DrDec0 · 10 hours ago

    Ever calculated against eternity? Do you know what a circle is by other means, and what a perfect circle defines? And do you know what π was made to calculate? Now you know what number of collisions you will get as you increase the mass of the right object toward eternity. 😘 And after you know it, eat more apples 😘

  • @thomasschodt7691 · 10 hours ago

    The last point on the perimeter needs to divide the segment not at the halfway point but offset, say at 1/3 and 2/3, creating a small figure at the centre of the circle - voila, 32...

  • @user-wo6qn3vf9n · 10 hours ago

    The Fouriel transformer is different from normal transformers: instead of inline and adjacent cores and coils, it is a 4-dimensional transformer consisting of 4 cores/coils at 390 degrees to each other. This is much more economical than standard transformers, as there is a lot less heat wasted, since the electrical and magnetic waves don't interfere with each other while still inducing into each other's cores. They are mainly used in locomotive traction motors, where producing less heat reduces back-EMF; this was not a problem with weak-fielding equipment and DC motors. With modern high-voltage AC motors the heat factor is important, so that as much power as possible can be driven for maximum speed.

  • @alexjaybrady · 11 hours ago

    Linguistic thermodynamics??

  • @user-ce1nq3mo6j · 11 hours ago

    Fine

  • @VandanaTripathi-hn2ix · 11 hours ago

    If the triangle were isosceles, D and P would have coincided.

  • @niktrip · 11 hours ago

    I have a fear of looking at fractals being zoomed in; I still study them, but I can't watch them.

  • @kurchak · 11 hours ago

    @57:18 Well, I am great at using compasses but not great at math. I guess ya win some, ya lose some, lol.

  • @Will-fj9gy · 12 hours ago

    This is terrifying.

  • @Stanley-Wallice · 12 hours ago

    What is this supposed to be? Edgy or something? Are you having a stroke?

  • @maxwvm7345 · 12 hours ago

    I love this series. I did a lot of malicious prompt trial and error, but by learning more about the mathematics behind it, I get to understand how some things might work.

  • @GoosebumpsOrg · 12 hours ago

    9:36 - 3Violet 1Brown!!

  • @ujjwalyadav6189 · 12 hours ago

    I was just stuck on this topic, and not even my college professors were explaining it nicely; then I found you. You deserve a salute, sir 🙇🙇

  • @nicezombie8054 · 13 hours ago

    The fourth level is explaining it to someone else, as that's always one of the hardest things to do and the best indication that you truly know the concept.

  • @anaghpandey8805 · 13 hours ago

    You're a GOD

  • @virus404ripoff · 13 hours ago

    eeeeeeeee..

  • @Kate-R20 · 13 hours ago

    Puts it on its side 😊

  • @seth5119 · 13 hours ago

    So π just pops up in the most unlikely of places... sensational.

  • @andreizelenco4164 · 13 hours ago

    Because of parallel processing on GPUs, you can convolve, say, a 1K image with a 3×3 kernel by offsetting the image one pixel to the NW and multiplying it by the first element of the kernel, then offsetting to the N and multiplying the image by the second element of the kernel, and so on in all 8 directions plus the center. Then you just add all 9 images. That is also very fast, and it works because of parallel processing.
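
That shift-and-add trick can be spelled out in a few lines; here is a minimal NumPy sketch (this computes the cross-correlation form; flip the kernel for a true convolution):

```python
import numpy as np

def conv3x3_shift_add(image, kernel):
    """3x3 'convolution' via nine shifted copies of the image:
    shift by one pixel in each of the 8 directions plus the center,
    scale each copy by one kernel element, then add them all up."""
    h, w = image.shape
    padded = np.pad(image, 1)          # zero padding at the borders
    out = np.zeros((h, w), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
box_blur = np.ones((3, 3)) / 9.0       # simple averaging kernel
print(conv3x3_shift_add(img, box_blur))
```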