3Blue1Brown
United States
Joined March 3, 2015
My name is Grant Sanderson. Videos here cover a variety of topics in math, or adjacent fields like physics and CS, all with an emphasis on visualizing the core ideas. The goal is to use animation to help elucidate and motivate otherwise tricky topics, and for difficult problems to be made simple with changes in perspective.
For more information, other projects, FAQs, and inquiries see the website: www.3blue1brown.com
Attention in transformers, visually explained | Chapter 6, Deep Learning
Demystifying attention, the key mechanism inside transformers and LLMs.
Instead of sponsored ad reads, these lessons are funded directly by viewers: 3b1b.co/support
Special thanks to these supporters: www.3blue1brown.com/lessons/attention#thanks
An equally valuable form of support is to simply share the videos.
Demystifying self-attention, multiple heads, and cross-attention.
The first pass for the translated subtitles here is machine-generated, and therefore notably imperfect. To contribute edits or fixes, visit translate.3blue1brown.com/
And yes, at 22:00 (and elsewhere), "breaks" is a typo.
------------------
Here are a few other relevant resources
Build a GPT from scratch, by Andrej Karpathy
czcams.com/video/kCc8FmEb1nY/video.html
If you want a conceptual understanding of language models from the ground up, @vcubingx just started a short series of videos on the topic:
czcams.com/video/1il-s4mgNdI/video.html?si=XaVxj6bsdy3VkgEX
If you're interested in the herculean task of interpreting what these large networks might actually be doing, the Transformer Circuits posts by Anthropic are great. In particular, it was only after reading one of these that I started thinking of the combination of the value and output matrices as being a combined low-rank map from the embedding space to itself, which, at least in my mind, made things much clearer than other sources.
transformer-circuits.pub/2021/framework/index.html
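The "combined low-rank map" view mentioned above is easy to see numerically: composing a value-down and a value-up matrix gives one map from embedding space to itself whose rank is capped by the bottleneck dimension. A minimal NumPy sketch, with toy dimensions chosen for illustration (real models are far larger):

```python
import numpy as np

# Toy sizes for illustration; real embedding/head dimensions are much larger.
d_embed, d_head = 64, 8

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_head, d_embed))  # "value-down": embedding -> head space
W_up = rng.standard_normal((d_embed, d_head))    # "value-up": head space -> embedding

# Composed, the two act as a single linear map from embedding space to itself...
W_combined = W_up @ W_down

# ...whose rank cannot exceed the bottleneck dimension d_head,
# since rank(AB) <= min(rank(A), rank(B)).
print(W_combined.shape)                      # (64, 64)
print(np.linalg.matrix_rank(W_combined))     # at most 8
```

This is why papers often describe the pair as a single low-rank (factored) matrix rather than two separate ones.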
Site with exercises related to ML programming and GPTs
www.gptandchill.ai/codingproblems
History of language models by Brit Cruise, @ArtOfTheProblem
czcams.com/video/OFS90-FX6pg/video.html
An early paper on how directions in embedding spaces have meaning:
arxiv.org/pdf/1301.3781.pdf
------------------
Timestamps:
0:00 - Recap on embeddings
1:39 - Motivating examples
4:29 - The attention pattern
11:08 - Masking
12:42 - Context size
13:10 - Values
15:44 - Counting parameters
18:21 - Cross-attention
19:19 - Multiple heads
22:16 - The output matrix
23:19 - Going deeper
24:54 - Ending
------------------
These animations are largely made using a custom Python library, manim. See the FAQ comments here:
3b1b.co/faq#manim
github.com/3b1b/manim
github.com/ManimCommunity/manim/
All code for specific videos is visible here:
github.com/3b1b/videos/
The music is by Vincent Rubinetti.
www.vincentrubinetti.com
vincerubinetti.bandcamp.com/album/the-music-of-3blue1brown
open.spotify.com/album/1dVyjwS8FBqXhRunaG5W5u
------------------
3blue1brown is a channel about animating math, in all senses of the word animate. If you're reading the bottom of a video description, I'm guessing you're more interested than the average viewer in lessons here. It would mean a lot to me if you chose to stay up to date on new ones, either by subscribing here on YouTube or otherwise following on whichever platform below you check most regularly.
Mailing list: 3blue1brown.substack.com
Twitter: 3blue1brown
Instagram: 3blue1brown
Reddit: www.reddit.com/r/3blue1brown
Facebook: 3blue1brown
Patreon: patreon.com/3blue1brown
Website: www.3blue1brown.com
Views: 840,279
Videos
But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
2.2M views, 1 month ago
Unpacking how large language models work under the hood. Early view of the next chapter for patrons: 3b1b.co/early-attention Special thanks to these supporters: 3b1b.co/lessons/gpt#thanks To contribute edits to the subtitles, visit translate.3blue1brown.com/ Other recommended resources on the topic. Richard Turner's introduction is one of the best starting places: arxiv.org/pdf/2304.10557.pdf Co...
4 questions about the refractive index | Optics puzzles 4
644K views, 4 months ago
But why would light "slow down"? | Optics puzzles 3
1.2M views, 5 months ago
25 Math explainers you may enjoy | SoME3 results
542K views, 6 months ago
Explaining the barber pole effect from origins of light | Optics puzzles 2
674K views, 8 months ago
Polarized light in sugar water | Optics puzzles 1
998K views, 8 months ago
A pretty reason why Gaussian + Gaussian = Gaussian
751K views, 9 months ago
This pattern breaks, but for a good reason | Moser's circle problem
1.9M views, 10 months ago
How They Fool Ya (live) | Math parody of Hallelujah
940K views, 10 months ago
Convolutions | Why X+Y in probability is a beautiful mess
625K views, 10 months ago
Why π is in the normal distribution (beyond integral tricks)
1.5M views, 1 year ago
But what is the Central Limit Theorem?
3.3M views, 1 year ago
Researchers thought this was a bug (Borwein integrals)
3.3M views, 1 year ago
What makes a great math explanation? | SoME2 results
735K views, 1 year ago
Olympiad level counting (Generating functions)
1.9M views, 1 year ago
Oh, wait, actually the best Wordle opener is not “crane”…
6M views, 2 years ago
Solving Wordle using information theory
10M views, 2 years ago
A tale of two problem solvers (Average cube shadows)
2.7M views, 2 years ago
2021 Summer of Math Exposition results
776K views, 2 years ago
Beyond the Mandelbrot set, an intro to holomorphic dynamics
1.4M views, 2 years ago
From Newton’s method to Newton’s fractal (which Newton knew nothing about)
2.8M views, 2 years ago
A quick trick for computing eigenvalues | Chapter 15, Essence of linear algebra
978K views, 2 years ago
How (and why) to raise e to the power of a matrix | DE6
2.7M views, 3 years ago
The medical test paradox, and redesigning Bayes' rule
1.2M views, 3 years ago
Hamming codes part 2: The one-line implementation
834K views, 3 years ago
But what are Hamming codes? The origin of error correction
2.3M views, 3 years ago
Highly recommend for any classical mechanic enthusiasts. Great video.
7130th comment
Grant is the Satoshi of AI, but not...he's present.
Sorry if I'm being ignorant, but what exactly are the "charges"? I have a learning disability, so sometimes I miss things even after watching several times.
1, 2, 4, 8, 16 is a number sequence typically used by IQ tests. I wonder what is the correct extrapolated next number. It can actually be anything.
Hey 3b1b, what sort of math interactive software do you use to create these amazing animations? They're awesome! You've been one of my favorite YT channels for years now and I've always wondered how it's done because I can't imagine you or someone else is doing them all by hand in the adobe suite... thx.
In base 2, π is 11.001; your 16 kg weight is 1000, and there are 1100 bounces; your 64 kg weight is 100000, and there are 11001 bounces.
In base 4, π is 3.02; your 16 kg weight is 10, and there are 30 bounces.
In base 8, π is 3.11; your 64 kg weight is 100, and there are 31 bounces.
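The base expansions of π quoted in this comment can be checked with a short base-conversion helper. A quick sketch (the function name and the choice to truncate rather than round are my own):

```python
import math

def to_base(x, base, frac_digits):
    """Write positive x in the given base, truncated to frac_digits
    digits after the point (valid for bases up to 10)."""
    ip, digits = int(x), []
    while ip:
        digits.append(str(ip % base))
        ip //= base
    int_part = "".join(reversed(digits)) or "0"
    frac, fx = [], x - int(x)
    for _ in range(frac_digits):
        fx *= base              # shift one base-b digit past the point
        frac.append(str(int(fx)))
        fx -= int(fx)
    return int_part + "." + "".join(frac)

print(to_base(math.pi, 2, 3))  # 11.001
print(to_base(math.pi, 4, 2))  # 3.02
print(to_base(math.pi, 8, 2))  # 3.11
```

All three expansions quoted in the comment check out.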
I got nothing done at work today as I spent it all day watching your videos.
Huge long term fan but this series is my favourite.
This is absolutely brilliant. Thank you so much
e and π have a cameo almost everywhere
Thank you for explaining this better than anyone else has been able to. I think I finally get it. I really appreciate your content 🙌🏻
I don't usually comment on videos, but this is one of the best videos I've seen on transformers, extremely detailed but very easy to understand!
This is actually similar to how some IQ tests work... Just trying to see how used you are to creating association patterns out of data they put out... like Finger is to hand, what leaf is to … Twig Tree Forest
That's... just insane
I REMEMBER THIS
If discrete means probability and continuous means probability density, what are we to say about the possibility of a probability density being Gaussian?
I need you to softmax my logits, baby.
ahaa now for sure people found the full video link button
I think it still holds true even when three lines meet at a point with the fact that the area formed by them is 0.
“you, the 3-d lander” makes me feel like hes a 4d entity teaching me his version of toddler math
I still don’t get it 🤷♂️
Love this channel but the flash bang at the end blew my pupils out
I was mentally and physically abused by my father as a child, and I'm currently living in financial decrepitude, which will probably continue for the foreseeable future. And I still can't believe that, out of everything that has ever happened to me, this is what I bet my life on and lose.
🎯 Key Takeaways for quick navigation:
00:00 🔍 Understanding the Attention Mechanism in Transformers
- Introduction to the attention mechanism and its significance in large language models.
- Overview of the goal of transformer models to predict the next word in a piece of text.
- Explanation of breaking text into tokens, associating tokens with vectors, and the use of high-dimensional embeddings to encode semantic meaning.
02:11 🧠 Contextual meaning refinement in Transformers
- Illustration of how attention mechanisms refine embeddings to encode rich contextual meaning.
- Examples showcasing the updating of word embeddings based on context.
- Importance of attention blocks in enriching word embeddings with contextual information.
05:37 ⚙️ Matrix operations and weighted sum in Attention
- Explanation of matrix-vector products and tunable weights in matrix operations.
- Introduction to the concept of masked attention for preventing later tokens from influencing earlier ones.
- Overview of attention patterns, softmax computations, and relevance weighting in attention mechanisms.
21:31 🧠 Multi-Headed Attention Mechanism in Transformers
- Explanation of how each attention head has distinct value matrices for producing value vectors.
- Introduction to the process of summing proposed changes from different heads to refine embeddings in each position.
- Importance of running multiple heads in parallel to capture diverse contextual meanings efficiently.
22:34 🛠️ Technical Details in Implementing Value Matrices
- Description of the implementation difference in the value matrices as a single output matrix.
- Clarification regarding technical nuances in how value matrices are structured in practice.
- Noting the distinction between value-down and value-up matrices commonly seen in papers and implementations.
24:03 💡 Embedding Nuances and Capacity for Higher-Level Encoding
- Discussion on how embeddings become more nuanced as data flows through multiple transformers and layers.
- Exploration of the capacity of transformers to encode complex concepts beyond surface-level descriptors.
- Overview of the network parameters associated with attention heads and the total parameters devoted to the entire transformer model.
Made with HARPA AI
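The steps these takeaways summarize — an attention pattern from queries and keys, softmax with masking so later tokens cannot influence earlier ones, then a weighted sum of value vectors — fit in a short NumPy sketch. Dimensions and names here are illustrative assumptions, not the video's exact notation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(E, W_Q, W_K, W_V):
    """One attention head over embeddings E of shape (tokens, d_embed)."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (tokens, tokens)
    # Masking: entries above the diagonal become -inf, so softmax
    # gives them zero weight and later tokens can't affect earlier ones.
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    pattern = softmax(scores, axis=-1)            # the attention pattern
    return pattern @ V                            # weighted sum of value vectors

rng = np.random.default_rng(1)
n_tokens, d_embed, d_head = 5, 16, 4
E = rng.standard_normal((n_tokens, d_embed))
W_Q, W_K, W_V = (rng.standard_normal((d_embed, d_head)) for _ in range(3))
out = masked_self_attention(E, W_Q, W_K, W_V)
print(out.shape)  # (5, 4)
```

A quick sanity check of the masking: perturbing the last token's embedding changes the last output row but leaves the first row untouched.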
It’s insane. After watching a video with Numberphile, I actually did the exact same thing: proved that it doesn’t work for 3, moved on to 4, and stopped at coloring, because I’m dumb and couldn’t figure out the coloring.
Why does it hurt my balls when it do the ,,🦆" sound
I’m amazed that a man in the 1800s understood this and was able to explain this all before computers and quantum mechanics. I also get a kick out of the naysayers like Kelvin. 😊
<*_> This is my seal. I have watched the entire video, understood it, and I can explain it in my own words, thus I have gained knowledge. This is my seal. <_*>
Ever calculated against eternity? You know what a circle is in other means and what a perfect circle defines? And you know what Pi was made to calculate? Now you know what number of collisions you will get the more you increase the mass of the right object to near eternity. 😘 And after you know it, eat more apples 😘
The last point on the perimeter needs to divide the segment not at the halfway point, but offset, say 1/3 and 2/3, creating a small figure at the centre of the circle - voila 32...
The Fouriel transformer is different to normal transformers as instead of inline and adjacent cores and coils it is a 4 dimensional transformer consisting of 4 cores/coils at 390 degs to each other. This is much more economical than standard transformers as there is a lot less waste in heat as the electrical and magnetic waves don't interfere with each other while still inducing into each others cores. They are mainly used in Locomotive traction motors where the less heat produced reduces back emf, this was not a problem with weak fielding equipment and DC motors. With modern high voltage AC motors the heat factor is important so as much power can be driven for maximum speed.
<*_> This is my seal. I have watched the entire video, understood it, and I can explain it in my own words, thus I have gained knowledge. This is my seal. <_*>
Linquistic thermodynamics??
Fine
If the triangle was isosceles D and P would have coincided
I have a fear of looking at fractals being zoomed in and I still study them, but I can't watch them
@57:18 well I am great at using compasses but not great at math. I guess ya win some ya lose some lol.
This is terrifying
what is this supposed to be? edgy or something? are you having a stroke?
I love this series. I did a lot of malicious prompt trial and error, but learning more about the mathematics behind it, I get to understand how some things might work.
9:36 - 3Violet 1Brown !!
I was just stuck with this topic not even my college professors were explaining it nicely then I found you. You owe a salute sir 🙇🙇
The fourth level is explaining it to someone else, as that's always one of the hardest things to do and the best indication you truly know the concept.
You're a GOD
eeeeeeeee..
Puts it on its side 😊
So pi just pops up in the most unlikely of places......sensational
Because of parallel processing and GPUs you can convolve say a 1k image with a 3*3 kernel by offsetting the image one pixel to NW and multiply it by the first element in the kernel, then offset to N and multiply the image by the second element in the kernel an so on in all 8 directions + the center. Then you just add all the 9 images. That is also very fast and it works because of parallel processing.
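The shift-and-add trick this comment describes can be written out in NumPy: nine shifted copies of the image, each scaled by one kernel element, summed. A sketch under two simplifying assumptions of mine — wraparound (circular) boundaries, and the correlation convention, where the kernel is applied unflipped — verified against an explicit per-pixel loop:

```python
import numpy as np

def conv_by_shifts(img, kernel3x3):
    """3x3 filtering via 9 shifted copies of the image (the GPU-friendly
    trick described above), with wraparound boundaries for brevity."""
    out = np.zeros(img.shape)
    for di in (-1, 0, 1):          # row offset: N / center / S
        for dj in (-1, 0, 1):      # column offset: W / center / E
            # np.roll(img, (-di, -dj))[i, j] == img[(i+di) % h, (j+dj) % w]
            out += kernel3x3[di + 1, dj + 1] * np.roll(img, (-di, -dj), axis=(0, 1))
    return out

def conv_direct(img, kernel3x3):
    """Reference implementation: explicit loops over every pixel."""
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    out[i, j] += kernel3x3[di + 1, dj + 1] * img[(i + di) % h, (j + dj) % w]
    return out

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16  # Gaussian-blur kernel
print(np.allclose(conv_by_shifts(img, kernel), conv_direct(img, kernel)))  # True
```

The shift-based version does 9 whole-array multiply-adds instead of per-pixel loops, which is exactly why it maps well onto parallel hardware.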