Andrej Karpathy
Let's build the GPT Tokenizer
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
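For reference, a minimal sketch of a byte-level BPE training loop in the spirit of the lecture (the helper names get_stats/merge and the tiny vocab size are illustrative, not necessarily the exact minbpe API):

    from collections import Counter

    def get_stats(ids):
        # count how often each consecutive pair of token ids occurs
        return Counter(zip(ids, ids[1:]))

    def merge(ids, pair, new_id):
        # replace every occurrence of `pair` in `ids` with the single token `new_id`
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    ids = list("aaabdaaabac".encode("utf-8"))   # start from raw UTF-8 bytes (tokens 0..255)
    merges = {}                                 # (id, id) -> new token id
    for new_id in range(256, 259):              # 3 merges, just for illustration
        stats = get_stats(ids)
        pair = max(stats, key=stats.get)        # most frequent consecutive pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id

After training, the merges table is what encode() replays on new text, and the vocab derived from it (token id -> bytes) is what decode() uses to map tokens back to a string.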
Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
Exercises:
- Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions if you're getting stuck are in the minbpe code github.com/karpathy/minbpe/blob/master/exercise.md
Links:
- Google colab for the video: colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
- GitHub repo for the video: minBPE github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: czcams.com/video/VMj-3S1tku0/video.html
- our Discord channel: discord.gg/3zy8kqD9Cp
- my Twitter: karpathy
Supplementary links:
- tiktokenizer tiktokenizer.vercel.app
- tiktoken from OpenAI: github.com/openai/tiktoken
- sentencepiece from Google github.com/google/sentencepiece
485,679 views

Video

[1hr Talk] Intro to Large Language Models
1.9M views · 6 months ago
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm. As of November 2023 (this field moves fast!). Context: This video is based...
Let's build GPT: from scratch, in code, spelled out.
4.3M views · a year ago
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!) . I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framewo...
Building makemore Part 5: Building a WaveNet
158K views · a year ago
We take the 2-layer MLP from previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to the WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn and what it is ...
Building makemore Part 4: Becoming a Backprop Ninja
173K views · a year ago
We take the 2-layer MLP (with BatchNorm) from the previous video and backpropagate through it manually without using PyTorch autograd's loss.backward(): through the cross entropy loss, 2nd linear layer, tanh, batchnorm, 1st linear layer, and the embedding table. Along the way, we get a strong intuitive understanding about how gradients flow backwards through the compute graph and on the level o...
Building makemore Part 3: Activations & Gradients, BatchNorm
249K views · a year ago
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and ...
Building makemore Part 2: MLP
279K views · a year ago
We implement a multilayer perceptron (MLP) character-level language model. In this video we also introduce many basics of machine learning (e.g. model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.). Links: - makemore on github: github.com/karpathy/makemore - jupyter notebook I built in this video: github.com/karpathy/nn-zero-to-hero/...
The spelled-out intro to language modeling: building makemore
596K views · a year ago
We implement a bigram character-level language model, which we will further complexify in followup videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling that includes model training, sampling, and the evaluation ...
Stable diffusion dreams of psychedelic faces
33K views · a year ago
Prompt: "psychedelic faces" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced by one A100 GPU taking about 10 tabs and dreaming about the prompt overnight (~8 hours). While I slept and dreamt about other things. Music: Stars ...
Stable diffusion dreams of steampunk brains
25K views · a year ago
Prompt: "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced ...
Stable diffusion dreams of tattoos
65K views · a year ago
Dreams of tattoos. (There are a few discrete jumps in the video because I had to erase portions that got just a little 🌶️, believe I got most of it) Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
The spelled-out intro to neural networks and backpropagation: building micrograd
1.6M views · a year ago
This is the most step-by-step spelled-out explanation of backpropagation and training of neural networks. It only assumes basic knowledge of Python and a vague recollection of calculus from high school. Links: - micrograd on github: github.com/karpathy/micrograd - jupyter notebooks I built in this video: github.com/karpathy/nn-zero-to-hero/tree/master/lectures/micrograd - my website: karpathy.a...
Stable diffusion dreams of "blueberry spaghetti" for one night
48K views · a year ago
Prompt: "blueberry spaghetti" Stable diffusion takes a noise vector as input and samples an image. To create this video I simply smoothly interpolate between randomly chosen noise vectors and render frames along the way. Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
Stable diffusion dreams of steam punk neural networks
37K views · a year ago
A stable diffusion dream. The prompt was "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" the new and improved v2 version of this video is now here: czcams.com/video/2oKjtvYslMY/video.html generated with this hacky script: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af55335...

Comments

  • @felixx2012
    @felixx2012 · 6 hours ago

    Thanks for the great video. In the Tokenizer class you define self.vocab using the self._build_vocab() function, but then self.vocab is overwritten when you run self.train(). Why do you initialize self.vocab (for the 256 raw bytes and the special tokens) if you are going to just overwrite it?

  • @TheOtroManolo
    @TheOtroManolo · 7 hours ago

    Around the 1:30:00 mark, I think I missed why some saturation (around 5%) is better than no saturation at all. Didn't saturation impede further training? Perhaps he just meant that 5% is low enough, and that it's the best we can do if we want to keep the deeper activations from converging to zero?

  • @rubenvicente4677
    @rubenvicente4677 · 8 hours ago

    I arrived at dh just by figuring it out from the size of the matrix, and then I continued with your video and you just did all the derivatives and I thought... I am so dumb, I should have done that, but then you say "now I tell you a secret I normally do..." 49:45... hahahahahhaha

  • @AshwinJoshi-kc5ti
    @AshwinJoshi-kc5ti · 13 hours ago

    @AndrejKarpathy referring to the 52nd minute of the video: in order to conclude that the bigrams are learning, the likelihood of each should be greater than 1/(27.0*27.0) and not 1/27.0 as mentioned in the video. Thoughts?

  • @LivingLifeFully-fn4xm
    @LivingLifeFully-fn4xm · 13 hours ago

    Dude is a legend. Big respect for this!

  • @sue_green
    @sue_green · 20 hours ago

    Thank you so much for the great learning materials you create and share, this is precious. I've recently also come across a highly visual explanation of the attention mechanism by 3Blue1Brown (czcams.com/video/eMlx5fFNoYc/video.htmlsi=G7PPnlbmx379YWjp) and I liked the intuition we can have behind the Values (timestamp: czcams.com/video/eMlx5fFNoYc/video.htmlt=788). As far as I understood, we can basically think of the Value as some vector we can add to a word (~token) so that we get a more refined, detailed meaning of the word. For example, if we have a "fluffy creature" in a sentence, then at first we have an embedding for "creature", and we then "pay attention" to what came before and end up with richer information. That is, the Value shows how the embedding of "creature" should be modified to become an embedding of "fluffy creature".
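    A toy single-head attention sketch (not from the video; shapes and names are illustrative) that makes this concrete: each output row is a weighted mix of value vectors, which the residual connection then adds back onto the original token embedding:

        import torch
        torch.manual_seed(0)
        T, d = 2, 8                                      # two tokens: "fluffy", "creature"
        x = torch.randn(T, d)                            # their input embeddings
        Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values
        att = torch.softmax(q @ k.T / d**0.5, dim=-1)    # attention weights (no causal mask here)
        out = att @ v   # each row mixes value vectors; added back to x by the residual
                        # connection, it nudges "creature" toward "fluffy creature"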

  • @ashutoshdongare5370
    @ashutoshdongare5370 · 23 hours ago

    Grandma Jailbreak still works on ChatGPT !!!

  • @sk8ism
    @sk8ism · 1 day ago

    I consistently watch this vid, love it!

  • @cuigthallann4091
    @cuigthallann4091 · 1 day ago

    I transcribed the code from the screen and it ran OK, but now I get "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn" on the call to loss.backward(). What does this mean? Is it a memory problem on my PC?

  • @simonvutov7575
    @simonvutov7575 · 2 days ago

    Great video!!

  • @tough_year
    @tough_year · 2 days ago

    wow. I am amazed by how intuitive this is.

  • @fraimy5204
    @fraimy5204 · 2 days ago

    1:26:12

  • @gauravruhela7393
    @gauravruhela7393 · 2 days ago

    That napalm jailbreak no longer works on newer models like GPT-4o. I tried seeking help from ChatGPT 😃. However, the base64 trick did work!!!

  • @williamzhao3885
    @williamzhao3885 · 2 days ago

    big fan of Andrej. Please Please keep making these videos. They are sooooo good!

  • @inriinriinriinriinri

    Btw thanks for the napalm recipe

  • @weekipi5813
    @weekipi5813 · 3 days ago

    1:19:31 Honestly you don't even need topological ordering. I literally implemented a recursive approach where I call backprop on the output node; it first sets the gradients of its children and then cycles through those children and backpropagates on them recursively.

  • @uhoffmann29
    @uhoffmann29 · 3 days ago

    Awesome video ... well done, well explained ... must see.

  • @logo-droid
    @logo-droid · 3 days ago

    Great to have a tutorial on that! I also found a free tokenizer on poe, so I don't even have to do it on my own :)

  • @PeterGodek2
    @PeterGodek2 · 3 days ago

    Yes, the aggregation is data dependent, but the linear transforms that create the queries, keys and values are the same for all nodes (so we need multiple attention heads, because a single head will not capture much).

  • @siyuanguo5128
    @siyuanguo5128 · 3 days ago

    This is actually really helpful

  • @chineduezeofor2481
    @chineduezeofor2481 · 3 days ago

    Awesome tutorial. Thank you Andrej!

  • @srikanthgr1
    @srikanthgr1 · 4 days ago

    Thanks Andrej, this is a great video for beginners.

  • @andreamorim6635
    @andreamorim6635 · 4 days ago

    Can someone explain to me why dbnmeani doesn't need keepdim=True? The size of dbnmeani is different from bnmeani if we don't put keepdim=True.

  • @qixu2190
    @qixu2190 · 4 days ago

    Thanks for sharing free tutorials

  • @RyanAI-kk1kv
    @RyanAI-kk1kv · 4 days ago

    Hello Andrej. Your explanations are exceptional and the way you teach is very unique and it helped me learn a lot. I am very grateful to you for your lectures. I only request you to make a video about the mixture of experts architecture, how it works, and how it could be implemented in code. That would be amazing. Thank you.

  • @DavidBerglund
    @DavidBerglund · 5 days ago

    Where can I find models that have not been aligned or fine-tuned to act like chatbots? Models that are trained on large datasets of language (preferably multilingual) but only work as document completers? I'm sure those could be very useful in some contexts as well, right?

  • @hardiknahata4328
    @hardiknahata4328 · 5 days ago

    Historic. Gem of a video. Kudos to you Andrej!

  • @yizhongzhang224
    @yizhongzhang224 · 5 days ago

    Andrej, I just couldn't thank you more for putting all of these together! You showed the world the spirit of open source. I'd call you cyber-Prometheus

  • @mihaidanila5584
    @mihaidanila5584 · 6 days ago

    Is it fair to call the log counts "logits"? If the logit is the function that maps a probability p to ln p/(1 - p), and in this neural network we start with log counts and end up with probabilities, the operations we take to get there don't seem to "undo" the logit function: we exponentiate the log counts, which, if these were logits, would yield the p/(1 - p) from the logit function, but then I don't see an operation that goes from that to p. Surely it's not the normalization step? So are these log counts also intuitively logits?
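    A small numerical check (not from the video): softmax only depends on differences between the log-counts, and in the two-class case it reduces to the sigmoid, whose inverse is exactly the logit function, so the name is consistent at least in that relative sense:

        import torch
        logits = torch.tensor([2.0, -1.0, 0.5])            # "log-counts"
        p = logits.exp() / logits.exp().sum()              # exponentiate + normalize
        assert torch.allclose(p, torch.softmax(logits, dim=0))
        # two-class case: softmax is the sigmoid of the difference of the two entries
        z = torch.tensor([2.0, -1.0])
        p0 = torch.softmax(z, dim=0)[0]
        assert torch.allclose(p0, torch.sigmoid(z[0] - z[1]))
        assert torch.allclose(torch.log(p0 / (1 - p0)), z[0] - z[1])   # logit(p0) = z0 - z1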

  • @piotrmazgaj
    @piotrmazgaj · 6 days ago

    <*_> This is my seal. I have watched the entire video, understood it, and I can explain it in my own words, thus I have gained knowledge. This is my seal. <_*>

  • @user-pz8yd8dv7r
    @user-pz8yd8dv7r · 6 days ago

    Great stuff... thank you for sharing this strategy!! Similar to one other that you did a while ago. Can you tell me which drawing tools you use apart from the drawing of Pocket Option itself?? Thanks

  • @artempylypchuk4140
    @artempylypchuk4140 · 6 days ago

    I'd like to see those hypothetical 500 lines of C code and how fast it would run

  • @andreamorim6635
    @andreamorim6635 · 6 days ago

    Can someone explain why the ratio needs to be close to -3.0 specifically? Why -3.0 and not -1.0, for example?

  • @RakkSemilath
    @RakkSemilath · 6 days ago

    Great content! It made me understand, in a quite comprehensive practical manner, the theoretical basis of neural networks that I already knew mathematically. Along the way I also got a better practical understanding of classes!

  • @AtichonN.-iz3tq
    @AtichonN.-iz3tq · 6 days ago

    Just starting to learn about programming. Will come back to this video later maybe in the future once I understand more about AI.

  • @Orenoid_42
    @Orenoid_42 · 6 days ago

    Masterpiece.

  • @nickmills8476
    @nickmills8476 · 6 days ago

    Great overview. But creepy grandmother ;-)

  • @Stefan-AlexandruBot

    I'm doing my bachelor thesis on this topic. After weeks of spending 8-10 hours per day trying to get an overview of the topic, this was by far the most helpful video. Keep up the good work.

  • @rocketPower047
    @rocketPower047 · 7 days ago

    Very cool demo, I coded along until the end because I wanted to listen and understand. Really good follow up to the Coursera courses.

  • @Manishtiwari7
    @Manishtiwari7 · 7 days ago

    Great explanation. This feels like learning Dijkstra's algorithm through a Google Maps analogy.

  • @nithinma8697
    @nithinma8697 · 7 days ago

    God-level introduction to LLMs

  • @WisomofHal
    @WisomofHal · 7 days ago

    Give a man a fish, feed him for a day. Teach a man to fish, feed him for a lifetime. Thank you, my friend.

  • @Uuunets
    @Uuunets · 7 days ago

    "So this, Jane, is THE INTERNET!"

  • @ivan3584
    @ivan3584 · 7 days ago

    Incredible.

  • @azizshameem6241
    @azizshameem6241 · 7 days ago

    At 54:55, can we not just implement encode() by iterating over the merges dictionary (the order is maintained) and calling the merge() function on the tokens? This is what I mean:

        def encode(text):
            tokens = list(text.encode("utf-8"))
            for pair, idx in merges.items():
                tokens = merge(tokens, pair, idx)
            return tokens

    Seems about right to me...
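    For comparison, a minimal sketch of the greedy encode() variant shown in the lecture, which repeatedly merges whichever pair currently present in the sequence has the lowest (earliest) merge index; it assumes the illustrative get_stats/merge helpers from the BPE sketch near the top of this page, not necessarily the exact minbpe functions:

        def encode_greedy(text, merges):
            tokens = list(text.encode("utf-8"))
            while len(tokens) >= 2:
                stats = get_stats(tokens)
                # among the pairs actually present, pick the one merged earliest in training
                pair = min(stats, key=lambda p: merges.get(p, float("inf")))
                if pair not in merges:
                    break                                  # nothing mergeable is left
                tokens = merge(tokens, pair, merges[pair])
            return tokens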

  • @lol_chill_00
    @lol_chill_00 · 8 days ago

    Can someone explain the necessity of writing "with torch.no_grad()" while updating bmeans_running & bstd_running? Thanks in advance.

    • @andreamorim6635
      @andreamorim6635 · 6 days ago

      Efficiency reasons: bmeans_running & bstd_running don't need gradients because they are already updated layer after layer with momentum and eps.
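      A minimal sketch (variable names are illustrative, not necessarily the exact ones from the video) of the idea: the running statistics are bookkeeping buffers rather than trainable parameters, so torch.no_grad() keeps autograd from recording a graph for, or flowing gradients through, their updates:

          import torch
          hpreact = torch.randn(32, 200, requires_grad=True)   # pre-activations of one batch
          bnmeani = hpreact.mean(0, keepdim=True)              # batch mean (tracked by autograd)
          bnstdi = hpreact.std(0, keepdim=True)                # batch std  (tracked by autograd)
          bnmean_running = torch.zeros(1, 200)
          bnstd_running = torch.ones(1, 200)
          with torch.no_grad():
              # buffers, not parameters: no graph is recorded for these updates,
              # so no gradients will ever flow through them
              bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
              bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi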

  • @mananshah3248
    @mananshah3248 · 8 days ago

    Can someone explain to me why this is so hyped? I think it's just the differentiation that we learned in calculus, right?

  • @M10mn
    @M10mn · 8 days ago

    Thanks, everything works. Looking forward to new setups.

  • @vonziethenmusic
    @vonziethenmusic · 8 days ago

    So fascinating! And then you scale it up and it starts talking like real humans and knows everything! This is hilarious!! It's so cool that you just put these vids out here!!

  • @maestroscuro
    @maestroscuro · 8 days ago

    The value delivered in these videos is just insane! It’s people like Andrej that restore trust in humanity