Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

  • Published on Jun 28, 2024
  • In this video I will introduce and explain quantization: we will start with a short introduction to the numerical representation of integers and floating-point numbers in computers, then see what quantization is and how it works. I will explore topics like Asymmetric and Symmetric Quantization, Quantization Range, Quantization Granularity, Dynamic and Static Quantization, Post-Training Quantization and Quantization-Aware Training. (A short illustrative sketch of asymmetric quantization follows the chapter list below.)
    Code: github.com/hkproj/quantizatio...
    PDF slides: github.com/hkproj/quantizatio...
    Chapters
    00:00 - Introduction
    01:10 - What is quantization?
    03:42 - Integer representation
    07:25 - Floating-point representation
    09:16 - Quantization (details)
    13:50 - Asymmetric vs Symmetric Quantization
    15:38 - Asymmetric Quantization
    18:34 - Symmetric Quantization
    20:57 - Asymmetric vs Symmetric Quantization (Python Code)
    24:16 - Dynamic Quantization & Calibration
    27:57 - Multiply-Accumulate Block
    30:05 - Range selection strategies
    34:40 - Quantization granularity
    35:49 - Post-Training Quantization
    43:05 - Quantization-Aware Training
  • Science & technology
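
    A minimal, hedged sketch of the asymmetric quantization scheme covered in the video (my own variable names and toy data, not the code from the linked repository): the floating-point range [beta, alpha] is mapped onto unsigned 8-bit integers through a scale and a zero point, and dequantization approximately inverts the mapping.

    import numpy as np

    def asymmetric_quantize(x, bits=8):
        alpha, beta = x.max(), x.min()            # quantization range of the tensor
        scale = (alpha - beta) / (2 ** bits - 1)  # step size between integer levels
        zero_point = -np.round(beta / scale)      # integer that represents 0.0
        q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** bits - 1)
        return q.astype(np.uint8), scale, zero_point

    def asymmetric_dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.uniform(-2.0, 1.5, size=8).astype(np.float32)
    q, s, z = asymmetric_quantize(x)
    print(x)
    print(asymmetric_dequantize(q, s, z))  # close to x, up to quantization error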

Comments • 72

  • @zendr0
    @zendr0 6 months ago +32

    If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤

    • @savvysuraj
      @savvysuraj 4 months ago

      The content made by Umar is helping me a lot. Kudos to Umar.

  • @vik2189
    @vik2189 2 months ago +3

    Fantastic video! Probably the best 50 minutes I have spent on AI-related concepts in the past year or so.

  • @dariovicenzo8139
    @dariovicenzo8139 2 months ago +3

    Great job, in particular the examples of converting to and from integers not only with formulas but with actual numbers too!

  • @ankush4617
    @ankush4617 6 months ago +10

    I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!!
    I’m hoping that you will have a video on Mixtral MoE soon 😊

    • @umarjamilai
      @umarjamilai  6 months ago

      You read my mind about Mistral. Stay tuned! 😺

    • @ankush4617
      @ankush4617 6 months ago

      @@umarjamilai❤

  • @krystofjakubek9376
    @krystofjakubek9376 6 months ago +7

    Great video!
    Just a clarification: on modern processors floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code.
    HOWEVER, the reduction in size from a 32-bit float to an 8-bit integer does itself make the operations a lot faster. The cause is twofold:
    1) Modern CPUs and GPUs are typically memory bound, so, simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to shrink by 4x as well.
    2) Pretty much all machine learning code is vectorized. This means the processor, instead of executing each instruction on a single number, grabs N numbers and executes the instruction on all of them at once (SIMD instructions).
    However, most processors don't fix N; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32 bits to 8 bits we can do 4x more operations at once! This is likely what you mean by operations being faster.
    Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
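
    A rough, hedged illustration of point (1) in PyTorch (my own toy tensor and scale, not taken from the video's repository): the same tensor stored as int8 instead of float32 occupies 4x less memory, so 4x less data has to travel from memory to the processor.

    import torch

    x = torch.randn(1024, 1024)  # float32: 4 bytes per element
    # Hypothetical scale/zero-point, chosen only to obtain an int8 copy of the same data
    x_q = torch.quantize_per_tensor(x, scale=0.05, zero_point=0, dtype=torch.qint8)

    print(x.nelement() * x.element_size())      # 4194304 bytes (~4 MiB) of memory traffic
    print(x_q.nelement() * x_q.element_size())  # 1048576 bytes (~1 MiB) of memory traffic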

    • @umarjamilai
      @umarjamilai  6 months ago +2

      Thanks for the clarification! I was even going to talk about the internal hardware of adders (the carry-lookahead adder) to show how a simple operation like addition works and compare it with the many steps required for floating-point numbers (which also involve normalization). Your explanation nailed it! Thanks again!

  • @asra1kumar
    @asra1kumar 3 months ago +1

    This channel features exceptional lectures, and the quality of explanation is truly outstanding. 👌

  • @user-lg3jo6ih1t
    @user-lg3jo6ih1t 3 months ago +1

    I was searching for quantization basics and could not find relevant videos... this is a life-saver!! Thanks, and please keep up the amazing work!

  • @user-rk5mk7jm7r
    @user-rk5mk7jm7r 5 months ago +1

    Thanks a lot for the fantastic tutorial. Looking forward to more of the series on LLM quantization! 👏

  • @user-qo7vr3ml4c
    @user-qo7vr3ml4c 1 month ago +1

    Thank you for the great content, especially the explanation of how QAT aims for a wider minimum of the loss function and how that makes the model robust to errors introduced by quantization.

  • @AbdennacerAyeb
    @AbdennacerAyeb 6 months ago +4

    Keep Going. This is perfect. Thank you for the effort you are making

  • @user-td8vz8cn1h
    @user-td8vz8cn1h 3 months ago +1

    This is one of the few channels I subscribed to after watching a single video. Your content is very easy to follow and you cover each topic holistically with additional clarifications. What a man :)

  • @jiahaosu
    @jiahaosu 5 months ago +1

    The best video about quantization, thank you very much!!!! It really helps!

  • @myaseena
    @myaseena 6 months ago +1

    Really high-quality exposition. Also, thanks for providing the slides.

  • @mandarinboy
    @mandarinboy 5 months ago

    Great introductory video! Looking forward to GPTQ and AWQ

  • @Aaron-hs4gj
    @Aaron-hs4gj 3 months ago +1

    Excellent explanation, very intuitive. Thanks so much! ❤

  • @jaymn5318
    @jaymn5318 4 months ago +1

    Great lecture. A clean explanation of the field that gives an excellent perspective on these technical topics. Love your lectures. Thanks!

  • @ojay666
    @ojay666 3 months ago +1

    Fantastic tutorial!!! 👍👍👍 I’m hoping that you will post a tutorial on model pruning soon 🤩

  • @HeyFaheem
    @HeyFaheem 6 months ago +1

    You are a hidden gem, my brother

  • @NJCLM
    @NJCLM 5 months ago +1

    Great video ! Thank you !!

  • @sebastientetaud7485
    @sebastientetaud7485 4 months ago +1

    Excellent video! Grazie!

  • @koushikkumardey882
    @koushikkumardey882 6 months ago

    becoming a big fan of your work!!

  • @manishsharma2211
    @manishsharma2211 6 months ago

    beautiful again, thanks for sharing these

  • @RaviPrakash-dz9fm
    @RaviPrakash-dz9fm 1 month ago +1

    Legendary content!!

  • @ngmson
    @ngmson 6 months ago +1

    Thank you for sharing.

  • @bluecup25
    @bluecup25 6 months ago +1

    Thank you, super clear

  • @Youngzeez1
    @Youngzeez1 6 months ago +1

    Wow, what an eye-opener! I read lots of research papers but they are mostly confusing, yet your explanation just opened my eyes! Thank you. Could you please do a video on the quantization of vision transformers for object detection?

  • @aminamoudjar4561
    @aminamoudjar4561 6 months ago +1

    Very helpful, thank you so much

  • @TheEldadcohen
    @TheEldadcohen 5 months ago

    Umar, I've seen many of your videos and you are a great teacher! Thank you for your effort in explaining all of these complicated topics in plain (Italian-accented) English.
    Regarding the content of the video: you showed quantization-aware training and you were surprised by the worse result it gave in comparison to post-training quantization in the concrete example you made.
    I think it is because you calibrated the post-training quantization on the same data that you tested it on, so the learned parameters (alpha, beta) are overfitted to the test data; that's why the accuracy was better. I think that if you had tested it on truly held-out data, you probably would have seen the result you anticipated.
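
    A small, hedged sketch of the point above (the helper names prepared_model, calib_loader, and test_loader are hypothetical, not from the video's code): calibrate the observers on a split that is not the test set, then measure accuracy on the untouched test set.

    import torch

    def calibrate(prepared_model, calib_loader):
        # Observers attached by torch.quantization.prepare record the activation ranges (alpha, beta) here
        prepared_model.eval()
        with torch.no_grad():
            for x, _ in calib_loader:
                prepared_model(x)

    def accuracy(model, test_loader):
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        return correct / total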

  • @andrewchen7710
    @andrewchen7710 4 months ago +2

    Umar, I've watched your videos on Llama, Mistral, and now quantization. They're absolutely brilliant and I've shared your channel with my colleagues. If you're in Shanghai, allow me to buy you a meal haha!
    I'm curious about your research process. While preparing your next video, I think it would be neat if you documented the timeline of your research/learning and shared it with us in a separate video!

    • @umarjamilai
      @umarjamilai  4 months ago +1

      Hi Andrew! Connect with me on LinkedIn and we can share our WeChat. Have a nice day!

    • @Patrick-wn6uj
      @Patrick-wn6uj 3 months ago

      Glad to see fellow Shanghai people here hhhhhhh

  • @user-pe3mt1td6y
    @user-pe3mt1td6y 4 months ago

    Need more videos about advanced quantization!

  • @user-kg9zs1xh3u
    @user-kg9zs1xh3u 6 months ago +1

    Very good

  • @amitshukla1495
    @amitshukla1495 6 months ago +1

    wohooo ❤

  • @tetnojj2483
    @tetnojj2483 5 months ago

    Nice video :) A video on the .gguf file format for models would be very interesting :)

  • @ziyadmuhammad3734
    @ziyadmuhammad3734 27 days ago

    Thanks!

  • @user-hd7xp1qg3j
    @user-hd7xp1qg3j 6 months ago +1

    One request: could you explain mixture of experts? I bet you can break down the explanation well.

  • @asra1kumar
    @asra1kumar 3 months ago

    Thanks

  • @Erosis
    @Erosis 6 months ago +1

    You're making all of my lecture materials pointless! (But keep up the great work!)

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 6 months ago

    Thanks!

  • @tubercn
    @tubercn 6 months ago

    Thanks, great video 🐱‍🏍🐱‍🏍
    But I have a question: since we already dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantizations in a row.
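
    For readers unfamiliar with the stub modules mentioned above, a minimal sketch of PyTorch's eager-mode static quantization workflow (my own toy model and backend choice, not the video's code): QuantStub converts the float input to int8 at the entry, and DeQuantStub converts the final int8 activations back to float at the exit.

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()      # float32 -> int8 at the entry
            self.fc = nn.Linear(16, 4)
            self.dequant = torch.quantization.DeQuantStub()  # int8 -> float32 at the exit

        def forward(self, x):
            return self.dequant(self.fc(self.quant(x)))

    model = TinyNet().eval()
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend (my assumption)
    torch.quantization.prepare(model, inplace=True)   # attach observers
    model(torch.randn(32, 16))                        # calibration pass
    torch.quantization.convert(model, inplace=True)   # swap in quantized modules
    print(model)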

  • @swiftmindai
    @swiftmindai 6 months ago

    I noticed a small correction needs to be made at timestamp @28:53 [slide: Low precision matrix multiplication]. The first line mentions the dot products between each row of X and each column of Y [instead of Y, it should be W, the weight matrix].

    • @umarjamilai
      @umarjamilai  6 months ago +1

      You're right, thanks! Thankfully the diagram of the multiply block is correct. I'll fix the slides

  • @pravingaikwad1337
    @pravingaikwad1337 2 months ago

    For one layer Y = XW + b, if X, W, and b are quantized so that we get Y in quantized form, why do we need to dequantize this Y before feeding it to the next layer?
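
    An illustrative sketch of what happens around the multiply-accumulate block (my own toy numbers, using a simple symmetric per-tensor scheme rather than the exact code from the video): the integer accumulator lives on an arbitrary integer scale, so it is rescaled back to the floating-point range before being passed on (or re-quantized with the next layer's own parameters).

    import numpy as np

    def sym_quantize(x, bits=8):
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
        q = np.round(x / scale).astype(np.int32)  # int32 so the matmul does not overflow
        return q, scale

    X = np.random.randn(2, 4).astype(np.float32)
    W = np.random.randn(4, 3).astype(np.float32)

    Xq, sx = sym_quantize(X)
    Wq, sw = sym_quantize(W)

    Y_acc = Xq @ Wq                 # integer multiply-accumulate (the fast part)
    Y = sx * sw * Y_acc             # rescale ("dequantize") back to the float range
    print(np.abs(Y - X @ W).max())  # small error vs. the floating-point matmul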

  • @bamless95
    @bamless95 3 months ago

    Be careful, CPython does not do JIT compilation; it is a pretty straightforward stack-based bytecode interpreter.

    • @umarjamilai
      @umarjamilai  3 months ago

      Bytecode has to be converted into machine code somehow. That's also how .NET works: first C# gets compiled into MSIL (an intermediate representation), and then it just-in-time compiles the MSIL into the machine code for the underlying architecture.

    • @bamless95
      @bamless95 3 months ago

      Not necessarily, bytecode can just be interpreted in place. In a loose sense it is being "converted" to machine code, meaning that we are executing different snippets of machine code through branching, but JIT compilation has a very different meaning in the compiler and interpreter field. What Python is really doing is executing a loop with a switch branching on every possible opcode. By looking at the interpreter implementation in the CPython GitHub repo, in `Python/ceval.c` and `Python/generated_cases.c.h` (alas, YouTube is not letting me post links), you can clearly see there is no JIT compilation involved.

    • @bamless95
      @bamless95 3 months ago

      What you are saying about C# (and for that matter Java and some other languages like LuaJIT or V8 JavaScript) is indeed true: they typically JIT the code either before or during interpretation. But CPython is a much simpler (and thus slower) bytecode interpreter that implements neither JIT compilation nor any serious code optimization (aside from a fairly rudimentary peephole optimization step).

    • @bamless95
      @bamless95 3 months ago

      Don't get me wrong, I think the video is phenomenal. I just wanted to correct a little imperfection that, as a programming-language nerd, I feel is important to get right. Also, greetings from Italy! It is good for once to see a fellow Italian producing content that is worth watching on YT 😄

  • @AleksandarCvetkovic-db7lm
    @AleksandarCvetkovic-db7lm 2 months ago

    Could the difference in accuracy between static/dynamic quantization and quantization-aware training be because the model was trained for 5 epochs for static/dynamic quantization and only one epoch for quantization-aware training? I tend to think that 4 more epochs make more of a difference than the quantization method.

  • @lukeskywalker7029
    @lukeskywalker7029 3 months ago

    @Umar Jamil you said most embedded devices don't support floating-point operations at all? Is that right? What would be an example, and what is that chip architecture called? Does a Raspberry Pi or an Arduino operate only on integers internally?

  • @venkateshr6127
    @venkateshr6127 6 months ago

    Could you please make a video on how to build tokenizers for languages other than English?

  • @dzvsow2643
    @dzvsow2643 6 months ago

    Aslamu aleykum, brother.
    Thanks for your videos!
    I have been working on game development with pygame for a while and I now want to start deep learning in Python, so could you make a roadmap video?! Thank you again.

    • @umarjamilai
      @umarjamilai  6 months ago +1

      Hi! I will do my best! Stay tuned

  • @theguyinthevideo4183
    @theguyinthevideo4183 4 months ago

    This may be a stupid question, but what's stopping us from just setting the weights and biases to be in integer form? Is it due to the nature of backprop?

    • @umarjamilai
      @umarjamilai  4 months ago +1

      Forcing the weights and biases to be integers means adding more constraints to the gradient descent algorithm, which is not easy and is computationally expensive. It's like asking you to solve the equation x^2 - 5x + 4 = 0 but only over the integers: you can't just use the formula you learnt in high school for quadratic equations, because it returns real numbers.
      Hope it helps

  • @elieelezra2734
    @elieelezra2734 6 months ago +1

    Umar, thanks for all your content. I have improved a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does it mean you need to dequantize the prediction? If yes, you do not talk about it, right? Can I have your email to get more details, please?

    • @umarjamilai
      @umarjamilai  6 months ago +1

      Hi! Since the output of the last layer (the matrix Y) will be dequantized, the prediction will be "the same" (very similar) as that of the dequantized model. The Y matrix of each layer is always dequantized, so that the output of each layer is more or less equal to that of the dequantized model.

    • @alainrieger6905
      @alainrieger6905 6 months ago

      Hi, thanks for your answer @@umarjamilai
      Does it mean, for post-training quantization, that the more layers a model has, the greater the difference between the quantized and the dequantized model, since the error accumulates at each new layer? Thanks in advance

    • @umarjamilai
      @umarjamilai  6 months ago

      @@alainrieger6905 That's not necessarily true, because the error in one layer may be "positive" and in another "negative", and they may compensate for each other. For sure the number of bits used for quantization is a good indicator of the quality of quantization: if you use fewer bits, you will have more error. It's like having an image that is originally 10 MB and trying to compress it to 1 MB or to 1 KB: of course in the latter case you'd lose much more quality than in the first.

    • @alainrieger6905
      @alainrieger6905 6 months ago

      @@umarjamilai thank you, Sir! Last question: when you talk about dequantizing a layer's activations, does it mean that the values go back to 32-bit format?

    • @umarjamilai
      @umarjamilai  6 months ago +1

      @@alainrieger6905 yes, it means going back to floating-point format

  • @sabainaharoon7050
    @sabainaharoon7050 4 months ago

    Thanks!

  • @007Paulius
    @007Paulius 6 months ago

    Thanks