Quantize LLMs with AWQ: Faster and Smaller Llama 3

  • Published Apr 25, 2024
  • Explore how to make LLMs faster and more compact with my latest tutorial on Activation Aware Quantization (AWQ)! In this video, I demonstrate how to apply AWQ to quantize Llama 3, achieving a model that's not only quicker but also smaller than its non-quantized counterpart. Dive into the details of the process and see the benefits in real time. A minimal code sketch of the quantization flow is included after the links below. If you found this video helpful, don't forget to like, comment, and subscribe for more insightful content like this!
    Join this channel to get access to perks:
    / @aianytime
    To further support the channel, you can contribute via the following methods:
    Bitcoin Address: 32zhmo5T9jvu8gJDGW3LTuKBM1KPMHoCsW
    UPI: sonu1000raw@ybl
    GitHub: github.com/AIAnytime/Quantize...
    Activation Aware Quantization Research paper: arxiv.org/pdf/2306.00978
    Quantized Model on HF here: huggingface.co/skuma307/Llama...
    #llama3 #genai #ai
  • Science & Technology
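
A minimal sketch of the AWQ quantization flow covered in the video, using the AutoAWQ library (the model id, output path, and quantization config below are illustrative assumptions, not taken from the video or its repo):

    # Minimal AWQ quantization sketch (AutoAWQ); values below are assumptions for illustration.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed base model id
    quant_path = "llama-3-8b-instruct-awq"               # assumed output directory
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the full-precision model and its tokenizer.
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run activation-aware quantization (AutoAWQ runs its own calibration pass internally).
    model.quantize(tokenizer, quant_config=quant_config)

    # Save the 4-bit quantized weights and tokenizer for later loading.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

The saved folder can then be loaded like any other Hugging Face checkpoint, or pushed to the Hub.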

Comments • 17

  • @mehmetbakideniz 14 days ago

    Fantastic video! I will watch the other videos. Definitely a very talented tutor here!

  • @joserfjunior8940 4 days ago

    Cool!

  • @Suparious 19 days ago +1

    Great video! Thank you for sharing.

  • @IdealVijay- 4 days ago

    Does quantizing a model make it less accurate? How many parameters will the quantized model have? If it is 13B, then how does quantizing the model make it faster?

  • @BeegBrain-zy7qc 19 days ago

    New sub from the USA. May I suggest in-depth guide(s) on ontologies, knowledge graphs, and query analysis? Many thanks for the great info.

  • @christiand6312 13 days ago

    Can we collab on a project? Also CUDA vs. Triton, and inference evaluations.
    How do you turn research into code?
    Can we work together?

  • @cristianaguilar4253 19 days ago

    Thanks

  • @thisurawz 19 days ago

    How do you quantize multimodal LLMs?

  • @maitreyazalte6971 18 days ago

    Doubt: In this case, we download the entire model first and then quantize it. Is there any way to quantize a model on the fly during loading? Since I'm GPU poor, I might not be able to download the entire model, and hence can't quantize it. Please suggest something...

  • @lazypunk794 19 days ago +1

    AWQ has lower throughput than the unquantized model when serving with vLLM. Do you know of any quantization methods that can also increase throughput?

    • @nashtashasaint-pier7404 19 days ago

      +1

    • @ShaunPrince 19 days ago +1

      This is only true in inappropriate scenarios, where you don't have flash attention compiled, or if you are using an old GPU, like the Colab T4.
      Try to avoid using the pre-made Docker images, and ensure that all your hardware is enabled to its best ability. Always use the latest Python 3.11.x and the latest CUDA developer toolkit 12.x.
      Don't use the stock CUDA GPU drivers; use drivers that you build yourself or that are meant for your operating system.
      Then this stupid argument that unquantized vLLM is faster is no longer true. Not many people want to take the time to learn about and properly prepare their inference systems. AWQ is meant to save memory; Exl2 is more for tuning the quantization to your available VRAM with its variable bpw and hb.
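
      For context, serving an AWQ checkpoint with vLLM can be as simple as the sketch below (the model path and sampling settings are assumptions, not taken from this thread):

          # Minimal sketch: serving an AWQ-quantized model with vLLM; values are assumptions.
          from vllm import LLM, SamplingParams

          # Point vLLM at the quantized folder and tell it the weights are AWQ.
          llm = LLM(model="llama-3-8b-instruct-awq", quantization="awq", dtype="half")
          sampling = SamplingParams(temperature=0.7, max_tokens=128)

          outputs = llm.generate(["Explain activation-aware quantization in one sentence."], sampling)
          print(outputs[0].outputs[0].text)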

  • @ragibhasan2.0 19 days ago +1

    Is "Fine Tuning of LLMs" playlist enough for finetuning any llam model?

  • @IdPreferNot1 18 days ago

    1.58-bit seems so promising, but I understand it has to be part of the original training; you can't post-training quantize. Have you heard of anyone actually training models with this?

  • @sneharoy3566 19 days ago

    Noice