Quantize LLMs with AWQ: Faster and Smaller Llama 3
- Added 25 Apr 2024
- Explore how to make LLMs faster and more compact with my latest tutorial on Activation Aware Quantization (AWQ)! In this video, I demonstrate how to apply AWQ to quantize Llama 3, achieving a model that's not only quicker but also smaller than its non-quantized counterpart. Dive into the details of the process and see the benefits in real-time. If you found this video helpful, don't forget to like, comment, and subscribe for more insightful content like this!
Join this channel to get access to perks:
/ @aianytime
To further support the channel, you can contribute via the following methods:
Bitcoin Address: 32zhmo5T9jvu8gJDGW3LTuKBM1KPMHoCsW
UPI: sonu1000raw@ybl
GitHub: github.com/AIAnytime/Quantize...
Activation Aware Quantization Research paper: arxiv.org/pdf/2306.00978
Quantized Model on HF here: huggingface.co/skuma307/Llama...
#llama3 #genai #ai - Science & Technology
Fantastic video! I will watch the other videos. Definitely a very talented tutor here!
Cool!
Great video! Thank you for sharing.
Does quantizing a model make it less accurate? How many parameters will the quantized model have? If it is still 13B, how does quantizing make it faster?
New sub from the USA. May I suggest in-depth guide(s) on ontologies, knowledge graphs, and query analysis? Many thanks for the great info.
Can we collab on a project? Also CUDA vs Triton, and inference evaluations.
How do you turn research into code?
Can we work together?
Thanks!
How to quantize multimodal LLMs?
Doubt: in this case, we download the entire model first and then quantize it. Is there any way to quantize a model on the fly during loading? Since I'm GPU poor, I might not be able to run the entire model, and hence can't quantize. Please suggest something...
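One option worth knowing (a sketch, not the method from the video): Transformers can quantize on the fly at load time via bitsandbytes. Note that the full-precision weights are still downloaded; each shard is quantized as it loads, so it saves GPU memory, not bandwidth. The model id below is illustrative:

```python
# Sketch: on-the-fly 4-bit quantization at load time with bitsandbytes.
# Requires `transformers`, `bitsandbytes`, and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4-bit as they load
    bnb_4bit_quant_type="nf4",               # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,    # compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",                       # spread layers across available devices
)
```

Unlike AWQ, this skips calibration entirely, so quality can be slightly worse, but nothing has to be pre-quantized or re-downloaded.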
AWQ has lower throughput than the unquantized model when serving with vLLM. Do you know of any quantization methods that can also increase throughput?
+1
This is only true in inappropriate scenarios: when you don't have flash attention compiled, or when you are using an old GPU like the Colab T4.
Try to avoid the pre-made Docker images and make sure all your hardware is enabled to its best ability. Always use the latest Python 3.11.x and the latest CUDA developer toolkit 12.x.
Don't use the generic CUDA GPU drivers; use drivers that you build yourself or that are meant for your operating system.
Then this stupid argument that unquantized vLLM is faster is no longer true. Not many people want to take the time to learn about and properly prepare their inference systems. AWQ is meant to save memory; Exl2 is more about fitting a finetune into your available VRAM with its variable bpw and hb.
Is the "Fine Tuning of LLMs" playlist enough for finetuning any Llama model?
Yes!
@@AIAnytime Thanks for creating this type of playlist☺
1.58-bit seems so promising, but I understand it has to be part of the original training; you can't quantize post-training. Have you heard of anyone actually training models with this?
Noice