Running MPT-30B on CPU - You DON'T Need a GPU

  • Published 27 Jun 2023
  • In this video, I will show you how to run an MPT-30B model with an 8K context window on a CPU, without the need for a powerful GPU. We will be using the GGML 4-bit quantized models with the ctransformers package. A minimal loading sketch is included after the links below.
    Subscribe if you found this helpful!
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬
    ☕ Buy me a Coffee: ko-fi.com/prom...
    🔴 Support my work on Patreon: Patreon.com/PromptEngineering
    🦾 Discord: / discord
    ▶️️ Subscribe: www.youtube.co...
    📧 Business Contact: engineerprompt@gmail.com
    💼Consulting: calendly.com/e...
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    LINKS:
    MPT-30B Github: github.com/aba...
    The Bloke HF Repo: huggingface.co...
    LocalGPT: • LocalGPT: OFFLINE CHAT...
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    All Interesting Videos:
    Everything LangChain: • LangChain
    Everything LLM: • Large Language Models
    Everything Midjourney: • MidJourney Tutorials
    AI Image Generation: • AI Image Generation Tu...
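    A minimal sketch of the approach described above: loading a GGML 4-bit quantized MPT-30B model on the CPU with the ctransformers package. This is not the video's exact script; the repo name, file name, and settings are assumptions, so substitute whichever quantized .bin file you actually downloaded from TheBloke's Hugging Face repo.

    # Minimal sketch, assuming the ctransformers package and a 4-bit GGML
    # build of MPT-30B downloaded from Hugging Face.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/mpt-30B-chat-GGML",               # assumed Hugging Face repo name
        model_file="mpt-30b-chat.ggmlv0.q4_1.bin",  # assumed 4-bit quantized file name
        model_type="mpt",                           # GGML backend used by ctransformers
        max_new_tokens=256,                         # cap on generated tokens
        threads=8,                                  # CPU threads; tune for your machine
    )

    print(llm("What is the capital of France?"))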

Comments • 52

  • @TheInternalNet
    @TheInternalNet 11 minutes ago

    Okay, this is really amazing. I'm setting up some LLMs on my Dell dual-processor server next week. I will absolutely be using this to configure everything.

  • @oneville_0
    @oneville_0 1 year ago +5

    You are making amazing content these days.

  • @dfas1602
    @dfas1602 1 year ago +3

    Hyped for this video! You make great content. Thank you very much!

  • @alchemication
    @alchemication 1 year ago +6

    Thanks for sharing! I was so pumped to try it, but it took on average 130 seconds to generate a short response to my 800-character prompt 😢 I'm on a powerful MacBook Pro with 🍏 silicon. It might be okay for some offline batch processing, but for real time I conclude you still do need a GPU 😂

    • @gileneusz
      @gileneusz 1 year ago

      I ran this on an M1 MacBook Air with 16 GB of RAM; it used about 10 GB of swap just to run, and the output was even slower. But it's nothing to be mad about. This model was designed to run on a single A100 40GB, so it's quite impressive it's running at all. I would wait for more Orca-like models that will be smaller and better. They will appear in 1-3 months, imho.

    • @trivPZ
      @trivPZ 1 year ago

      Did you enable LLAMA_METAL when building llama.cpp and then pass -ngl 1 later?

    • @StijnSmits
      @StijnSmits 1 year ago

      @@gileneusz Of course, because the model itself needs 18GB; model size = VRAM needed.

  • @BoolitMagnet
    @BoolitMagnet 8 months ago

    Nice! Appreciate the full explanation of the code too.👌

  • @-someone-.
    @-someone-. 1 year ago +3

    How many 8 GB Pis running in a cluster would I need to do this? 👋👍

  • @HassanAllaham
    @HassanAllaham 1 month ago

    Thanks for the good content. I wonder how this compares to Ollama? 🌹🌹🌹

  • @MuhammadZubair-cu6cx
    @MuhammadZubair-cu6cx 1 year ago +1

    I have observed that when processing user queries, the CPU usage increases, but I do not receive a response.
    [user]: What is the capital of France?
    [assistant]:
    [user]

  • @doczooc
    @doczooc 11 months ago +1

    Unfortunately, according to lots of comments on the videos by 1littlecoder and Prompt Engineering on exactly this topic, this does not work under Windows. The AI answers just stay blank and the user prompt comes up again. It seems to work fine on Mac, as both tutorials show the exact same steps, both on Mac.
    Has anyone gotten this to run on Windows? If so, do you have any tips?
    I suspect that this may be an easy fix, where the output is generated but not correctly routed to the terminal. Task Manager shows 100% CPU load for a reasonable amount of time and then goes back to idle when the user: prompt reappears.

    • @engineerprompt
      @engineerprompt  11 months ago +1

      It might be related to your available RAM

    • @doczooc
      @doczooc 11 months ago

      @@engineerprompt I have 32 GB of RAM. Could anyone get it to run with this much or more?

    • @renanferreira7196
      @renanferreira7196 10 months ago

      @@doczooc I'm facing this problem too... Did you solve it?
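      One way to narrow this down is to stream tokens as they are generated, so you can see whether the model is producing output at all or truly returning an empty string. A hedged sketch using ctransformers' streaming mode (the model path is a placeholder):

      # Hedged sketch: print each token as it arrives instead of waiting for the
      # full response, to check whether output is generated but not displayed.
      from ctransformers import AutoModelForCausalLM

      llm = AutoModelForCausalLM.from_pretrained(
          "path/to/mpt-30b-chat.ggmlv0.q4_1.bin",  # placeholder local model path
          model_type="mpt",
      )

      for token in llm("What is the capital of France?", stream=True):
          print(token, end="", flush=True)  # tokens appear as they are generated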

  • @user-fy2nz4wh5f
    @user-fy2nz4wh5f 7 months ago

    How do you check on Hugging Face what RAM size will be needed?

  • @guysarkinskiy2401
    @guysarkinskiy2401 1 year ago

    Hi :)
    When is the MPT-30B model expected to be integrated with the localGPT repo?

  • @alx8439
    @alx8439 1 year ago

    What is the point of the "number of tokens to generate" setting? Is it some kind of adjustment that hard-stops the inference if the response would be longer than the value we set there?

    • @ThomasTomiczek
      @ThomasTomiczek 1 year ago +1

      Pretty much, yes. It can also allow the model to reject requests if input + output are larger than the context window, without doing possibly very long processing first.
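      A small, hedged sketch of how that setting maps to the ctransformers API: max_new_tokens caps how many tokens are generated, and generation hard-stops once the limit is hit (the model path below is a placeholder):

      # Hedged sketch: the "number of tokens to generate" setting corresponds to
      # max_new_tokens in ctransformers; output is cut off once the limit is reached.
      from ctransformers import AutoModelForCausalLM

      llm = AutoModelForCausalLM.from_pretrained(
          "path/to/mpt-30b-chat.ggmlv0.q4_1.bin",  # placeholder local model path
          model_type="mpt",
      )

      short = llm("Summarize the history of France.", max_new_tokens=64)    # at most 64 tokens
      longer = llm("Summarize the history of France.", max_new_tokens=512)  # at most 512 tokens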

  • @DoppsPkin
    @DoppsPkin 1 year ago

    Can you show how to offload some layers to the GPU, please?
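    For reference, ctransformers exposes a gpu_layers option for partial GPU offload; whether it takes effect for the MPT backend depends on your ctransformers version and whether it was built with GPU support, so treat this as a hedged sketch (the model path is a placeholder):

    # Hedged sketch: offload part of the model to the GPU via gpu_layers.
    # Requires a GPU-enabled build of the backend; with 0 everything stays on the CPU.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "path/to/mpt-30b-chat.ggmlv0.q4_1.bin",  # placeholder local model path
        model_type="mpt",
        gpu_layers=20,  # number of layers to offload to the GPU
    )

    print(llm("What is the capital of France?"))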

  • @oxytic
    @oxytic 1 year ago

    Hi bro, I am waiting for all your video tutorials; all are excellent. I religiously follow your LangChain tutorials; they are very comprehensive and easy to understand. I am requesting a video on "how to make a LangChain custom agent", like SalesGPT. Please make this video. This would really help many beginners like me and the whole community. Kindly do the needful, please.

    • @engineerprompt
      @engineerprompt  1 year ago +1

      Thank you for your comment. I am planning on making a lot more content on agents and langchain. Stay tuned.

  • @balubalaji9956
    @balubalaji9956 11 months ago

    M2 Max with 96 GB RAM ~ $4,200.
    I wonder how the M2 GPU compares with Nvidia for local running.

    • @engineerprompt
      @engineerprompt  11 months ago

      A 3090/4090 runs faster than the M2. Haven't tested the lower-end GPUs.

  • @tamanna-229
    @tamanna-229 1 year ago

    Is there any way to run it with 32 GB of RAM?

  • @joshbarron7406
    @joshbarron7406 1 year ago

    I have an M2 Max with 32 GB of RAM. I'm searching for the best open-source model I can run on it. If I can use the GPU, that's a huge plus.

    • @engineerprompt
      @engineerprompt  1 year ago +1

      Look at the ggml format models. You will be able to run 33B models on it.

  • @obamabinbiden9762
    @obamabinbiden9762 1 year ago +1

    I can't get localGPT working. Help please.

  • @adriangabriel3219
    @adriangabriel3219 1 year ago

    But 4-bit will greatly decrease the quality of the inference, correct?

    • @engineerprompt
      @engineerprompt  1 year ago

      Yes, it does, but I suspect it will still be better than the smaller models.

  • @ekstrajohn
    @ekstrajohn 1 year ago +2

    Did anyone test this model on a normal modern CPU? Can it generate a response in less than 120 seconds? I doubt it. :)

  • @Pingupinga123
    @Pingupinga123 1 year ago +1

    Would this work with Llama 2 70B?

  • @aquilaarchviz6968
    @aquilaarchviz6968 1 year ago

    When I ask the model, there is no answer generation; it shows assistant and then jumps back to user. 2x Xeon 2650 v4, 64 GB of RAM 😟

  • @daryladhityahenry
    @daryladhityahenry 1 year ago

    Hi! I'm trying this right away and getting a weird problem: the AI doesn't generate any words at all.
    Do you know why? I'm just asking "Who are you?". Lol!
    I waited for word generation for maybe 40-60 seconds (and looking at Task Manager, the CPU reacted to my prompt and was working), but the AI didn't generate anything.

    • @engineerprompt
      @engineerprompt  1 year ago

      If you look at the video, the model takes some time for generation to start. You might have to wait longer, but once the generation starts, it seems to be faster.

    • @daryladhityahenry
      @daryladhityahenry 1 year ago

      @@engineerprompt No, I mean it already finished. I can type again for the next chat, but what's generated is purely an empty string/text. Nothing.
      I tried to continue the chat 4 more times, and it's still empty...

    • @KiranShivaraju
      @KiranShivaraju 1 year ago

      @@daryladhityahenry same...

    • @shyamkrishnan9769
      @shyamkrishnan9769 1 year ago

      @@KiranShivaraju Same for me as well

    • @jazzkaur3581
      @jazzkaur3581 1 year ago

      @@shyamkrishnan9769 Same, disappointed the creator isn't helping.

  • @redleader7988
    @redleader7988 1 year ago +1

    I can't run MPT-30B on my RTX 3090, but I can on my CPU?

    • @jeffwads
      @jeffwads 1 year ago +2

      RAM dude. If you have ample vRAM you can do it.

    • @gileneusz
      @gileneusz 1 year ago +2

      MPT-30B was designed to run on A100, not on slow 3090 🤣😜

    • @redleader7988
      @redleader7988 1 year ago +3

      @@gileneusz RTX 3090 is one of the best consumer GPUs available. Call it slow if you want.

  • @ash1kh
    @ash1kh 1 year ago

    First of all, if someone doesn't have a stable internet connection, how does downloading from the CLI differ from downloading via the browser? Very questionable, unless there is a torrent link that enables pause and resume regardless of the connection. Secondly, if you are testing the AI with questions like "who was the first president" or "where is the city", you could instead keep an Excel sheet of president or city data and use Command+F/Ctrl+F to find the answer faster. Why use 20+ GB of RAM / a GPU for such a simple task? You lose my interest.

    • @HassanAllaham
      @HassanAllaham 1 month ago

      The video is not talking about downloading from the CLI. The point is that it uses the Hugging Face Hub downloader, which does not work well if the internet connection is bad: you will lose the part already downloaded if any disconnection happens, and there is no resume option when using the Hugging Face Hub downloader, which is a real problem when downloading huge files. In comparison, most browsers support resume (sometimes auto-resume) if the URL provider supports it, which is the case with Hugging Face.
      Concerning the testing: he mentioned he is not testing the model. He is just showing how to run LLMs on the CPU (I believe using ctransformers). The main goal is to show how to use LLMs without a powerful GPU (the cost of such a GPU is a real barrier for many people), provided the RAM size is enough. If you are interested in testing such a model, you should not depend on anyone else's tests; you must do it yourself, for your own use cases. You should not feel upset if some content is not what you want. Whatever content is provided, we should appreciate it, because it takes time and effort to produce, and not everyone has the hardware needed. Keep cool and have a nice day, my friend.