
NVIDIA Nemotron-4 340B Q8_0 running on AMD Epyc 9374F - real time generation speed

  • uploaded 11 Jul 2024
  • I was a bit disappointed that no one had yet run NVIDIA Nemotron-4 340B on a CPU, so I took up the challenge myself and here are the initial results after 3 days of work.

Comments • 6

  • @rehanahmed1939 · 16 days ago

    Can you please create a full video of how you did it, or point me to any other source where I can learn?

    • @dreamingfairy8804 · 15 days ago

      You want to run Nemotron-4 340B on your PC? The following steps are needed:
      1) Download the model from huggingface.co/nvidia/Nemotron-4-340B-Instruct
      2) Convert the model to safetensors format with this script: github.com/fairydreaming/export-nemo-to-safetensors (install the script's requirements first)
      3) Download my branch of llama.cpp: github.com/fairydreaming/llama.cpp/tree/nemotron
      4) Compile llama.cpp from source code
      5) Convert the safetensors model to GGUF with llama.cpp's convert_hf_to_gguf.py script (install the conversion script's requirements first)
      6) Quantize the model with llama-quantize so that it fits in your RAM
      7) Run the quantized model as shown in the video (see the command sketch below)
      Good luck!
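      A rough sketch of the commands for steps 3-7 (not from the video; directory names, output file names, and the thread count are placeholders, and it assumes the nemotron branch builds and converts like mainline llama.cpp):
      # steps 3-4: get the nemotron branch and build it
      git clone --branch nemotron https://github.com/fairydreaming/llama.cpp
      cd llama.cpp
      cmake -B build
      cmake --build build --config Release -j
      # step 5: convert the safetensors checkpoint to GGUF (install the Python requirements first)
      pip install -r requirements.txt
      python convert_hf_to_gguf.py /path/to/nemotron-safetensors --outfile nemotron-340b-f16.gguf --outtype f16
      # step 6: quantize to Q8_0 (as in the video) so the weights fit in RAM
      ./build/bin/llama-quantize nemotron-340b-f16.gguf nemotron-340b-q8_0.gguf Q8_0
      # step 7: run the quantized model
      ./build/bin/llama-cli -m nemotron-340b-q8_0.gguf -p "Hello" -n 128 -t 32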

  • @johann09 · 24 days ago

    How much RAM? This is insane

    • @dreamingfairy8804 · 24 days ago · +4

      12 channels x 32 GB = 384 GB
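      (Rough back-of-the-envelope, not a figure from the video: Q8_0 stores about 8.5 bits per weight, so the 340B weights alone take roughly 340e9 x 8.5 / 8 bytes ≈ 360 GB, leaving a few dozen GB of headroom in 384 GB for the KV cache and the OS.)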

    • @Philip8888888 · 24 days ago

      @dreamingfairy8804 Dare I ask, how much does such a server cost to buy, and how many watts does it burn during inference and at idle?