RoboTF AI
Mistral 7B LLM AI Leaderboard: Rules of Engagement and first GPU contender Nvidia Quadro P2000
This week in the RoboTF lab:
We go over the goals and rules of engagement, along with launching the Mistral 7B leaderboard!
We then bring in an older Quadro P2000 and put it to the test.
Leaderboard is live: robotf.ai/Mistral_7B_Leaderboard
Leaderboard reports (from these videos if you want a hands on look): robotf.ai/Mistral_7B_Leaderboard_Reports
Model in testing: huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/tree/main
Just a fun day in the lab, grab your favorite relaxation method and join in.
Recorded and best viewed in 4K
Views: 189

Video

Mistral 7B LLM AI Leaderboard: Baseline Testing Q3,Q4,Q5,Q6,Q8, and FP16 CPU Inference i9-9820X
180 views · 9 hours ago
Mistral 7B LLM AI Leaderboard: Baseline Testing Q3, Q4, Q5, Q6, Q8, and FP16 CPU Inference i9-9820X This week in the RoboTF lab: Setting the rest of the baselines for a Mistral 7B Leaderboard with CPU inference. We will test Q4, Q5, Q6, Q8, and FP16 quants and then bring them together with Q3 from the last video for our full set of baselines. Stay tuned for the leaderboard. Leaderboard is live: robotf....
Mistral 7B LLM AI Leaderboard: Baseline Testing Q3 CPU Inference i9-9820X
251 views · 19 hours ago
Mistral 7B LLM AI Leaderboard: Baseline Testing CPU Inference i9-9820X This week in the RoboTF lab: Setting a baseline for a Mistral 7B Leaderboard with CPU inference....more to come! Stay tuned for the leaderboard. Leaderboard is live: robotf.ai/Mistral_7B_Leaderboard Leaderboard reports (from these videos if you want a hands on look): robotf.ai/Mistral_7B_Leaderboard_Reports Model in testing:...
LocalAI LLM Testing: Part 2 Network Distributed Inference Llama 3.1 405B Q2 in the Lab!
734 views · 21 days ago
Part 2 on the topic of Distributed Inference! This week we are taking Llama 3.1 405B at a Q2 quant running 8k of context through the gauntlet with several GPUs and across nodes in a distributed swarm of llama.cpp workers! The whole lab is getting involved in this one to run a single giant model. Both GPU Kubernetes nodes: 3x 4060Ti 16GB, 6x A4500 20GB, 1x 3090 24GB. LocalAI docs on distributed infe...
LocalAI LLM Testing: Distributed Inference on a network? Llama 3.1 70B on Multi GPUs/Multiple Nodes
2.9K views · 1 month ago
This week in the RoboTF lab: Blown power supply, saying goodbye to some of the 4060's, and most importantly hitting the topic of Distributed Inference! It's a long video... This week we are taking Llama 3.1 70B at a Q5 quant running 56k of context through the gauntlet with several GPUs and across nodes in a distributed swarm of llama.cpp workers! The whole lab is getting involved in this one to run a...
LocalAI LLM Testing: Llama 3.1 8B Q8 Showdown - M40 24GB vs 4060Ti 16GB vs A4500 20GB vs 3090 24GB
5K views · 1 month ago
3090 24GB has joined the lab! Llama 3.1 models released this week. This week we are taking Llama 3.1 8B (huggingface.co/mradermacher/Meta-Llama-3.1-8B-Instruct-GGUF) at a Q8 quant running 32k of context through the gauntlet with several GPUs: Tesla M40 24GB, 4060Ti 16GB, A4500 20GB, 3090 24GB. Link to blog on Llama 3.1 and memory requirements: huggingface.co/blog/llama31 Just a fun night in the lab, ...
LocalAI LLM Testing: How many 16GB 4060TI's does it take to run Llama 3 70B Q4
6K views · 1 month ago
Answering some viewer questions and running Llama 3 70B Q4 K M with the 4060Ti's - how many does it take to run it? Just a fun night in the lab, grab your favorite relaxation method and join in. Recorded and best viewed in 4K
LocalAI LLM Testing: Can 6 Nvidia A4500's Take on the WizardLM 2 8x22b?
1.1K views · 2 months ago
Taking viewers along for a ride with the newer main GPU node configuration in a quick test with 6x Nvidia A4500's up against a 100GB model of WizardLM 2 8x22B Q5 KM. Just a fun night in the lab, grab your favorite relaxation method and join in. Recorded and best viewed in 4K
LocalAI LLM Testing: Viewer Questions using mixed GPUs, and what is Tensor Splitting AI lab session
1.7K views · 2 months ago
Attempting to answer good viewer questions with a bit of testing in the lab. We will be taking a look at using different GPUs in a mixed scenario, along with going down the route of tensor splitting to get the best out of your mixed GPU machines. We will be using LocalAI, and an Nvidia 4060 Ti with 16GB VRAM along with a Tesla M40 24GB. Grab your favorite after work or weekend enjoyment tool and watch s...
What's on the Robotf-AI Workbench Today?
472 views · 2 months ago
What's on the Robotf-AI Workbench Today? 7 GPU Node - 6x A4500's 20GB and a single 4060TI 16GB
LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500
6K views · 2 months ago
Sitting down to run some tests with i9 9820x, Tesla M40 (24GB), 4060Ti (16GB), and an A4500 (20GB) Rough edit in lab session Recorded and best viewed in 4K
LocalAI Testing: Viewer Question LLM context size, & quant testing with 2x 4060 Ti's 16GB VRAM
1.3K views · 3 months ago
Attempting to answer a good viewer question with a bit of testing in the lab. We will look at how context size affects VRAM usage, and also address speed testing with different quant sizes with Codestral 22B. We will be using LocalAI, and two Nvidia 4060 Ti's with 16GB VRAM each. Grab your favorite after work or weekend enjoyment tool and watch some GPU testing Recorded and best viewed in 4K
LocalAI LLM Single vs Multi GPU Testing scaling to 6x 4060TI 16GB GPUS
9K views · 5 months ago
An edited version of a demo I put together for a conversation amongst friends about single vs multiple GPUs when running LLMs locally. We walk through testing from a single GPU up to 6x 4060TI 16GB VRAM GPUs. Github Repo: github.com/kkacsh321/st-multi-gpu See the Streamlit app and results here: gputests.robotf.ai/ Recorded and best viewed in 4K

Comments

  • @luisff7030 · 3 hours ago

    I made a test with LM Studio with only 1 GPU, a 4060TI 16GB:
    100% CPU -> 1.38 tk/s (0/80 GPU offload)
    CPU + GPU -> 2.00 tk/s (31/80 GPU offload)
    100% GPU -> 0.42 tk/s (80/80 GPU offload)
    CPU: Ryzen 9 7900 (105W power config), GPU: MSI RTX4060TI 16GB (core 2895 MHz + VRAM 2237 MHz), RAM: DDR5 96GB 6000 MT/s, model: mradermacher Meta-Llama-3-70B-Instruct.Q4_K_M.gguf

  • @user-ik3jh7kr5n · 1 day ago

    What software is this? I mean the GUI that you use; where can I download it?

    • @RoboTFAI · 1 day ago

      The testing platform? That's a custom built streamlit/python/langchain app I built specifically for my lab - so it's not really an app I distribute

  • @user-gu2sh1ke8n · 1 day ago

    What a zoo of graphics cards! You are a true enthusiast of your craft :)

  • @DarrenReidAu · 3 days ago

    Great breakdown. Since Ollama support for AMD has become decent, a good bang for the buck is the MI50 16GB. I did a similar test for comparison and it comes in a bit above the 4060ti for output, with prompt tokens faster due to sheer memory speed (HBM2). ~20 toks/sec out. Not bad for a card that can be had on eBay for $150-$200 USD.

  • @twinnie38 · 3 days ago

    Really impressive, congrats! Do you know the impact of a limited PCIe bus (x1, x4 Gen3) on those GPU cards?

  • @Matlockization · 4 days ago

    I don't believe anyone on YouTube is doing this. Well, at least that is what the YouTube algorithm is telling me. On a separate note, I can remember a few years back when one could use an AMD and an Nvidia GPU (Maxwell) together and the game would sort out how much of each GPU was used, as back then both companies had vastly different architectures; that was until Nvidia %ucked things up by putting a stop to all that.

  • @Magic-mz5ww · 4 days ago

    I'm curious about AVX-512 capable CPUs 🤔 The llamafile people said that you can get even 10x in tok/s! 😊

  • @246rs246 · 4 days ago

    Could you try to focus on the accuracy and correctness of the responses generated? Just don't ask typical questions that models are often trained on; try to find something creative that tests the models' authentic logical thinking. Thanks

  • @marekkroplewski6760

    Reflection-70B is all the rage currently. How about trying to run that? There are already some quantized versions. The dude that pulled that model off is Matt Shumer.

    • @RoboTFAI · 4 days ago

      I actually have already tried it out, and it's fairly impressive so far, but I know there is some craziness going on around the model, repo, huggingface, etc. so I'm waiting that out before we run it through its paces. It's very verbose: a 6x A4500 cluster on the current Q8 quant pulls ~6 TPS, and to give an idea of how verbose it is, our typical ~100 token prompt to create a Github Action results in a ~1300 token response in an initial run 🤔

  • @OutsiderDreams · 4 days ago

    Can you talk a little more about the hardware you used outside of the GPUs? What mobo, cpu, ram? Thanks!

  • @Ray88G · 6 days ago

    Would 4x 3090's perform better?

    • @RoboTFAI · 5 days ago

      absolutely, but def 💰 involved. I have another video (and more coming) that pits a 3090 vs some other cards/etc if interested.

  • @Magic-mz5ww · 6 days ago

    Super awesome 👍 I was looking for some hardware benchmarks! Want to build a "homelab for LLMs"

    • @RoboTFAI · 5 days ago

      Hey thanks for that! Stay tuned, lots more coming!

  • @Matlockization · 7 days ago

    Great video, but I need a magnifying glass to make out the names and numbers of anything.

    • @RoboTFAI · 7 days ago

      Thanks for the feedback, I've tried to do better for those with smaller screens in more recent videos.

    • @Matlockization · 6 days ago

      @@RoboTFAI Thank you.

  • @JarkkoHautakorpi · 7 days ago

    Instead of Llama13b please try for example llama3.1:70b-instruct-q8_0 to use all VRAM?

    • @RoboTFAI · 7 days ago

      Yep for sure, we do that in a lot of other videos. This one was just focused on showing that scaling out GPUs isn't for speed, it's for memory capacity, to settle a convo amongst some friends.

  • @ricardocosta9336 · 7 days ago

    Can you talk more about your kubernetes and hypervisor setup?

    • @RoboTFAI · 7 days ago

      Coming! Though I don't do hypervisors much anymore (one node under TrueNAS but mainly for my github actions runners through ARC) - mostly bare metal nodes that are low power (n100/i5/etc/etc).....minus the GPU nodes of course!

  • @Matlockization · 8 days ago

    1. Is it possible to run the LLM on both the CPU & GPU at the same time?
    2. And how come AMD GPUs aren't used that much in AI?
    3. What do you believe is the minimum Nvidia GPU for AI?
    4. How important is the amount of RAM?

    • @RoboTFAI · 7 days ago

      1. Yes! Normally controlled by the `gpu_layers` setting in the model config, which determines how many layers to offload to the GPU(s); the rest will use RAM/CPU (a config sketch follows below).
      2. Nvidia is just mainstream, and their software support is pretty far ahead. AMD is definitely being used as well; you don't hear about it as much, but there are tons of large orgs doing big clusters of AMD-based cards.
      3. That depends on your needs and your expectations of model response times (TPS). Most models can run on a good CPU if you are patient enough for the responses.
      4. Not that important UNLESS you want to be able to do #1 and split models, or run them purely on CPU inference. If so, you want as much RAM as possible (same thing we all want from our GPUs!).
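
      For anyone who wants to try that split, a minimal sketch of what such a LocalAI model config can look like is below (the file name, model file, and layer count are placeholders; see the LocalAI docs for the full set of supported fields):

        # models/mistral-7b.yaml (hypothetical example)
        name: mistral-7b
        parameters:
          model: Mistral-7B-Instruct-v0.3.Q4_K_M.gguf   # GGUF file in the models directory
        f16: true        # use FP16 where supported
        gpu_layers: 33   # layers offloaded to the GPU(s); the rest run from RAM on the CPU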

    • @Matlockization · 6 days ago

      @@RoboTFAI Thank you for your generous response. And I'm now a subscriber.

  • @user-wg3rr9jh9h · 8 days ago

    Does the code produced by the LLM actually work? If you have sufficient RAM, can you show Q4 and Q8 benchmarks?

    • @RoboTFAI · 7 days ago

      I don't typically test for accuracy; there are lots of channels out there doing that, and it is very subjective depending on your needs. As an engineer, I find most LLMs are good at "boiler plating" code or giving you a place to start and iterate on. The rest of the quants for the Mistral 7B baselines are coming in the next video, and we will compare Q3, Q4, Q5, Q6, Q8, and even FP16. These will be used as baselines for future testing!

  • @ZIaIqbal · 9 days ago

    Can you try to run the llama3.1 405B model on the CPU and see what kind of response we can get?

    • @RoboTFAI · 7 days ago

      I haven't tried on pure CPU inference, but I did do it with distributed inference over the network in another video. We can certainly try as I have nodes with the RAM to do it in.

    • @ZIaIqbal · 7 days ago

      @@RoboTFAI oh, can you send me the link to the other video, I would be interested to see how you did the distributed setup.

    • @RoboTFAI · 6 days ago

      @@ZIaIqbal czcams.com/video/CKC2O9lcLig/video.html - is the Llama 3.1 405B Distributed inference video. It's using LocalAI (Llama cpp workers/etc) under the hood: LocalAI docs on distributed inference: localai.io/features/distribute/ Llama.cpp docs: github.com/ggerganov/llama.cp...

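      For anyone wanting to wire this up by hand, the underlying llama.cpp RPC flow looks roughly like the sketch below (paths, IPs, ports, and the model file are placeholders; LocalAI's p2p mode handles this wiring for you):

        # On each worker node: build llama.cpp with RPC support and start an RPC server
        cmake -B build -DGGML_RPC=ON && cmake --build build --config Release
        ./build/bin/rpc-server -H 0.0.0.0 -p 50052

        # On the driving node: point llama.cpp at the workers and offload layers to them
        ./build/bin/llama-cli -m Llama-3.1-405B.Q2_K.gguf \
          --rpc 192.168.1.10:50052,192.168.1.11:50052 \
          -ngl 99 -c 8192 -p "Hello from the lab"
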
  • @InstaKane · 10 days ago

    Can you test the phi-3.5 model? Would I be able to run it with 2 RTX 3090?

  • @hienngo6730 · 10 days ago

    If your model fits on a single GPU (which your 13B Q_K_S model does), there's no benefit to running it across multiple GPUs. In fact, at best you'll be flat performance-wise spreading the model across more than one GPU. Generally, there will be a slight performance penalty having to coordinate across GPUs for the LLM inference. The primary benefit of multiple GPUs is to run bigger models like a 70B or Mixtral 8x7B that do not fit on a single GPU, or to run batched inference using vLLM. The smaller the model (7B/8B and below in particular), the more impact single-threaded CPU performance will have on the tokens per second speed. For a LLaMA-2 13B Q5_K_S model on an Intel i9 13900K + 4090, for example, I get 82 tokens per second:
    llama_print_timings: eval time = 11228.67 ms / 921 runs ( 12.19 ms per token, 82.02 tokens per second)
    On the same machine using a 3090, 71 t/s:
    llama_print_timings: eval time = 8536.02 ms / 614 runs ( 13.90 ms per token, 71.93 tokens per second)
    If you took one of those 4060 Ti cards and put it into a gaming PC with a current gen i7/i9 or Ryzen X3D CPU, you should see a big improvement in tokens per second.

  • @ArtificialLife-GameOfficialAcc

    Undervolt the 3090 and it will give you basically the same performance at around 220-250 watts.

  • @whatdafuck-f6y · 12 days ago

    This is crazy. So what would you choose, 8 A4000 or 2 A5000?

  • @rhadiem · 14 days ago

    Me with a 4090 and a 1500W PSU, chuckling about your concern for burning the house down at 300W. :D I tripped a 15A breaker earlier this week; found out the outlet I have my microwave plugged into is on the same circuit as my home office, and I must have been pushing the GPU at the time. So glad I don't have insane power costs like Europe. Btw, if you need a "the youtubers asked for it, it's a business write-off" excuse for a 4090... you really need a 4090 for testing data comparisons for the people. :D Thanks for all the tests. Would love to see a 4070 Ti in there too, to fight with the 4060.

    • @RoboTFAI · 12 days ago

      Haha - but I have been known to burn up a power supply or three, a big UPS, a couple of breakers, etc. Luckily the lab has a few dedicated 20 amp circuits these days. There is absolutely a fire extinguisher hanging in the workshop/lab! Hey, I would love to buy a 4090, and a million other cards! To be honest, I never really planned on a channel; I put up a video from a discussion with friends (basically to prove them wrong with data) and somehow you folks seem to like what this crazy guy does in his lab? If the channel continues to grow and happens to make money one day, I'm happy to throw it all back into the channel. For now my budget is not much 💸

    • @rhadiem · 11 days ago

      @@RoboTFAI Haha well you've earned this sub, curious what you end up testing next. "Not much" as you have a handful of $1k gpu's. Carry on good sir. o7

  • @rhadiem · 14 days ago

    It seems to me, looking at the output of the 4060's running here, that the 4060 is a bit too slow to be a productive experience for interactive work, and is better suited for automated processes where you're not waiting on the output. I see you have your 4060's for sale, would you agree on this? What is your take on the 4060 16gb at this point?

    • @RoboTFAI · 12 days ago

      I think the 4060 (16GB) is a great card - people will flame me for that but hey. It's absolutely useable for interactive work if you are not expecting lightning-fast responses. Though on small models the 4060 really flies for what it is and how much power they use. Lower, lower, lower your expectations until your goals are met..... I did sell some of my 4060's but only because I replaced them with A4500's in my main rig so they got rotated out. I kept 4 of them which most days are doing agentic things in automations/bots/etc while sipping power or in my kids gaming rigs.

  • @rhadiem · 14 days ago

    I would love to know what the cheapest GPU you'd need to use for a dedicated text to speech application running something like XTTSv2 for having your LLM's talking to you as quickly as possible. I imagine speed will be key here, but how much VRAM and what is fast enough? Inquiring minds... I mean we all want our own Iron Man homelab with JARVIS to talk to right?

    • @RoboTFAI · 12 days ago

      I don't work much with TTS, or STT - but we can go down that road and see where it takes us

  • @rhadiem · 14 days ago

    Just saw this, thanks for testing. I already have a 4090, but definitely chasing the almighty VRAM for testing bigger models and running different things at one time. What would you recommend for system ram to run a 6x GPU setup like this?

    • @RoboTFAI · 12 days ago

      That highly depends on your needs, wants, and wallet - if you want the ability to offload to RAM for any reason (super large models, or keeping KV in ram), the more RAM the better. Otherwise if offloading to the GPUs fully you only really need enough to run the processes which is fairly minimal. My rigs are Kubernetes nodes so I keep them stacked to be able to do anything with them, not just inference/etc.

    • @hienngo6730 · 10 days ago

      The biggest issue you'll run into is the lack of PCIe lanes on the motherboard. If you have a consumer motherboard with a recent gen Intel or AMD CPU, you will only be able to run 2 GPUs, 3 if you're lucky. You will likely need to use PCIe extension cables to space out your GPUs, as consumer GPUs are generally 3+ slots wide, so you will get no cooling if you stack them next to each other. Once you go over 2 GPUs, you have to go to workstation or server motherboards to get the needed PCIe lanes; so EPYC, XEON, or Threadripper motherboards are needed. Best to buy used from eBay if your wallet doesn't support a brand new Threadripper setup. Best bang for the buck will be a used EPYC MB + used 3090s. You'll need multiple 1000+ W power supplies to feed the GPUs as well (I run with 2x 1600W).

  • @AlienAnthony · 15 days ago

    Could you make this run on Jetson Orins?????

    • @RoboTFAI · 12 days ago

      Good question - want to sponsor me some to play with? 😁

    • @AlienAnthony · 12 days ago

      If I get the funds; I was interested in them myself for the price over VRAM + power consumption. I would invest in cluster software for these. Speed might be diminished, but it's certainly a cheaper option than building a server for inference only, using self-contained devices.

  • @firsak · 17 days ago

    Please record your screen in 8K next time, I'd like to put my new microscope to good use.

  • @k1tajfar714 · 18 days ago

    Thank you for the great video! I have actually zero bucks, so I hope you can give me recommendations on this one. I currently have spent 200 bucks on an X99-WS motherboard, so I'll have 4 PCIe slots at full x16 if I don't hook up any M.2 NVMe drives, I assume. So that's awesome; it also has a 10C/20T Xeon, low profile, 32GB RAM, and an okayish CPU cooler. I have already saved $200 more and I don't know what to do. I was going to buy one or two P40s and later upgrade to 4 of them, but now I cannot even afford one; they're going for almost 300 bucks, I'm afraid. One option is to go with M40s, but I'm afraid they're trash for LLMs and specifically for Stable Diffusion stuff. They're pretty old, although your video shows they're quite good. I'm lost and I'd love to get help from you. If you have time we can discuss it; I can mail you or anything you'd think is appropriate. Special thanks. K1

    • @RoboTFAI · 12 days ago

      Feel free to reach out, tis a community! I have several M40's from when I first started down this road that I would be willing to part with.... it's a slippery slope

    • @k1tajfar714 · 8 days ago

      @@RoboTFAI You're fantastic! Thanks. I'd love to reach out; would appreciate having your email or something so I can discuss! Maybe we can make a deal on your M40s if you have any spare ones that you don't use? Thanks.

    • @RoboTFAI · 6 days ago

      @@k1tajfar714 robot@robotf.ai or can find me on reddit/discord/etc - though not as active as I would like to be.

  • @alzeNL · 19 days ago

    very interesting and great work !

  • @nickmajkic1436 · 22 days ago

    Would you be able to make a tutorial on getting LocalAI working in Kubernetes?

    • @RoboTFAI · 12 days ago

      Sure, I think that's overdue at this point!

  • @mckirkus · 22 days ago

    What's the network bandwidth? I wonder what could be done if you connected to a bunch of buddies with gigabit symmetrical fiber connections.

    • @RoboTFAI · 12 days ago

      As much as you can pump for distributing the model - during inference it's really only about 10-20 MB/s per node

  • @marekkroplewski6760 · 23 days ago

    Dad! Where did my gaming rig go!!! Now listen up there Junior, this is for science. Just don't tell your Mum. And you can have the car keys for Saturday.

    • @RoboTFAI · 12 days ago

      Better than stealing their GPU's out of their rigs right? 😂

  • @nickmajkic1436 · 23 days ago

    You probably have this in another video but what are you using for server monitoring in the background?

    • @RoboTFAI · 23 days ago

      I assume you are referring to Grafana (with Prometheus), along with the DCGM exporter that is part of the Nvidia GPU Operator for Kubernetes: docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
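
      For anyone rebuilding that stack, the GPU Operator (which bundles the DCGM exporter) is normally installed via Helm along these lines (release and namespace names are just examples; the docs page above is the authoritative reference):

        helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
        helm install gpu-operator nvidia/gpu-operator \
          --namespace gpu-operator --create-namespace --wait
        # Prometheus then scrapes the DCGM exporter's GPU metrics, and Grafana graphs them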

  • @Anurag_Tulasi · 23 days ago

    It would be more intelligible if your results mentioned "higher is better" or "lower is better" beside the chart headings.

    • @RoboTFAI · 23 days ago

      Thanks for the feedback!

  • @_zproxy · 23 days ago

    Wild. Does this work for LLaVA images too?

  • @tbranch227 · 23 days ago

    Congrats! You unlocked a massive achievement running this on your own hardware!!! All hail the AI and the kilowatts we feed them

    • @RoboTFAI · 12 days ago

      Skynet is possible in your basement 🦾

  • @andriidrihulias6197 · 23 days ago

    First

  • @yvikhlya · 25 days ago

    So, each time you add more resources to your system, you make it slower. That's pretty bad. Why bother adding more nodes? Just run everything on a single node.

  • @TerenceGardner · 25 days ago

    I honestly think this is the coolest AI related channel on youtube. I hope it keeps growing.

    • @RoboTFAI · 23 days ago

      Wow thanks! Much appreciated

  • @jksoftware1 · 26 days ago

    GREAT video... Learned a lot from this video. It's hard to find good AI benchmark videos on YouTube.

  • @LaDiables · 27 days ago

    This is neat. However, a question: it appears to be slow for serialized prompts. Does sending parallel/batched prompts change the equation in terms of total tok/sec?

  • @blast_0230 · 28 days ago

    Super video 👍 Can you try some AMD GPUs like the MI50 ($120) with 16GB VRAM and, if you have the budget, the MI100 please? I like this bench content.

  • @testales · 1 month ago

    If I connect a worker it goes to CPU mode "create_backend: using CPU backend" - what am I missing? I've installed local-ai on both computers and I can do local inference (p2p off) on both and the GPUs are used in that case.

    • @RoboTFAI · 1 month ago

      How are you running it? Just local-ai directly or through docker? You would still need to set the correct environment variables (and with docker pass in "--gpus all") when starting the worker. If you are using docker just for the workers/etc make sure you have the nvidia-container-toolkit installed also (pre-req for passing NVIDIA cards to docker containers). If all that is covered, I would need more info on the setup. Feel free to reach out and I can attempt to help
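
      As a rough illustration of the docker route mentioned above (the image tag and worker arguments are assumptions based on the LocalAI distributed-inference docs; the token comes from the server instance):

        # Host pre-req: nvidia-container-toolkit, so docker can pass the GPU through
        docker run --rm --gpus all \
          -e TOKEN="<p2p token from the server>" \
          localai/localai:latest-gpu-nvidia-cuda-12 \
          worker p2p-llama-cpp-rpc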

    • @testales · 1 month ago

      @@RoboTFAI Thanks for responding, and let me say that there are not many videos out there about getting distributed inference running, so your take on this is most welcome! I don't use docker and tried to keep it as simple as possible to reduce the number of error sources. I just used the curl command at the very top of the installation page (curl http[:] localai io [/] install.sh | sh). This installs local-ai as a service. So I changed and set the required environment variables in the service or the environment file (ADDRESS, TOKEN, LD_LIBRARY_PATH) and installed the latest version of the Nvidia toolkit, since my 12.4 and 550 driver were already too old and I got errors at first. Now I'm at toolkit 12.6 and driver version 560, and local inference works.

      So far I only tested with the Meta Llama 3.1 8B model in Q4, which can be installed directly via the web UI. I then enabled P2P and set the token on the server side in another environment variable, so it stays the same every time. I created a worker, also as a service, on my second machine to connect to the first using that token. The connection is successful and I can also do chats, but only on CPU.

      I've then simplified it even more: disabled all services, switched to the service user, and ran local-ai (run with --p2p) as server on the main machine and another instance as worker on both machines, all in terminal sessions. Both workers connect, but in CPU mode. I don't know if that is supposed to be the case, but on the page in the screenshots you can see the same. What's in the log on your workers? I get something like this:
      {"level":"INFO","time":"2024-08-10T16:46:54.792+0200","caller":"node(...)","message":" Starting EdgeVPN network"}
      create_backend: using CPU backend
      Starting RPC server on 127.0.0.1:44469, backend memory: 63969 MB

      There are no errors on any of the 3 running instances; the clients show the connection to the server instance, and the server instance does server things. But it takes ages just to load the model and the inference is on CPU. None of the involved GPUs loads anything. Also I wonder how the model is supposed to work; I had expected that there must be a local copy of it on the clients too, but that doesn't seem to be the case. Yet transferring 40-50GB of model data over the LAN each time you load a 70B model is very inefficient. I couldn't find any documentation on this issue either. Edit: Reposted, seems mentioning a curl request is forbidden now...

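      As a minimal sketch of the non-docker flow described above (the worker subcommand follows the LocalAI distributed-inference docs; the token is whatever the server instance prints or has pinned via its environment):

        # Server instance, with p2p enabled
        local-ai run --p2p

        # Worker on the second machine, joining with the same token
        TOKEN="<token from the server>" local-ai worker p2p-llama-cpp-rpc
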
    • @RoboTFAI · 4 days ago

      Yeah, I agree the loading across the network is fairly inefficient, but the tech is also really new still. As far as the setup goes, do you have your model set up for GPU (gpu_layers, f16, etc)? localai.io/features/gpu-acceleration/

    • @testales · 3 days ago

      @@RoboTFAI I had described every tiny bit in detail and even included some log messages, but censortube deleted my answer. Twice. I didn't notice it did it again, and since a month has now passed, I don't remember the details anymore. I've just checked that f16 is set in the yaml, but I have not specified layers since the model should fit in either system's VRAM. It also runs in GPU mode if I use it locally, but remotely it connects to the server system only in CPU mode.

  • @marekkroplewski6760 · 1 month ago

    A useful comparison would be to test Llama 3.1 8B against 70B and distributed 405B. Since you can already run a model, spreading it over more nodes is not useful by itself. So running a larger model distributed vs a smaller model, and comparing quality and inference speed, is a useful test. Great channel!

  • @JazekFTW · 1 month ago

    Can the motherboard handle that much power draw through the 3.3V and 5V rails of the PCIe slots for the GPUs, without powered risers or extra PCIe power like other workstation motherboards have?

  • @Johan-rm6ec · 1 month ago

    With these kinds of tests, 2x 4060 Ti 16GB must be included, and how it performs. 24GB is not enough, and 32GB on a Quadro is around 2700 euros, so it seems like a sweet spot that you should cover. Know your audience, know the sweet spots, and those are the videos people want to see.

    • @RoboTFAI · 1 month ago

      Adding in 2x 4060's won't really increase the speed over 1 of them, at least not noticeably. There are some other videos on the channel addressing this topic a bit. Scaling out on the number of video cards is really meant to just gain you that extra VRAM. So it's always a balance of your budget, costs, power usage, and your expectations (this is the more important one). Lower, lower your expectations until your goals are met! haha

  • @Johan-rm6ec · 1 month ago

    What I would like to know: with 2x 4060 Ti 16GB, is it more usable with LM Studio and various models, or is a 4070 Ti Super 16GB a better option? Cost is about the same here, around 900 euros.

  • @senisasAs · 1 month ago

    As you asked :) it would be nice to see a HOW-TO. Really nice content and topic 👍

    • @RoboTFAI · 1 month ago

      Awesome, thank you! How-tos are coming eventually; gotta find the time, which I have none of!

  • @steveseybolt · 1 month ago

    why??????