Building a GPU cluster for AI

  • Added 24. 06. 2024
  • Whitepaper: lambdalabs.com/gpu-cluster/ec...
    Learn, from start to finish, how to build a GPU cluster for deep learning. We'll cover the entire process, including cluster-level design, rack-level design, node-level design, CPU and GPU selection, power distribution, storage, and networking.
    This talk is based on the Lambda Echelon GPU Cluster whitepaper. The whitepaper can be found above.
    Slides for the talk can be found here:
    files.lambdalabs.com/How%20to%...
    Errata:
    - Slide 46 contains an erroneous diagram showing a connection from the storage server to the compute fabric network. The storage server does not connect to the compute fabric network. The correct diagram is available in the whitepaper.
  • Science & Technology

Comments • 48

  • @randahan215
    @randahan215 10 months ago +11

    Extraordinary presentation. Covered all the important topics in depth and with real teaching talent. Many thanks!!

  • @dr.mikeybee
    @dr.mikeybee 8 months ago +3

    Thank you. You got me started years ago with your lambda stack -- the only way I could get TensorFlow installed on Linux.

  • @randahan215
    @randahan215 5 months ago +1

    Most professional and holistic explanation I heard about this topic.
    Thank you so much!!

  • @yassinebouchoucha
    @yassinebouchoucha 1 year ago +3

    Thank you for highlighting underrated topics/options that companies should reconsider within their compute infrastructure.

  • @NSPK-
    @NSPK- 9 months ago

    Very expert suggestions for hpc and compute sizing.

  • @ilyboc
    @ilyboc 8 months ago

    Really good analysis and presentation!

  • @brianwesley28
    @brianwesley28 3 years ago +4

    Thanks for the video.

  • @carlschumacher5510
    @carlschumacher5510 2 years ago +19

    It's nice to see a holistic explanation of designing / building / installing a complex multi-rack system... As someone who has spent years working on both sides of the "analog/digital divide" (the physical data center world / the digital world's various segments), the un-sexy physical aspects of available rack space / power / cooling / floor loading / network uplink bandwidth are often overlooked (often assumed)... A semi arrives with a pallet: "Hey Carl, you can have this online in a couple days, right?"

    • @lambdacloud
      @lambdacloud 2 years ago +6

      Hey Carl, thanks for the kind comment. Glad you like the video. It's always funny how difficult it can be to 'bridge the divide' between the physical world and virtual world. Many SWEs expect to be able to "spin up" 1000 servers with an API call and forget that there are actual physical objects and tons of people that actually make that happen when you're on-prem.

  • @peterxyz3541
    @peterxyz3541 1 year ago +19

    Thanks. I’m planning on building a “massive” 2 GPU system for home use.

    • @fundoo203
      @fundoo203 8 months ago +3

      How did it go man? I also want to build something like that and then stumbled on this video, which is excellent

  • @sanaullah-qureshi
    @sanaullah-qureshi 1 year ago +2

    Very informative, thank you.

  • @user-us7oi6jw5i
    @user-us7oi6jw5i 4 days ago

    Really complete, thank you!

  • @ProjectPhysX
    @ProjectPhysX 2 years ago +22

    Lots and lots of A100 GPUs. Every single one of them is a monster, almost 2x faster memory than the next best GPU. An entire room full of A100 racks... holy cow.

  • @anatolystrashkevich7621
    @anatolystrashkevich7621 1 year ago +1

    very informative, thanks!

  • @austynr
    @austynr 1 year ago +1

    Genius bait and switch. Props!

    • @metal_mo
      @metal_mo 1 year ago

      Lambda needs an explanation on the difference between "building" and "designing".

  • @vtrandal
    @vtrandal 2 years ago +2

    Excellent.

  • @thePyiott
    @thePyiott 2 years ago +1

    Great insight!

  • @uzairqarni7782
    @uzairqarni7782 5 months ago

    This was amazing. Thank you.

  • @natexetan5732
    @natexetan5732 3 years ago +1

    thanks for the inspiration

  • @cyberspider78910
    @cyberspider78910 1 month ago

    Highly appreciated...CZcams should have a separate category called Founder's video.

  • @HarishN.J
    @HarishN.J 1 month ago

    Hey Stephen, this is highly informative. I work on this kind of clustering. Now I am able to connect the dots and get the bigger picture.
    Where can I read about the relationship between NUMA topology and GPU peering capability?

  • @glennisholcomb592
    @glennisholcomb592 10 months ago +1

    I have three computers, a NAS, and an external hub. I think that I don't need another server because of the NAS. As far as my architecture goes, is there anything else that you can advise?

  • @petevenuti7355
    @petevenuti7355 10 months ago

    What if I have a model that I just want to run as provided, it hasn't really been optimized to run around the cluster and has memory requirements greater than any individual system I have. I feel safe to assume that for that specific case a shared distributed memory model would be the solution to run that specific app, yes? Is there any distribution of Linux that has support for such a memory model? It doesn't have to be a full-blown single system image. Perhaps a patch to the memory management driver so storage can be treated as an extension of system memory and not swap memory?
    Does any such software exist?

  • @eyadmufti
    @eyadmufti 1 year ago

    it is a lecture more than a tutorial, Thx.

  • @Bloodycub666
    @Bloodycub666 1 year ago

    I just love this kind of thing. How can I start this kind of business? How can I find customers for, like, a small node and start building up?

  • @loadmastergod1961
    @loadmastergod1961 1 month ago

    I want to build a multi dual epyc 7742 based system for goofing around learning this stuff.

  • @programmingwiththotho4641
    @programmingwiththotho4641 5 months ago

    You are insane, thank you

  • @ikbo
    @ikbo 2 years ago

    Do you guys have a GPU cluster optimized for 3D rendering?

  • @chaoticblankness
    @chaoticblankness 5 months ago

    Very Based

  • @rosenangelow6082
    @rosenangelow6082 8 months ago +1

    A "tell me how difficult it is so I can buy your solution" kind of talk.

  • @nathanthomas9395
    @nathanthomas9395 2 years ago

    Do Lambda products (GPU cluster) ship with a manual to help you set up the servers for use?

  • @mengxu2026
    @mengxu2026 3 years ago +2

    Our group ordered around 10 lambda PCs 1 year ago. Right now more than 5 have problems. Some of them do not start up. Mine gets stuck randomly....

    • @yugr
      @yugr 3 years ago

      Have you tried looking into the reasons?

    • @lambdacloud
      @lambdacloud 3 years ago +3

      Meng Xu, you can email support@lambdalabs.com 24/7 or call +1 (866) 711-2025 during business hours. Sorry to hear you're having issues, I'm sure we'll be able to resolve them quickly.

    • @danielleza908
      @danielleza908 1 year ago +1

      Our team has 5 Lambda laptops; they have worked perfectly for over a year now.
      We also have a workstation with 3 GPUs, works great too.

  • @jleonardoperez5402
    @jleonardoperez5402 3 months ago

    Looking for work would love to help

  • @ravnodinson
    @ravnodinson 9 months ago

    Hell yes Lambda Lambda Lambda.

  • @meng-hub
    @meng-hub 11 months ago

    Does it work in man????

  • @julianfiacconi709
    @julianfiacconi709 1 year ago

    Still most relevant today, 2 years later. Thanks.

  • @JustPlainRob
    @JustPlainRob 5 months ago

    Now if only I was a billionaire so I could make use of this great information...

  • @thinkinginsomething1859
    @thinkinginsomething1859 10 months ago

    Half Life man!

  • @huaveihuavei1045
    @huaveihuavei1045 3 years ago

    headeggs

  • @harshikamahesh9459
    @harshikamahesh9459 2 months ago

    Talk about what you're an expert in. Don't talk about useless stuff without knowing all the facts.

  • @mikepict9011
    @mikepict9011 1 year ago

    This dude's in full submission mode. Sad.

  • @orthodoxNPC
    @orthodoxNPC 2 years ago

    speak UP