MI210s vs A100 -- Is ROCm Finally Viable in 2023? Tested on the Supermicro AS-2114GT-DNR

  • Added Jul 16, 2023
  • Wendell discusses the race in machine learning, going over Google's, Nvidia's, and AMD's tech to see who's got what in 2023.
    *********************************
    Check us out online at the following places!
    bio.link/level1techs
    IMPORTANT Any email lacking “level1techs.com” should be ignored and immediately reported to Queries@level1techs.com.
    -------------------------------------------------------------------------------------------------------------
    Intro and Outro Music: "Earth Bound" by Slynk
    Edited by Autumn
  • Science & Technology

Comments • 242

  • @kazriko
    @kazriko Před 11 měsíci +294

    AMD needs two modes, "Accurate" and "Green Team Inaccuracy"

    • @TheHighborn
      @TheHighborn Před 11 měsíci +18

      More like, red dot accurate, noisy green data

    • @sinom
      @sinom Před 11 měsíci +35

      Calling it "green mode" and justifying it as "oh it uses less power because it is less accurate" might actually be something they could do

    • @ac3d657
      @ac3d657 Před 6 měsíci

      You need two modes... wrapped, and dropped

  • @TAP7a
    @TAP7a Před 11 měsíci +118

    ROCm seems to have planted itself in the scientific HPC world, let’s hope it can grow from there

    • @tappy8741
      @tappy8741 Před 11 měsíci +12

      With CDNA, yes. With RDNA 1/2/3 they've severely dropped the ball and didn't adequately make it clear that that was the plan all along. On the consumer side, which is where hobbyist compute lives, the 6950 XT was the first card to approach the Radeon VII for a traditional (non-AI/ML) scientific workload. The 7000 series is actually worse, as they cut FP64 performance, and the memory model with Infinity Cache split 5/6 ways (and/or something else) seems to have hurt this specific workload (OpenCL, which is why it can be tested).
      George Hotz to the rescue would be awesome.

  • @jannegrey593
    @jannegrey593 Před 11 měsíci +47

    Well - I hope for some competition. Standards are fine, but one company owning them is very monopolistic. And AMD's disadvantage seemed to be lack of software rather than hardware.

  • @flamingscar5263
    @flamingscar5263 Před 11 měsíci +60

    Honestly, I'm hopeful for ROCm on consumer hardware soon, and on Windows. If you're someone that uses any form of creative app like Blender or the Adobe suite then you know how valuable CUDA is, so this really could be the boost AMD needs. I've been trying my best to recommend AMD, but it's surprising how many people go Nvidia because of how much better Nvidia is in creative apps, even if they don't use them; it's always "well I might want to use them in the future so I'll just go Nvidia".
    Soon there will be little excuse not to go AMD, and I'm all for it. Competition is good. Not that I'm in any way an AMD fanboy; I know for a fact that if somehow AMD dethroned Nvidia as the market leader they would pull the same shit Nvidia does, but competition is what is meant to stop that.

    • @GlacikingTheIceColdKing
      @GlacikingTheIceColdKing Před 11 měsíci +6

      Funnily enough, they most likely won't use them in the future. I've seen a lot of people use the same argument to go Nvidia, but they don't even install any creative applications after buying their GPUs.
      Also, I've been using AMD for about 7 months; it isn't necessarily horrible for people who just want to do video editing with Premiere Pro and Illustrator or Photoshop. I use those programs regularly and face no problems with them.

    • @reekinronald6776
      @reekinronald6776 Před 10 měsíci +3

      Yup. For about a decade I was scratching my head over why AMD had such a lousy software strategy. It had great hardware, but the drivers and the lack of tools or APIs for programmers just seemed like a huge business mistake. A perfect example was the time and resources spent on AMD's ProRender. Considering the multitude of professional and high-quality open source renderers, ProRender was a pointless exercise; better to spend the manpower and money on driver development, or even on OpenCL while it was viable.
      At least with ROCm they now seem to understand that everything that is needed to support the hardware is as important as the hardware itself.

  • @dirg3music
    @dirg3music Před 11 měsíci +41

    I completely agree. If history has shown us anything, it's that when Lisa Su goes all in on something, that something tends to work and work well. I'm just excited to see the market get more diverse as opposed to "CUDA or gtfo"; closed ecosystems like that are bad for everyone.

    • @psionx1
      @psionx1 Před 11 měsíci +3

      Except it was AMD's own fault that CUDA became the standard for GPU compute work, and they still have not learned that adding features to hardware and slapping them on the box is not enough to win. They actually have to provide support and funding to develop third-party software that uses the features of the hardware.

    • @makisekurisu4674
      @makisekurisu4674 Před 10 měsíci +2

      ​@@psionx1 Give them a break, they are running on less than half the budget, so of course they'd have to pick and choose their fights

  • @datapro007
    @datapro007 Před 11 měsíci +47

    I hope to heck it is. It's been Nvidia or nothing until now. Terrific video, Wendell. I like that you have content for the working folks.

  • @Bill_the_Red_Lichtie
    @Bill_the_Red_Lichtie Před 11 měsíci +20

    I am such a geek, "Can't believe it's not CUDA" made me actually laugh out loud.

  • @nickelsey
    @nickelsey Před 11 měsíci +74

    Tensorflow never directly competed with CUDA, it sits on top of CUDA - Tensorflow's primary competitor was (and still is) Pytorch. Both Tensorflow and Pytorch can be run on TPUs, but of course Tensorflow has 1st class support. Both Tensorflow and Pytorch have 1st class support for CUDA. I suspect the real reason Tensorflow hasn't been as popular lately is two-fold. First, a lot of internal Google development resources have moved on to develop JAX instead of TF, and secondly (and more importantly), Pytorch is simply better than Tensorflow. It's significantly more enjoyable and easier to use. And the reason CUDA has beaten out TPUs is also simple - you can only get TPUs using Google Cloud, whereas every cloud, every enterprise datacenter, and every school had direct access to CUDA-capable devices. Everyone uses and develops for them, whereas TPUs and the XLA compiler are basically only developed by Google.
    Also, in deep learning we actually don't mind the reduced accuracy for many problems. In fact, a mix of 32 bit and 16 bit is the *default* data format for deep learning now. Reduced precision deep learning is extremely important for large scale neural network development - for three reasons. First, obviously, if you use fewer bits for your model, you can fit a larger model in a single GPU's memory, which makes development easier. Second, the Tensor Cores basically double their FLOPS every time you halve the precision of your data. So if you have 256 TOPS using 32 bit floating point data, then you have 512 using FP16 data, and 1024 TOPS using FP8 data. Even further compression work is being done for INT8 and even INT4. Finally, one of the most important and oft-overlooked issues is that many neural net architectures require very high GPU memory bandwidth - that's why data center GPUs use HBM. When you reduce your data from 32 bit to 16 bit floats, you halve the memory bandwidth pressure.
    We won't consider AMD cards until they're competitive at FP16 performance with CUDA, and even then, AMD would REALLY need to convince us that their software stack works as seamlessly as CUDA does - you have to add wasted developer and data scientist time to the total cost of the device to get a proper apples-to-apples comparison. We just started getting our H100 deliveries in, and they are truly beasts. I'm hoping we can get some AMD hardware in for benchmarking at some point soon.
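
    For illustration, here is a minimal sketch of the mixed-precision pattern described above in PyTorch; the model and data are throwaway placeholders, and the same autocast/GradScaler recipe is what real training loops use:

        import torch

        # Tiny placeholder model; the point is the autocast / GradScaler pattern.
        model = torch.nn.Linear(1024, 1024).cuda()
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        scaler = torch.cuda.amp.GradScaler()

        for step in range(10):
            x = torch.randn(64, 1024, device="cuda")
            y = torch.randn(64, 1024, device="cuda")
            opt.zero_grad(set_to_none=True)
            # Matmuls run in FP16 on the tensor cores; reductions stay in FP32.
            with torch.cuda.amp.autocast(dtype=torch.float16):
                loss = torch.nn.functional.mse_loss(model(x), y)
            scaler.scale(loss).backward()  # loss scaling avoids FP16 gradient underflow
            scaler.step(opt)
            scaler.update()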

    • @nexusyang4832
      @nexusyang4832 Před 11 měsíci +8

      Pin this comment above.

    • @seeibe
      @seeibe Před 10 měsíci +8

      It all sounds viable from the hobbyist / small company standpoint. But come on, if you can afford H100s, you're big and successful enough that you can just invest in AMD as a backup plan. This would basically be the equivalent of Valve saying "All PC gamers are on Windows, so we won't invest in Linux". At a certain point, you're the one who has to make it happen.

    • @GeekProdigyGuy
      @GeekProdigyGuy Před měsícem

      Reduced precision is NOT the same as violating the FP standards. Going from FP32 to FP16 is a reduction in precision, but if the hardware implements the standards correctly, an FP16 calculation should have the exact same result no matter what card you run it on. Fudging the calculations probably doesn't make a huge difference for most ML applications, but for companies that need auditability (eg finance) or even big tech companies that want to debug an issue affecting a million users out of their billion users... Standards compliance is important, and Nvidia needs to fix their shit.

  • @ProjectPhysX
    @ProjectPhysX Před 11 měsíci +22

    We have both an MI210 64GB and an A100 40GB for my FluidX3D OpenCL software. Both cards are fine and the software runs flawlessly, but they are super expensive. Value in terms of VRAM capacity is better on the MI210, yet performance (actual VRAM bandwidth) is better on the A100. Somehow the memory controllers on AMD cards are not up to the task: 1638 GB/s promised, 950-1300 GB/s delivered. The A100 actually delivers its 1500 GB/s. Compute performance is irrelevant for such HPC workloads; only VRAM capacity and bandwidth count.
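
    A rough way to see delivered (rather than advertised) VRAM bandwidth is to time a large device-to-device copy. A minimal PyTorch sketch, assuming a CUDA or ROCm build where the GPU shows up as "cuda":

        import time
        import torch

        n = 1 << 28                                  # 2^28 float32 values = 1 GiB
        src = torch.empty(n, dtype=torch.float32, device="cuda")
        dst = torch.empty_like(src)

        for _ in range(3):                           # warm-up
            dst.copy_(src)
        torch.cuda.synchronize()

        iters = 20
        t0 = time.perf_counter()
        for _ in range(iters):
            dst.copy_(src)
        torch.cuda.synchronize()
        dt = time.perf_counter() - t0

        moved = 2 * src.numel() * src.element_size() * iters  # each copy reads and writes
        print(f"effective bandwidth: {moved / dt / 1e9:.0f} GB/s")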

    • @mdzaid5925
      @mdzaid5925 Před 9 měsíci +3

      What a time we are living in.... ~1000 GB/s is not enough 😅

    • @ProjectPhysX
      @ProjectPhysX Před 9 měsíci +3

      @@mdzaid5925 crazy right? Transistor density and with it compute power (Flops/s) has grown so fast in the last decade that memory bandwidth cannot keep up. Today almost all compute applications are bandwidth-bound, meaning the CPU/GPU is idle most of the time waiting for data. Even at 2 TB/s.

    • @mdzaid5925
      @mdzaid5925 Před 9 měsíci

      @@ProjectPhysX True..... not sure about the performance implications, but computing has evolved very, very rapidly. When I think about how small each transistor is, and how many and how closely they are packed, it feels impossible. Personally, I feel that eventually analog neural networks will take over and the GPU's role should be reduced to only training / assisting the analog chipsets. Also, I don't have too much faith in the current generation of "AI" 😅.

    • @Teluric2
      @Teluric2 Před 3 měsíci +1

      What kind of setup do you use with this software? Windows? Red Hat?

    • @ProjectPhysX
      @ProjectPhysX Před 3 měsíci

      @@Teluric2 for these servers openSUSE Leap, for others mostly Ubuntu Server minimal installation.

  • @s1ugh34d
    @s1ugh34d Před 11 měsíci +29

    We need more high end AI comparisons like this. Hope you get more gear to test!

  • @leucome
    @leucome Před 11 měsíci +34

    I got a 7900xt when ROCm 5.5 came out, specifically to use with A1111. It works pretty well. To give an idea, I tried 32 images of Danny DeVito at 768px, 20 samples; it took 2:30 min. That was with 8x4 batches; if I do 16x2 it takes 2:40, and for 32x1 it took 3 min. So yeah, the performance is there. I can just imagine how fast the MI300 will be.

    • @sebastianguerraty6413
      @sebastianguerraty6413 Před 10 měsíci +1

      I thought ROCm was only supported on very few 6xxx GPUs from AMD and on their server class GPUs

    • @chrysalis699
      @chrysalis699 Před 10 měsíci +5

      @@sebastianguerraty6413 ROCm 5.5 fixed that. It added gfx1100 and thus 7xxx support. I've been custom compiling pytorch with every new release of ROCm. Can't wait for them to start leveraging the AI accelerator cores in the 7xxx series. Whether that is CUDA compatible, and will be exposed via HIP, still needs to be seen.

    • @sailorbob74133
      @sailorbob74133 Před 10 měsíci

      @@chrysalis699 when you compile pytorch for gfx1100, how much of an uplift do you get over stock pytorch? What benefits do you see from the custom compile in general?

    • @chrysalis699
      @chrysalis699 Před 10 měsíci +2

      @@sailorbob74133 The stock pytorch compiled against rocm 5.4.2 doesn't detect my card at all, so the uplift is infinity 🤣. I doubt there is much difference for RX 6xxx cards, and there is still quite a bit of unlocked potential in the RX 7xxx cards, as I haven't seen any HIP APIs for the AI accelerators. There is actually barely any mention of them on AMD's site, just an obscure reference on the RX 7600. Probably have to wait for CDNA 3 for those APIs to be released.
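
      A quick sanity check for whether a given PyTorch build actually sees the card (the version strings in the comments are just examples):

          import torch

          print("torch:", torch.__version__)    # e.g. "2.1.0+rocm5.6"
          print("hip:", torch.version.hip)      # None on a CUDA-only build
          print("cuda:", torch.version.cuda)    # None on a ROCm build
          print("gpu visible:", torch.cuda.is_available())
          if torch.cuda.is_available():         # ROCm devices still show up as "cuda"
              print("device:", torch.cuda.get_device_name(0))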

    • @chrysalis699
      @chrysalis699 Před 10 měsíci

      I just noticed that pytorch nightly is now compiled against ROCm 5.6, so I'll probably just switch to those. 🤞 the next release will be built against 5.6

  • @astarothgr
    @astarothgr Před 11 měsíci +12

    The worst thing about ROCm is the hit'n'miss support for commodity GPUs. Back in the 3.x / 4.x days of ROCm, commodity GPUs were half-heartedly supported, with bugs, and sometimes support retroactively withdrawn. These days at least they tell you that if you buy anything other than the W-series of GPUs (i.e. W6800) they don't promise anything.
    This, however, will not increase the mind share; all students and budget-strapped researchers just buy off-the-shelf Nvidia GPUs and go to work. If you've picked a commodity GPU card and are trying to get ROCm to work, be ready for tons of frustration; really, this use case is unsupported.
    Source: my own experience with ROCm 4.x, using rx480/580, vega 56/64 and Radeon VII (the only one that worked reasonably well).

    • @mytech6779
      @mytech6779 Před 11 měsíci +4

      I would add to the student/budget research thing, they may not be looking for high performance, but they do need the full feature set to do the primary development work, then once working and somewhat debugged they will upgrade to get performance.
      Even for big-budget ops it makes no sense to have top-end hardware sitting there depreciating for a year or four while the dev team runs experimental test builds. By the time it comes to a real production run another purchase will be needed anyway.
      That core functionality problem has always been AMD's GPU problem: promises that seem good on paper but ultimately don't deliver. "Oh yeah, now that we have your money, it turns out you need this specific version of PCIe with that CPU subfamily, on these motherboards with this narrow list of our cards (as we have terrible product line numbering, so many in the same apparent series don't work) made in these years, with that specific release of this OS...."
      Years ago I bought a W7000 (well over $1000, 12 years ago) specifically because I wanted to play with the compute side, and there were claims that it had compatible drivers and such (I use Linux; Nvidia had terrible support). Nah, oops, something in the GCN 1.x arch was screwed up and compute was never usable, even after several major changes in drivers and supposed open sourcing. It worked OK for graphics, but my graphics needs are minimal.
      Later, I switched to a much newer and cheaper equivalent-performance consumer AMD card that claimed OpenCL support; nah, again it doesn't really.
      Gave me a rather bad taste for AMD. I'm hoping Intel can push some viable non-proprietary alternative to CUDA; I'm due for a new system in the next couple of years.

  • @solidreactor
    @solidreactor Před 11 měsíci +23

    Rumor says that ROCm might work for RDNA3 on Windows this fall (per the repo & comments). However, something similar was said earlier for 5.6, and that might not be true anymore?
    I really hope the consumer RDNA cards could run ROCm on Windows and act both as an evaluation path for the CDNA platform and as an entry point for AI compute, to democratize AI access.
    Having ROCm support on consumer cards on Windows might also build traction with other companies (like Tiny Corp) to embrace the more open solution; who knows, maybe that will tip the scale in AMD's favor?

    • @flamingscar5263
      @flamingscar5263 Před 11 měsíci +2

      Everything points towards that being the case. AMD hasn't said anything officially, but in-development documents leaked suggesting a fall time frame.
      It will happen eventually, even if not this fall. AMD knows how far behind they are on the consumer side for creative work; they need this.

    • @stevenwest1494
      @stevenwest1494 Před 11 měsíci +5

      I'm hanging my GPU choice on this date, because honestly I don't want an RTX 3060 12GB and NGreedier's horrible GeForce Experience 🤮 but I want to get into Stable Diffusion. A 3080 12GB is just waaaay too much still! What I really want is an RX 6800, with ROCm for Windows!

    • @eaman11
      @eaman11 Před 11 měsíci +2

      Intel says the same thing, that their stack works on Windows too.

    • @mytech6779
      @mytech6779 Před 11 měsíci +2

      AMD will see a squirrel by then and abandon yet another project with half implemented "support". Why they would even mess with Windows support at this point is dumbfounding, most systems in this realm run Linux unless they are forced to Windows by some 3rd party need for proprietary crap. Windows may still be king of Ma and Pa Kettle's desktop but that isn't this target market segment.

    • @reekinronald6776
      @reekinronald6776 Před 10 měsíci +1

      @@mytech6779 I would like to see a segment breakdown between corporate GPU computing and consumer. I would still think the number of Windows users running Blender, Adobe, or some other graphics program that uses GPU rendering is quite large.

  • @tad2021
    @tad2021 Před 11 měsíci +8

    If you didn't know, in A1111 change the RNG source from GPU to CPU and the optimizer to sdp-no-mem. That should make the differences between runs on different GPUs as small as possible.
    Using xformers on CUDA can be faster (sdp on pytorch2 has mostly caught up), but the output isn't deterministic.
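
    The idea behind those two settings, sketched in plain PyTorch (an assumption about what A1111 does under the hood, not its actual code): draw the seed noise on the CPU so every GPU starts from identical bytes, and force the plain math SDP backend so vendor-specific fused attention kernels don't introduce differences.

        import torch
        import torch.nn.functional as F

        # Same seed on the CPU gives bit-identical starting noise on any GPU.
        gen = torch.Generator(device="cpu").manual_seed(1234)
        latents = torch.randn(1, 4, 64, 64, generator=gen).to("cuda")

        # Disable the fused flash / memory-efficient backends, keep the math one.
        q = k = v = torch.randn(1, 8, 77, 64, device="cuda")
        with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                            enable_mem_efficient=False,
                                            enable_math=True):
            out = F.scaled_dot_product_attention(q, k, v)
        print(latents.shape, out.shape)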

  • @bennett5436
    @bennett5436 Před 11 měsíci +7

    please do 'tech tubers by Balenciaga' next

  • @P0WERCosmic
    @P0WERCosmic Před 6 měsíci +1

    ROCm 6.0 just dropped today! Would love for you, Wendell, to do an update on this video to show off all the advancements with 6.0 and whether there are any noticeable performance bumps 🙏

  • @MaxHaydenChiz
    @MaxHaydenChiz Před 11 měsíci +40

    It'd be easier to get students experienced with AMD hardware and get open source support for it, if RDNA had more compatibility with CDNA / better performance parity against NVidia hardware.
    Students and hobbyists aren't spending $10k+ on this kind of stuff.

    • @nexusyang4832
      @nexusyang4832 Před 11 měsíci +16

      Yeah, the fact that someone can walk into Best Buy, get a prebuilt, download the CUDA SDK and learn says a lot about how easily and affordably someone can get into AI/ML. If AMD can do the same for their consumer/gaming hardware then that would be a big game changer.

    • @levygaming3133
      @levygaming3133 Před 11 měsíci +15

      @@nexusyang4832exactly. There’s a lot of hand wringing about all the various things Nvidia does to needlessly segment their lineup, and that’s all well and good, but that’s not at all what CUDA is.
      CUDA’s advantage is that it’s the same CUDA whether you have an MX iGPU replacement, the same CUDA that’s in the old Nvidia GPU that you’re replacing (assuming you have an Nvidia GPU, obviously), and it’s the very same CUDA that’s in last year’s laptops, this year’s laptops, and is certainly going to be in next year’s laptops.
      It’s not like AMD makes CDNA laptops, and that’s kinda the point.

    • @nexusyang4832
      @nexusyang4832 Před 11 měsíci +1

      @@levygaming3133 You're spitting facts. 👍👍👍👍

    • @steve55619
      @steve55619 Před 11 měsíci

      Excuse me??? Lol

    • @mytech6779
      @mytech6779 Před 11 měsíci +2

      Hobbyist/student stuff doesn't need performance parity with CDNA.
      What it needs is ease of access (available as a standard feature on commonly available consumer-priced cards, without hobbling); similarity of interface across products, both for the user and for software portability between consumer stuff and CDNA; and performance that is good enough to not be frustrating.
      Reasonable Linux support is also needed. Linux may only make up 2% of total desktops, but Ditzy Sue and Joe Sixpack aren't GPU-compute hobbyists, so total desktops is the wrong stat; in reality Linux is closer to 50% or more of the relevant market segments.

  • @joshxwho
    @joshxwho Před 11 měsíci +2

    Thank you for producing this content. As always, incredibly interesting

  • @AndreiNeacsu
    @AndreiNeacsu Před 11 měsíci +14

    I am really happy that Ryzen paid off. In 2017 I was one of the earliest adopters who pre-ordered two Ryzen 1700 (non-X) systems with X370 boards; and I never pre-order stuff, did not before and have not since. Now AMD is a proper force for innovation and competition in both the CPU and GPU spaces, for consumers and datacenters. Also, Intel Arc seems to become more interesting by the day. Got an Acer A770 16GB as a curiosity at the start of this year and I still haven't reached a final conclusion about it; it seems like every second driver update makes things better.

    • @flamingscar5263
      @flamingscar5263 Před 11 měsíci +8

      Yea, it's honestly good Ryzen happened, because there were reports they were on the road to bankruptcy.
      All of this is thanks to Lisa Su, she really saved AMD

    • @peterconnell2496
      @peterconnell2496 Před 11 měsíci +2

      Well done. Therein lies a tale many of us would like to hear. The buying decision in the market of the day? The cost of an 8 core Intel vs AMD back then, for example? Let's not forget what a classic the 1600 proved to be.

    • @MatthewSwabey
      @MatthewSwabey Před 11 měsíci +5

      According to two senior AMD tech folks Zen was designed because they had to! bulldozer/etc. was a failure. Originally they aimed for 70% of Intel performance for 50% of the price, but then TSMC's silicon just kept getting better and Intel stopped innovating. [I had the chance to talk to some senior AMD tech folks when they were recruiting on campus and they were surprised how great Zen turned out too!]

  • @mrfilipelaureanoaguiar
    @mrfilipelaureanoaguiar Před 11 měsíci +5

    That M.2 card scanning multiple 4K videos to check for a chosen shape... really nice what it can process and check at that size without cooling on it. As long as it's detected...

  • @steve55619
    @steve55619 Před 11 měsíci

    Thanks for this video, this field is moving so quickly it's really hard to keep up to date on the latest advancements, let alone the current status quo

  • @usamaizm
    @usamaizm Před 11 měsíci +3

    I think the subtleties shouldn’t be an issue.

  • @spuchoa
    @spuchoa Před 11 měsíci

    Great video Wendell! This is good for the market; let's hope that the prices adjust in the next 12 months.

  • @reto
    @reto Před 11 měsíci +6

    Got SD A1111 to work on an RX 6500 XT and an Arc A770. But I wasn't able to run it on Vega iGPUs. The A770 16GB crushed the 3060 12GB I usually use.

    • @littlelostchild6767
      @littlelostchild6767 Před 10 měsíci

      Hey, if you don't mind, could you please make a short test video on the A770? I'm thinking of getting one.

  • @marktackman2886
    @marktackman2886 Před 11 měsíci +2

    These videos empower my team to express ideas to upper management.

  • @SomeGuyInSandy
    @SomeGuyInSandy Před 11 měsíci +5

    Seeing those giant GPU modules gave me Pentium II flashbacks, lol!

  • @Alice_Fumo
    @Alice_Fumo Před 11 měsíci +1

    This is such a curious way to create spot the difference images.

  • @Owenzzz777
    @Owenzzz777 Před 11 měsíci +12

    You forgot to mention that George Hotz’s discussion started with his frustration with AMD GPUs. The so-called “open source” software isn’t so open. Look at the “open” FSR 2 repo: no one is reviewing public pull requests, so it’s used more as a marketing tool than as support for the OSS community.

    • @tstager1978
      @tstager1978 Před 11 měsíci +4

      They never said that FSR2 would be an openly developed project. They said it would be open source, meaning free access to the source code and the ability to modify it for your own needs. They never said they would accept pull requests from the public.

  • @post-leftluddite
    @post-leftluddite Před 11 měsíci +10

    Wendell....this is seriously important work. Making the alternative to what many see as the default choice observably feasible is crucial to easing the hesitancy many people have, and just like in anything else [under the clutches of capitalism] a defacto monopoly can only harm consumers/users.

  • @cedrust4111
    @cedrust4111 Před 11 měsíci +3

    Is ROCm supported on RDNA3 iGPUs?
    By that I mean, if one has a Minisforum UM790 Pro (with a Ryzen 9 7940HS), can that work?

  • @methlonstorm2027
    @methlonstorm2027 Před 11 měsíci

    I enjoyed this, thank you.

  • @jadesprite
    @jadesprite Před 11 měsíci +3

    But what I really want to know is, can I use it to TRAIN models too?? Esp on voice and faces, I don't want to upload my family's private data to a cloud service and potentially have them save it forever, I would only trust that locally.

  • @zachnilsson4682
    @zachnilsson4682 Před 11 měsíci +2

    I'm going to Argonne National Lab later this week. Let me know if you want to sneak into the new super computer there ;)

  • @mdzaid5925
    @mdzaid5925 Před 9 měsíci +2

    ROCm support is definitely needed on consumer grade hardware.
    - This will give AI students some experience with the AMD ecosystem.
    - Also, not all AI models run on the cloud. For local use, companies have to consider the available options, and currently that's only Nvidia.

  • @Stealthmachines
    @Stealthmachines Před 9 měsíci

    You're simply the best!

  • @shauna996
    @shauna996 Před 10 měsíci

    Thanks!

  • @Icureditwithmybrain
    @Icureditwithmybrain Před 11 měsíci +2

    Will ROCm permit me to leverage my AMD 7900 XTX for accelerating the locally executing personal AI LLM on my PC? Presently, it operates on my CPU, causing sluggish responses from the LLM.

  • @wsippel
    @wsippel Před 11 měsíci +12

    I run AI workloads on a 7900XTX. It's a bit of a headache sometimes, but it works. But there's so much performance left on the table. I recently played around with AMD's AITemplate fork, and it's really fast on RDNA. But it's also incomplete and unstable. Triton recently got lots of MFMA optimizations, no WMMA though. They're largely the same thing as far as I understand, except MFMA is Instinct, WMMA is Radeon. I think even most AMD engineers don't realize Radeon has 'Tensor Cores' now.

    • @whoruslupercal1891
      @whoruslupercal1891 Před 11 měsíci +2

      >They're largely the same thing as far as I understand
      Absolutely not. MFMA is a one-clock MMA for whatever matrix size; WMMA is just running wave64 over however many clocks on double the SIMD width.

    • @wsippel
      @wsippel Před 11 měsíci +1

      @@whoruslupercal1891 Maybe, but the instructions are mostly the same, no? And WMMA on RDNA3 is actually accelerated (CDNA2, CDNA3 and RDNA3 are the only three architectures supported by rocWMMA, so I assume previous RDNA chips simply didn't have an equivalent), so AMD should probably use those instructions wherever possible.

    • @whoruslupercal1891
      @whoruslupercal1891 Před 11 měsíci

      @@wsippel >but the instructions are mostly the same, no
      no.
      >CDNA2, CDNA3 and RDNA3 are the only three architectures supported by rocWMMA
      Yea but MFMA is different.

  • @ddnguyen278
    @ddnguyen278 Před 11 měsíci +9

    Kinda hard to build for determinism when your hardware does lossy stochastic compression on compute... Even multiple runs of the same data set wouldn't result in the same output on Nvidia. I suspect if they didn't do that they would be significantly slower.

  • @KeithTingle
    @KeithTingle Před 11 měsíci +1

    love these talks

  • @jpsolares
    @jpsolares Před 11 měsíci +1

    Is there a tutorial for AMD Instinct and Stable Diffusion? Thanks in advance.

  • @jp-ny2pd
    @jp-ny2pd Před 11 měsíci

    I always spun that technical difference as a "One is a more mature, but less complete offering." So then it became a question of what is good enough for their needs.

  • @dmoneyballa
    @dmoneyballa Před 11 měsíci +2

    Where do you find the model used? I can't find it on Hugging Face. The icantbelieveitsnotphotography safetensors, that is.

    • @wargamingrefugee9065
      @wargamingrefugee9065 Před 11 měsíci +2

      Maybe this, Google: civitai ICBINP - "I Can't Believe It's Not Photography". I'm downloading it now. Best of luck.

  • @outcastp23
    @outcastp23 Před 11 měsíci +1

    Thanks for the stock tip Wendell! I'm selling all my TSLA and buying up AMD stock.

  • @b.ambrozio
    @b.ambrozio Před 4 měsíci

    Well, why don't we have it on AWS or GCP? I'm really looking forward to seeing it.

  • @sinom
    @sinom Před 11 měsíci +1

    I was waiting for this video since the teardown came out

  • @tad2021
    @tad2021 Před 11 měsíci +2

    We've been using a lot of TPUs the past few months. It's such a weird platform with interesting self-imposed bottlenecks, and it doesn't help that Google will suddenly reboot or take down our nodes for maintenance at least once (often more) every few days without any warning.

  • @dholzric1
    @dholzric1 Před 11 měsíci +1

    Is there any way to get the new version of rocm to work with the mi25?

  • @AI-xi4jk
    @AI-xi4jk Před 11 měsíci +8

    Appreciate the work you’ve put into this, Wendell. I think AMD needs to support not only frameworks like TF and Torch but also model conversion from one framework/hw to another - basically the primitives mapping between systems.

  • @callowaysutton
    @callowaysutton Před 11 měsíci +1

    Did you get to test out running LLMs on these GPUs? I'd be curious how many tokens per second these bad boys can push out, especially since it seems like LLMs are going to be a main point of interest for AI companies for at least the next 1-3 years.

  • @DOGMA1138
    @DOGMA1138 Před 11 měsíci +1

    I'm pretty sure you're running torch with cu117 or older; the numbers are about 70% lower than what an A100 puts out with these settings on cu118. If you just did a pip install from the default repo, it's cu117.
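
    Checking which build is installed takes two lines; the version strings in the comments are examples of what a cu117 vs cu118 wheel reports:

        import torch

        print(torch.__version__)   # e.g. "2.0.1+cu117" from the default index, "2.0.1+cu118" from the cu118 index
        print(torch.version.cuda)  # "11.7" or "11.8"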

  • @cromefire_
    @cromefire_ Před 11 měsíci +1

    One big problem for Google was that you only get full TPUs in Google Cloud, otherwise it'd be pretty different.

  • @CattoRayTube
    @CattoRayTube Před 11 měsíci

    Big fan of Evelon Techs

  • @VFPn96kQT
    @VFPn96kQT Před 11 měsíci +5

    Hopefully SYCL will abstract platform specific APIs like ROCm/CUDA etc.

    • @mytech6779
      @mytech6779 Před 11 měsíci +2

      I used to think that, but realized I'll grow grey waiting on a decent implementation. SYCL seems to be stuck in some quasi-proprietary limbo with a company that won't or can't make it widely available.

    • @VFPn96kQT
      @VFPn96kQT Před 11 měsíci +1

      @@mytech6779 The most popular SYCL implementations are #OpenSYCL and #DPC++. Both are open source and work on many different architectures. What do you mean, "stuck in quasi-proprietary limbo with a company"?

  • @PramitBiswas
    @PramitBiswas Před 11 měsíci +1

    Open standards for ML (read TF) kernel API will help massively to achieve cross-hardware support.

  • @doppelkloppe
    @doppelkloppe Před 11 měsíci +1

    Are the differences in the images really due to different precision levels in the hardware, or is it (also partly) due to limited determinism and reproducibility? After all, you're not guaranteed to get the same image twice, even when using the same seed and hardware.
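
    That part can at least be controlled for. A sketch of the usual reproducibility knobs in PyTorch (what A1111 actually exposes may differ):

        import torch

        torch.manual_seed(42)
        torch.use_deterministic_algorithms(True, warn_only=True)  # warn instead of erroring on non-deterministic ops
        torch.backends.cudnn.benchmark = False                    # no autotuned kernel selection between runs

        gen = torch.Generator(device="cpu").manual_seed(42)
        a = torch.randn(1, 4, 64, 64, generator=gen)
        gen.manual_seed(42)
        b = torch.randn(1, 4, 64, 64, generator=gen)
        print(torch.equal(a, b))  # True: same seed, same CPU-side noise, on any machine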

  • @Timer5Tim
    @Timer5Tim Před 11 měsíci +12

    As nice as it is and as cool as it is, I expect ROCm for windows and Half Life 3 to come out on the same day.....

  • @jordanmccallum1234
    @jordanmccallum1234 Před 11 měsíci +3

    The promise of ROCm is huge, but better hardware support and better communication about what is and what is intended to be supported is needed. I had to buy a GPU a few years back, and I really wanted an AMD GPU for the Linux drivers, but I needed Tensorflow capability for university. ROCm existed, but there was barely any documentation about what was supported, nothing on what they intended to support, and no timeline for software development, so I got a 2080.
    I remember roughly at the same time, AMD were touting that "you don't need to buy an Instinct to do datacenter compute", but how is "datacenter compute is locked to Tesla" any different from "there is no software support for Radeon" when you want to get real work done *now*?

    • @leucome
      @leucome Před 11 měsíci +2

      Better communications for sure. One of the main issues is that the list they provide is not about the GPUs that work with ROCm but about the GPUs AMD offers support for. It is totally useless for people who want to know which GPUs will actually run it. As far as I know, just about all AMD GPUs since Vega are already working, even if AMD doesn't offer official "support".

  • @MrMaximiliansa
    @MrMaximiliansa Před 11 měsíci +6

    Very interesting!
    Do you know why Stable Diffusion seems to use so much more VRAM on the MI210 than on the A100?

    • @Level1Techs
      @Level1Techs  Před 11 měsíci +5

      Maybe related to the accuracy stuff? I'm not sure tbh

  • @SamGib
    @SamGib Před 11 měsíci +7

    If AMD wants to get popular, they need to support their consumer grade GPUs in ROCm. And also the used market.

  • @vsz-z2428
    @vsz-z2428 Před 11 měsíci +1

    thoughts on opensycl?

  • @samghost13
    @samghost13 Před 10 měsíci +1

    could you use those AI parts on Ryzen? I think it is a Notebook CPU

  • @apefu
    @apefu Před 11 měsíci

    This some guuud video!

  • @SlinkyBass0815
    @SlinkyBass0815 Před 4 měsíci

    Hi,
    I would like to get started with ML and currently do have 2 offers for graphics card.
    RX 6800 16 GB and RTX 4060 8 GB
    Do you know if the 6800 would be suitable for getting started or is it better to use the 4060?
    Thank you in advance!

  • @VegetableJuiceFTW
    @VegetableJuiceFTW Před 10 měsíci +1

    LLMs, please next!

  • @aacasd
    @aacasd Před 11 měsíci +1

    any benchmarks with AMD Ryzen AI?

  • @anarekist
    @anarekist Před 11 měsíci +1

    Aw, was hoping to use ROCm on my 6800xt

    • @leucome
      @leucome Před 11 měsíci

      Try it... I bet it will work. My 6700xt and 7900xt work fine with ROCm, so I guess the 6800xt will work too.
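
      For cards that aren't on AMD's official list (the 6700 XT, for example), the widely used community workaround is to report a supported gfx1030 target; this is unofficial and based on user reports rather than AMD documentation:

          import os

          # Must be set before the HIP runtime initializes, i.e. before importing torch.
          os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

          import torch
          if torch.cuda.is_available():
              print("ROCm sees:", torch.cuda.get_device_name(0))
          else:
              print("no GPU detected")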

  • @dearheart2
    @dearheart2 Před 10 měsíci

    I am in AL and never have access to the newest HW. Damn ...

  • @wecharg
    @wecharg Před 9 měsíci

    Best in the world at what he does ^

  • @Cadambank
    @Cadambank Před 3 měsíci

    With the new release of ROCm 6.0 can we revisit this topic?

  • @skilletpan5674
    @skilletpan5674 Před 11 měsíci +1

    There is a fork of Automatic that supports AMD. It's in the main project readme, or a Google away. It seems they randomly decided to drop ROCm support for some older cards a few months ago: RX 5xx isn't supported and I think Vega was also dropped.

  • @stuartlunsford7556
    @stuartlunsford7556 Před 11 měsíci +6

    AMD's FP64 cores are great, but they still need more dedicated AI silicon, preferably integrated on the same package.

  • @Fractal_32
    @Fractal_32 Před 11 měsíci +1

    I’m glad to be an AMD shareholder, although I guess I might grab a few more shares just in case. (My AMD shares have made a killing so far especially off this AI hype bubble.)

  • @ATrollAssNigga
    @ATrollAssNigga Před 11 měsíci +3

    As a 7940HS @ 90W owner, I wonder how the built-in AI processing compares to that M.2 card. I need to test it.

  • @EvanBurnetteMusic
    @EvanBurnetteMusic Před 11 měsíci +2

    Would love a better explanation for why the math is different. It could be that floating point math is not associative - that is, (A * B) * C does not equal A * (B * C). Optimizing compilers sometimes change the order of operations in the name of speed.
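
    A two-line Python illustration of that reordering effect, with values chosen to make the rounding obvious:

        a, b, c = 1e16, -1e16, 1.0
        print((a + b) + c)  # 1.0
        print(a + (b + c))  # 0.0 -- the 1.0 is lost when it is added to -1e16 first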

    • @Level1Techs
      @Level1Techs  Před 11 měsíci +2

      developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/ - mixed precision instead of full-fat fp64. Usually the mantissa is not as many bits. That's why fp64 runs at a different compute rate than "fp64" for AI.
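
      A small illustration of what fewer mantissa bits does to the very same value, assuming a recent PyTorch build:

          import torch

          x = 1.0 / 3.0
          for dt in (torch.float32, torch.bfloat16, torch.float16):
              print(dt, torch.tensor(x, dtype=dt).item())
          # torch.float32  0.3333333432674408
          # torch.bfloat16 0.333984375
          # torch.float16  0.333251953125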

    • @EvanBurnetteMusic
      @EvanBurnetteMusic Před 11 měsíci

      @@Level1Techs My first thought was that the AMD card was using FP32 instead of bfloat16, but I googled and it looks like bfloat16 has been supported since the MI100. Perhaps the port isn't using bfloat16 yet?

  • @floodo1
    @floodo1 Před 11 měsíci

    fascinating

  • @WhhhhhhjuuuuuH
    @WhhhhhhjuuuuuH Před 11 měsíci +3

    This is really interesting. I want to know how a 4090 vs a 7900 XTX compares for these workflows. I know both are consumer products, but I feel at the top end the line is blurred.

  • @Dallen9
    @Dallen9 Před 11 měsíci +2

    Pausing the video at 11:37: if AMD is on the left and Nvidia is on the right, AMD is running the better algorithm. The smartphone in DeVito's hand isn't merging with the spoon and he has one button on his collar instead of two. It might have taken longer, but the image looks more natural, which is kind of nuts.

  • @DSDSDS1235
    @DSDSDS1235 Před 11 měsíci +1

    To be honest, you suggested that ROCm went from can't-train-shit to can't-train-shit, and training is what Nvidia specialises in. There are more inference startups dying each day than MI200s and MI300s combined shipped that day, and every vendor is coming up with their own inference chip. Why would AWS offer the MI200 or MI300 when they can offer their own Inf1 and abstract any software difference under ML frameworks? And if they do, why would anyone use that instead of Inf1, or better yet, building their own?

  • @shrek22
    @shrek22 Před 9 měsíci

    How will the W7900 with 48GB compare to the MI210?

  • @MrBillythefisherman
    @MrBillythefisherman Před 10 měsíci +1

    Where is the Microsoft DirectX-style layer that sits on top of the GPUs and makes ML vendor-agnostic (even if it makes it OS dependent)? If you don't like the OS-specific DirectX API then swap in the Vulkan API. I've heard of DirectCompute and OpenCL but they don't seem to have gained traction - why? Also, why is ROCm needed when you have those APIs - what is it that makes CUDA compete against all of the above?

  • @mr.selfimprovement3241
    @mr.selfimprovement3241 Před 11 měsíci +1

    ......I will never look at Danny DeVito the same again. 😱😳😂

  • @sayemprodhanananta144
    @sayemprodhanananta144 Před 11 měsíci

    training performance...?

  • @SirMo
    @SirMo Před 11 měsíci +1

    Open Source > Proprietary Vendor Lock-ins

  • @willz81
    @willz81 Před 11 měsíci +1

    Does ROCm work with Radeon 7900 series cards now?

    • @leucome
      @leucome Před 11 měsíci

      Yes... I use a 7900xt+ROCm for generating image with A1111.

  • @zherkohler4188
    @zherkohler4188 Před 10 měsíci +1

    Are you sure that the visual differences are because of the different hardware? Is Xformers disabled? I think it should be disabled for a test like this. I think it would explain the visual differences.

  • @Artificial-Insanity
    @Artificial-Insanity Před 9 měsíci +1

    The differences in the images stem from you using an ancestral sampler, not from the GPU you're using.

  • @zen.mn.
    @zen.mn. Před 10 měsíci +1

    "I can't believe it's not CUDA" dead

  • @WiihawkPL
    @WiihawkPL Před 10 měsíci

    Working in OpenGL for a long time, I've come to sum it up as Nvidia playing it fast and loose and AMD being more accurate. And then there's Mesa, which is as close to a reference implementation as you'll get.

  • @eddietoro2682
    @eddietoro2682 Před 11 měsíci +1

    Came here for the tech, stayed for the Danny DeVito AI memes

  • @bartios
    @bartios Před 11 měsíci +3

    Hi Wendell, have you been following the Tinygrad stuff and their troubles with ROCm at all? They look like they have some real work™ they'd like to be able to use AMD for in ML so I think it would be interesting for you to check out.

    • @Level1Techs
      @Level1Techs  Před 11 měsíci +6

      Someone didn't watch to the end of the video ;)

    • @bartios
      @bartios Před 11 měsíci +5

      @@Level1Techs whoops sorry, don't have the time to watch rn and I know the best time to get an answer is in the first couple hours so I did a stupid

  • @shieldtablet942
    @shieldtablet942 Před 11 měsíci +2

    AMD keeps dropping old GPUs from ROCm. RDNA has been ignored forever; not even OpenCL worked at launch with the regular drivers. So there will be little uptake while Nvidia still has something that performs OK at the lower end.
    Gaudi 2 is also looking OK, and Intel seems committed to having the software running on potatoes.

  • @davtech
    @davtech Před 6 měsíci

    We need an update for ROCm 6.0 and RDNA3

  • @grimtagnbag
    @grimtagnbag Před 11 měsíci

    Run a GPT selfhosted instance

  • @Zoragna
    @Zoragna Před 11 měsíci +1

    Non-"PhD students at Oak Ridge" I love that

  • @paulwais9219
    @paulwais9219 Před 11 měsíci +1

    The demo is for inference, but training is the key advantage for Nvidia. AMD needs to get compute cards out at gamer-card scale in order for that software support to level out. That's why Ponte Vecchio and TPUs are DOA as consumer products.
    But let's suppose AMD does catch up on the desktop. For mobile, Apple, Google and Samsung own their own stacks. For robotics, Nvidia already has Jetson. The market beyond the desktop would need to be big for AMD to really be able to invest and nail AI.

  • @SamGib
    @SamGib Před 11 měsíci +1

    Unless Google sells TPUs for enterprises to host themselves, I don't think there will be any large scale adoption for use in consumer products. See, OpenAI trained their model on GPUs; it's best to assume that's Nvidia hardware.

  • @sailorbob74133
    @sailorbob74133 Před 11 měsíci

    What'll be interesting will be the MI300C - which will be all CPU chiplets - and Turin-AI with Xilinx AI chiplets... MLID has a video about it. A dual socket version could have more TOPS than an H100.

    • @samlebon9884
      @samlebon9884 Před 10 měsíci

      I imagined AMD would develop that kind of chip. I even named it MI300AI.
      Could you provide a link to the MI300C?

    • @sailorbob74133
      @sailorbob74133 Před 10 měsíci

      @@samlebon9884 There's a very reliable rumor channel I've tracked for a few years called Moore's Law is Dead, which spoke about the MI300C chip that is all CPU chiplets with HBM3, and a separate AMD project called Turin-AI, which is a mix of Zen 5 chiplets together with Xilinx AI chiplets on a single package and which in a 2P config would be about as powerful as an H100.

  • @GldisAter
    @GldisAter Před 11 měsíci +2

    EVELON TECHS is going to be a new channel?

    • @levygaming3133
      @levygaming3133 Před 11 měsíci +1

      Assuming that wasn’t a joke, it’s just level one techs but windows or whatever is cutting off either side of the wall paper. L]evelOn[e techs.
      I only recognized because I’ve seen the full picture in one of the other vids, and even then it tripped me up a little.

    • @GldisAter
      @GldisAter Před 11 měsíci +1

      @@levygaming3133 The full picture is on the monitor to the right.

  • @djmccullough9233
    @djmccullough9233 Před měsícem

    Stable Diffusion runs really nicely on my 6800xt. Zluda cuda.