Why Writing SIMT (GPU) Assembly is Hard

  • Added 19 Jun 2024
  • Modern GPUs have Single-Instruction Multiple Thread (SIMT) architectures, and writing assembly code for them is very hard. In this video Hans explains why...
    Click the following link for a summary:
    keasigmadelta.com/blog/why-wr...
  • Science & Technology

Comments • 19

  • @mrshodz • 2 months ago • +1

    Interesting video.

  • @user-mf9qw3cp9f • 3 years ago • +2

    THANK YOU a LOT !!!!!!!!!!!

  • @manueljenkin95 • 2 years ago • +1

    Thank you very much. This was very insightful. I might still explore NVIDIA's PTX when possible. (I also think SPIR-V has a format that closely resembles assembly, much as LLVM IR does for general-purpose machines.)

    • @KeaSigmaDelta • 2 years ago • +1

      You're welcome. I'm glad you found it insightful.
      Speaking as someone who has written a SPIR-V parser: SPIR-V is a low-level binary file format, and definitely not something you'd want to program in.

    • @manueljenkin95 • 2 years ago

      @@KeaSigmaDelta I have some issues with CUDA's restrictions and was wondering if there are ways to work around them. I basically want the equivalent of a large constant memory that is accessed and cached in chunks of 8 kB (the typical size) without having to relaunch the kernel and redo the shared-memory load.
      Basically, I want explicit control over the movement of data from constant memory into the GPU's cache memory (used for the convolution operations etc.).
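
(A minimal CUDA sketch of the workaround this question usually points to, under stated assumptions: CUDA's __constant__ space is limited to 64 KB, so large read-only tables are normally kept in global memory and staged into shared memory in tiles inside the kernel's own loop, rather than relaunching. The kernel, the TILE size, and all names below are illustrative, not code from the video or this thread; CUDA still offers no way to address the constant cache explicitly.)

```cuda
#include <cuda_runtime.h>

#define TILE 2048  // 2048 floats = 8 KB, matching the chunk size mentioned above

// Illustrative convolution-style kernel: stream a large read-only table
// through shared memory in 8 KB tiles without relaunching the kernel.
__global__ void convolve(const float* __restrict__ table, size_t tableLen,
                         const float* __restrict__ in, float* out, size_t n)
{
    __shared__ float tile[TILE];

    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (size_t base = 0; base < tableLen; base += TILE) {
        // Cooperative load of the next 8 KB chunk into shared memory.
        for (size_t t = threadIdx.x; t < TILE && base + t < tableLen; t += blockDim.x)
            tile[t] = table[base + t];
        __syncthreads();

        if (i < n)
            acc += tile[threadIdx.x] * in[i];  // placeholder for the real convolution math
        __syncthreads();  // don't overwrite the tile while other threads still read it
    }

    if (i < n)
        out[i] = acc;
}
```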

    • @KeaSigmaDelta • 2 years ago • +1

      @@manueljenkin95 I'm not familiar enough with CUDA or NVIDIA's architecture to know whether that's possible. What the drivers do also matters: for example, the shared-memory load could actually retrieve the data from a cache, unless the driver clears the caches between kernel runs (in which case it will always have to fetch from memory on the initial load).

    • @manueljenkin95 • 2 years ago

      @@KeaSigmaDelta Thank you. In my case I'm sure it does a reload: the dataset is large, so the data gets overwritten by new data before the next iteration or process that needs the same data.

  • @darkengine5931 • 3 years ago

    One thing I've always wondered about is the general performance cost of branching on GPUs. I'm hardly an expert there, but in our architecture, to minimize branching, we write a separate fragment shader for each type of material and render each material type in a separate forward pass. Yet that imposes some rather heavy costs on the CPU side: we have to group triangle primitives by material ID and keep that grouping in sync with our geometry data structures, which can show up as a hotspot in some operations (any kind of editing of mesh topology tends to be quite expensive for us, like inserting and removing polygons on the fly or loading a brand-new mesh on the fly).
    So I've often wondered if it might be cheaper, even though I hear branching on the GPU is quite complicated and expensive, to just store material IDs alongside our primitives (or as vertex attributes) and do one massive switch on the material ID/index in a single monolithic uber fragment shader generated from all the available materials. I've done some personal tests with promising results, but I've held back from applying it on a large scale because of all the advice to minimize branching in shaders.
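
(A hedged sketch of the "one big switch" idea, written in CUDA rather than GLSL since the divergence behaviour is the same on SIMT hardware: if every thread in a warp/wavefront sees the same material ID, only that one case executes; mixed IDs within a warp make the taken cases run one after another. The material IDs and shading functions are hypothetical stand-ins, not the poster's actual shaders.)

```cuda
#include <cuda_runtime.h>

// Hypothetical material IDs and shading stubs (illustrative only).
enum MaterialId { MAT_DIFFUSE = 0, MAT_METAL = 1, MAT_GLASS = 2 };

__device__ float shadeDiffuse(float x) { return 0.8f * x; }
__device__ float shadeMetal(float x)   { return 0.5f * x + 0.5f; }
__device__ float shadeGlass(float x)   { return 0.2f * x; }

// "Uber-shader" style kernel: a single switch on a per-element material ID.
// If every thread in a warp/wavefront reads the same ID, only that case runs;
// if IDs are mixed within a warp, the taken cases execute one after another
// (divergence), which is the cost being weighed in the comment above.
__global__ void shadeAll(const int* matId, const float* input,
                         float* output, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    switch (matId[i]) {
        case MAT_DIFFUSE: output[i] = shadeDiffuse(input[i]); break;
        case MAT_METAL:   output[i] = shadeMetal(input[i]);   break;
        case MAT_GLASS:   output[i] = shadeGlass(input[i]);   break;
        default:          output[i] = 0.0f;                   break;
    }
}
```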

    • @KeaSigmaDelta • 3 years ago • +2

      It's hard to say for certain because GPUs have many different bottlenecks. I'd err on the side of sending large vertex arrays to the GPU, because one bottleneck is how many draw ops/s can be handled; functions like glMultiDrawElementsIndirect() and newer APIs like Vulkan have increased that number. Also, if all threads in a wavefront execute the same path, then only one path needs to be executed. And SIMT architectures can switch wavefronts during pipeline stalls, which increases throughput.
      All in all, measuring it (on multiple graphics cards) is the only way to really know which strategy works best.
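
(One hedged way to exploit the "whole wavefront takes the same path" point above: sort the work by material ID on the device before launching, so neighbouring threads, and therefore whole warps, mostly land on the same switch case. This sketch uses Thrust and assumes the hypothetical shadeAll kernel from the previous sketch is visible in the same .cu file; the results come back in sorted order, so a real version would also carry an index to scatter them back.)

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Host-side helper: reorder work items by material ID so warps branch
// uniformly, then run the uber-shader style kernel from the sketch above.
void shadeSorted(thrust::device_vector<int>& matId,
                 thrust::device_vector<float>& input,
                 thrust::device_vector<float>& output)
{
    int count = static_cast<int>(matId.size());
    if (count == 0) return;

    // Group equal material IDs together; input values travel with their IDs.
    thrust::sort_by_key(matId.begin(), matId.end(), input.begin());

    int block = 256;
    int grid  = (count + block - 1) / block;
    shadeAll<<<grid, block>>>(thrust::raw_pointer_cast(matId.data()),
                              thrust::raw_pointer_cast(input.data()),
                              thrust::raw_pointer_cast(output.data()),
                              count);
}
```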

    • @darkengine5931 • 3 years ago

      @@KeaSigmaDelta One of the tricky things I've found is that it's difficult to get a range of GPU hardware representative of the full variety our users might have. I developed a bit of an allergy to GPU programming around the early 2000s, not so much from doing it myself (I did the bare minimum) as from watching my colleagues deal with very painful issues that would only show up on one end user's machine... as well as very fragile GLSL shaders that would work on one type of hardware but not the next (back then, code that worked on NVIDIA cards seemed especially prone to failing on ATI/AMD or vice versa, and onboard GPUs were particularly problematic for us). We frequently got issues back then that the entire team couldn't reproduce; we'd suggest the end user update their drivers, to no avail. It doesn't seem as perilous these days, but I picked up a bit of a fear of GPU programming from watching my colleagues back then. Another side of me wants to get into it, since I realize the sheer parallel power of these things when it comes to number crunching, but if I had my way, I'd write a tiny bit of GPU code (definitely not in assembly), ship the product and get it tested by a boatload of people, then write a little more and ship again. :-D
      Tying into your video about appreciating bug reports, I've got a bit of that fear when it comes to GPU code (and newer versions of SSE too). It's not really about the bug reports. My worst fear in programming is creating a bug none of us can reproduce, whether it's the most obscure race condition or just code that doesn't work on one very specific type of hardware. I've had to chase down a handful of those in my career, and they were enough to give me nightmares afterwards (some took whole weeks of stabbing blindly in the dark, bouncing builds to the one user who could reproduce the issue, to pin down).

    • @KeaSigmaDelta • 3 years ago • +2

      @@darkengine5931 Yes, hardware-specific problems with GPU programming still exist, although things have improved a lot since the early 2000s. This is where having more testers helps, because buying all those different GPUs (plus the motherboards and extras to host them all) gets expensive very quickly.
      A few silly differences I've encountered (via ShaderToy):
      - NVIDIA seems to zero-initialize uninitialized variables, but AMD doesn't. There are shaders that rely on this... **
      - I encountered a shader that relies on infinite-loop detection and mitigation in order to work.
      ** I've encountered this with CPU code too; for example, code that only works if the OS zeroes memory before handing it to a program.
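
(A tiny, hypothetical CUDA analogue of that first bug class; the original reports above were GLSL shaders on ShaderToy. The accumulator is never initialized, so the kernel only "works" where the hardware or driver happens to hand back zeroed state.)

```cuda
// Illustration of the uninitialized-variable bug class described above:
// `sum` starts with an indeterminate value, so this kernel produces the
// expected result only where registers/local memory happen to be zeroed.
__global__ void brokenSum(const float* in, float* out, int n)
{
    if (threadIdx.x != 0 || blockIdx.x != 0) return;

    float sum;                 // BUG: should be `float sum = 0.0f;`
    for (int i = 0; i < n; ++i)
        sum += in[i];          // first iteration reads an indeterminate value
    *out = sum;
}
```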

    • @darkengine5931 • 3 years ago • +1

      @@KeaSigmaDelta One of the things my colleagues on the GPU side tend to complain about is that NVIDIA, in general, seems to be more tolerant of code that would otherwise produce undefined behavior, like the zero-initialized variables you mentioned. We used to get the bulk of our GPU-related bug reports on the ATI/AMD side, and I naively thought that was because those GPUs were of poorer quality, but apparently it was more often because the NVIDIA GPUs/drivers weren't letting us detect mistakes on our end. Our shader devs now prefer AMD for primary GPU development (though they actually have multiple GPUs; the company buys them a variety), since code that works on AMD seems more likely to also work on NVIDIA than vice versa.