CppCon 2018: Jefferson Amstutz “Compute More in Less Time Using C++ Simd Wrapper Libraries”

  • Added Nov 13, 2018
  • CppCon.org
    -
    Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2018
    -
    Leveraging SIMD (Single Instruction Multiple Data) instructions is an important part of fully utilizing modern processors. However, using SIMD hardware features from C++ can be difficult, as it requires an understanding of how the underlying instructions work. Furthermore, there are not yet standardized ways to express C++ code that guarantee such instructions are used to increase performance effectively.
    This talk aims to demystify how SIMD instructions can benefit the performance of applications and libraries, and to demonstrate how a C++ SIMD wrapper library can make it much easier for programmers to write efficient, cross-platform SIMD code. While one particular library will be used to demonstrate elegant SIMD programming, the concepts shown are applicable to practically every C++ SIMD library currently available (e.g. boost.simd, tsimd, Vc, dimsum, etc.), as well as the proposed SIMD extensions to the C++ standard library.
    Lastly, this talk will also seek to unify the greater topic of data parallelism in C++ by connecting the SIMD parallelism concepts demonstrated to other expressions of parallelism, such as SPMD/SIMT parallelism used in GPU computing.
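    As a taste of the wrapper-library style covered in the talk, here is a minimal saxpy sketch using the proposed std::experimental::simd interface (an illustrative sketch, not code from the talk; it assumes n is a multiple of the vector width and a compiler that ships <experimental/simd>, e.g. GCC 11+):

        #include <cstddef>
        #include <experimental/simd>
        namespace stdx = std::experimental;

        using vfloat = stdx::native_simd<float>; // width chosen by the target ISA

        // saxpy: y[i] = a * x[i] + y[i], assuming n % vfloat::size() == 0
        void saxpy(float a, const float* x, float* y, std::size_t n) {
          for (std::size_t i = 0; i < n; i += vfloat::size()) {
            vfloat vx(x + i, stdx::element_aligned); // wide load
            vfloat vy(y + i, stdx::element_aligned);
            vy = a * vx + vy;                        // one vector multiply-add
            vy.copy_to(y + i, stdx::element_aligned);
          }
        }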
    -
    Jefferson Amstutz, Software Engineer
    Intel
    Jeff is a Visualization Software Engineer at Intel, where he leads the open source OSPRay project. He enjoys all things ray tracing, high performance computing, clearly implemented code, and the perfect combination of git, CMake, and modern C++.
    -
    Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments • 30

  • @AstralS7orm
    @AstralS7orm 5 years ago +7

    As for GPU kernels vs CPU kernels, the difference is in the relative cost of memory operations compared to register calculation speed, as well as the size of the register file. GPUs tend to have an order of magnitude faster calculation, while memory is on par or slower due to relatively smaller per-thread caches - so you have to be even more sparing with memory bandwidth.
    Also, GPUs prefer bigger block operations than CPUs due to memory/cache architecture. That's about it.

  • @GeorgeTsiros
    @GeorgeTsiros 4 years ago +4

    28:50 From what I understand, trig functions are available only with AVX-512, which exists only on a few Xeons and, so far, very few consumer-grade CPUs?

  • @antoningavrel2808
    @antoningavrel2808 5 years ago +3

    Such an interesting topic!

  • @xarcaz
    @xarcaz 5 years ago +12

    Great talk, many thanks! +1

  • @MindGameArcade
    @MindGameArcade 3 years ago +2

    Great introductory talk!

  • @max0x7ba
    @max0x7ba 5 years ago +4

    Very good information, thank you.
    The examples could be a bit more realistic. Neural networks use the fundamental linear transformation Ax+b (A is a matrix, x and b are vectors); 3D graphics uses vectors of {x,y,z,w} (w is needed for transforms and perspective projection) along with 4x4 transform matrix multiplications.
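    For instance, a rough sketch of that 4x4 point transform in SoA form, one SIMD lane per point (Mat4 is a hypothetical row-major type, shown with the std::experimental::simd vocabulary):

        #include <experimental/simd>
        namespace stdx = std::experimental;
        using vfloat = stdx::native_simd<float>;

        struct Mat4 { float m[4][4]; }; // hypothetical row-major 4x4 matrix

        // Transform vfloat::size() points {x, y, z, 1} at once; each matrix
        // entry is broadcast across all lanes.
        void transform(const Mat4& A, vfloat& x, vfloat& y, vfloat& z) {
          vfloat tx = A.m[0][0] * x + A.m[0][1] * y + A.m[0][2] * z + A.m[0][3];
          vfloat ty = A.m[1][0] * x + A.m[1][1] * y + A.m[1][2] * z + A.m[1][3];
          vfloat tz = A.m[2][0] * x + A.m[2][1] * y + A.m[2][2] * z + A.m[2][3];
          x = tx; y = ty; z = tz;
        }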

  • @zhaoli2984
    @zhaoli2984 5 years ago

    cool stuff

  • @tc2241
    @tc2241 3 years ago +14

    “I can code faster in assembly” is the equivalent flex of “I can shift faster than your automatic”

    • @OpenGL4ever
      @OpenGL4ever 9 months ago +1

      No one says such a thing. What is said is "I can write code in assembly that runs faster than what you get from your high-level language compiler." And this can be true in some circumstances, because there are cases where a compiler, written generically to optimize a very wide field of cases, simply can't or isn't allowed to optimize: its generic approach must produce mathematically correct code for all cases, not only for some of them. So the compiler will not optimize in such a case, but a human can, because he also knows the details of the special case, such as its expected input and its limits. Thus he will be able to optimize for this specific case where the generic compiler cannot.
      In other, typically more complex cases, it can happen that the compiler is simply not advanced enough to optimize, because it doesn't see that an optimization is possible.
      In the first case there will never be a solution for the compiler, because the optimization cannot be mathematically guaranteed to be correct; that problem stays forever, no matter how advanced compilers become.
      The latter case is a matter of developmental stage: there is still room for improvement, and compilers can get better here.
      And of course we all know human limitations and why the compiler is better in most cases, so we don't have to talk about that; this is about a specialization where the programmer knows the ISA and the optimization tricks very well.
      Conclusion: in the end, you should know both your own limits and the limits of the compiler.

  • @ilnurKh
    @ilnurKh 5 years ago

    The examples don't include any handling of tails - the leftover elements when the array length isn't a multiple of the SIMD width.
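    One common way to handle the tail is to run full-width vector iterations and finish the remainder with a scalar loop - an illustrative sketch using the std::experimental::simd interface:

        #include <cstddef>
        #include <experimental/simd>
        namespace stdx = std::experimental;
        using vfloat = stdx::native_simd<float>;

        void scale(float a, float* x, std::size_t n) {
          std::size_t i = 0;
          // Full-width iterations over as many elements as fit evenly.
          for (; i + vfloat::size() <= n; i += vfloat::size()) {
            vfloat v(x + i, stdx::element_aligned);
            v *= a;
            v.copy_to(x + i, stdx::element_aligned);
          }
          // Scalar tail for the last n % vfloat::size() elements.
          for (; i < n; ++i)
            x[i] *= a;
        }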

  • @msqrt
    @msqrt 4 years ago

    10:46 AMD GPUs do exactly 64 floats side by side though, right? Well, it's a bit iffy whether you'd call that a SIMD register anyway.

    • @GrayOlson
      @GrayOlson 3 years ago +4

      AMD GPUs don't actually have literal 64-float SIMD units; they have SIMD units that operate on 16 floats, and then they pack four of those together and run them in lockstep to make the 64-float wavefront.

  • @Antagon666
    @Antagon666 1 year ago +1

    You don't have to modify your code at all with "vertical" vectorization... just apply SIMD to all operations and enjoy a free speed upgrade...
    Meanwhile, with horizontal vectorization you have to rewrite your code completely for ray tracing: handle pointers to materials, the reduction to the closest hit, and above all recursion, where the paths and step counts of each vectorized ray are very different.
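    Roughly, that "horizontal" rewrite forces the ray data itself into SoA packets, one SIMD lane per ray - an illustrative sketch (std::experimental::simd vocabulary, names hypothetical):

        #include <experimental/simd>
        namespace stdx = std::experimental;
        using vfloat = stdx::native_simd<float>;
        using vint   = stdx::rebind_simd_t<int, vfloat>; // int vector of equal width

        // Every field of a scalar Ray becomes a vector, one lane per ray.
        struct RayPacket {
          vfloat ox, oy, oz;  // origins
          vfloat dx, dy, dz;  // directions
          vfloat tHit;        // per-lane closest-hit distance
          vint   materialID;  // per-lane material index instead of a pointer
        };

        // The "reduction of closest hit" becomes a masked update; materialID
        // needs an equivalent masked assignment on the same lanes.
        inline void recordHit(RayPacket& p, vfloat t) {
          stdx::where(t < p.tHit, p.tHit) = t;  // assign only on closer lanes
        }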

  • @llothar68
    @llothar68 5 years ago +20

    Why don't we have an "AsmCon"? That could teach a few lessons to all the modern C++ hipsters.

    • @antoningavrel2808
      @antoningavrel2808 5 years ago

      Great idea Lothar!!

    • @totalermist
      @totalermist 5 years ago +17

      Wait, if the C++ guys are the hipsters - who attends PyCon? Or, even worse, JSConf?

    • @aaardvaaark
      @aaardvaaark 5 years ago +13

      Why doesn't your imaginary AsmCon have a comment about having a VerilogCon? That could teach a few lessons to all the ASM ingrates.
      (I'd put a smiley face emoticon here but I'm not hipster enough to know how to do that.)

    • @ThePC007
      @ThePC007 3 years ago +2

      @@aaardvaaark I love how you called it an emoticon and not an emoji. :)

  • @decayl
    @decayl 5 years ago +2

    What would be the benefit of using the library over just letting the compiler autovectorize code? Modern compilers are already doing a pretty good job at that.

    • @flob1920
      @flob1920 5 years ago +3

      Tried it a while ago; alignment doesn't seem to really be an issue on modern systems. If I write the loop right and use -O3, I end up with SIMD instructions.
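      For reference, the kind of loop that reliably autovectorizes looks something like this (illustrative; gcc reports success with -fopt-info-vec, clang with -Rpass=loop-vectorize):

          // No aliasing between the arrays (promised via __restrict), a plain
          // trip count, and straight-line arithmetic: gcc and clang vectorize
          // this at -O3 without any intrinsics or wrapper library.
          void scale_add(float* __restrict out, const float* __restrict in,
                         float a, int n) {
            for (int i = 0; i < n; ++i)
              out[i] = a * in[i] + out[i];
          }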

    • @lincolnsand5127
      @lincolnsand5127 4 years ago +8

      Modern compilers are not very good at auto-vectorization. I don't know where you got that from, but it's simply not true.

    • @empireempire3545
      @empireempire3545 2 years ago +2

      "Modern compilers are already doing a pretty good job at that." no they dont

  • @totalermist
    @totalermist 5 years ago +7

    36:46 "Saxpy is nonsense as well" - pardon me, but SAXPY is at the core of most artificial neural networks: input*weight + bias. Just sayin'.

    • @fahimp3
      @fahimp3 5 years ago

      guy has ego

    • @jeffersonamstutz
      @jeffersonamstutz 5 years ago +20

      Yup, not my best moment... in my head I was thinking more "this _particular_ SAXPY thing I wrote", which didn't come out right at all! Thanks for clarifying.

    • @max0x7ba
      @max0x7ba 5 years ago +3

      In AI, A is a matrix, not a scalar. Ax+b is not SAXPY.

    • @totalermist
      @totalermist 5 years ago

      @@max0x7ba Yes, that is technically correct. Local Response Normalisation (LRN) layers in CNNs, however, use straight up (S)AXPY in their implementation, so I should have been more specific indeed.

    • @max0x7ba
      @max0x7ba 5 years ago +1

      @@totalermist Maybe, but you cannot build a neural network just by using SAXPY. However, you can build a neural net just by using the fundamental Ax+b. Your claim that SAXPY is at the core of most neural nets is false.