CppCon 2018: Jefferson Amstutz “Compute More in Less Time Using C++ SIMD Wrapper Libraries”
- Published 13 Nov 2018
- CppCon.org
-
Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2018
-
Leveraging SIMD (Single Instruction, Multiple Data) instructions is an important part of fully utilizing modern processors. However, using SIMD hardware features from C++ can be difficult, as it requires an understanding of how the underlying instructions work. Furthermore, there are not yet standardized ways to write C++ that guarantee such instructions are used to improve performance effectively.
This talk aims to demystify how SIMD instructions can benefit the performance of applications and libraries, as well as demonstrate how a C++ SIMD wrapper library can greatly ease the task of writing efficient, cross-platform SIMD code. While one particular library will be used to demonstrate elegant SIMD programming, the concepts shown are applicable to practically every C++ SIMD library currently available (e.g. Boost.SIMD, tsimd, Vc, dimsum, etc.), as well as the proposed SIMD extensions to the C++ standard library.
Lastly, this talk will also seek to unify the greater topic of data parallelism in C++ by connecting the SIMD parallelism concepts demonstrated to other expressions of parallelism, such as SPMD/SIMT parallelism used in GPU computing.
-
Jefferson Amstutz, Software Engineer
Intel
Jeff is a Visualization Software Engineer at Intel, where he leads the open source OSPRay project. He enjoys all things ray tracing, high performance computing, clearly implemented code, and the perfect combination of git, CMake, and modern C++.
-
Videos Filmed & Edited by Bash Films: www.BashFilms.com
As for GPU kernels vs CPU kernels, the difference is in the relative cost of memory operations compared to register computation speed, as well as the size of the register file. GPUs tend to have an order of magnitude faster computation, while memory is on par or slower due to relatively smaller per-thread caches, so you have to be even more sparing with memory bandwidth.
Also, GPUs prefer bigger block operations than CPUs because of their memory/cache architecture. That's about it.
28:50 From what I understand, trig functions are available only on AVX-512, which exists only on a few Xeons and, so far, very few consumer-grade CPUs?
Such an interesting topic !
Great talk, many thanks! +1
Great introductory talk!
Thank you!
Very good information, thank you.
The examples could be a bit more realistic. Neural networks use the fundamental linear transformation Ax+b (A is a matrix, x and b are vectors), and 3D graphics use vectors of {x, y, z, w} (w is needed for transforms and perspective projection) along with 4x4 transform matrix multiplications.
cool stuff
“I can code faster in assembly” is the equivalent flex of “I can shift faster than your automatic”
No one says such a thing. What is said is "I can write assembly that runs faster than what your high-level language compiler generates." And this can be true in some circumstances, because there are cases where a compiler, written generically to optimize a very wide field of cases, just can't or isn't allowed to optimize: its generic approach must produce mathematically correct code for all cases, not only for some of them. So the compiler will not optimize in such a case, but a human can, because he also knows the details of that special case, like its expected input and its limits. Thus he will be able to optimize for the specific case where the generic compiler has to hold back.
And in other, typically more complex cases, it can happen that the compiler is not advanced enough and simply unable to optimize, because it doesn't see that an optimization is possible.
In the first case there will never be a solution for the compiler, because the optimization cannot be mathematically guaranteed. That problem stays forever, no matter how advanced compilers become.
The latter case is a matter of developmental stage, so there is still room for improvement, and the compiler can get better here.
And of course we all know human limitations and why the compiler is better in most cases, so we don't have to talk about that; this is about a specialization where the programmer knows the ISA and the optimization tricks very well.
Conclusion: in the end you should know both your own limits and the limits of the compiler.
In the examples there is no handling of loop tails (the leftover elements when the data size isn't a multiple of the SIMD width).
10:46 AMD GPUs do exactly 64 floats side by side though, right? Well, it's a bit iffy whether you'd call that a SIMD register anyway.
AMD GPUs don't actually have literal 64-float SIMD units; they have SIMD units that operate on 16 floats, then pack 4 of them together and run them in lockstep to make the 64-float wavefront.
You don't have to modify your code at all with "vertical" vectorization: just apply SIMD to all operations and enjoy a free speed upgrade.
Meanwhile, with horizontal vectorization you have to rewrite your code completely for ray tracing: handle pointers to materials, the reduction to the closest hit, and above all recursion, where the paths and step counts of each vectorized ray are very different.
Why don't we have an "AsmCon"? That could teach a few lessons to all the modern C++ hipsters.
Great idea Lothar!!
Wait, if the C++ guys are the hipsters - who attends PyCon? Or even worse JSConf?
Why doesn't your imaginary AsmCon have a comment about having a VerilogCon? That could teach a few lessons to all the ASM ingrates.
(I'd put a smiley face emoticon here but I'm not hipster enough to know how to do that.)
@@aaardvaaark I love how you called it an emoticon and not an emoji. :)
What would be the benefit of using the library over just letting the compiler autovectorize code? Modern compilers are already doing a pretty good job at that.
Tried it a while ago; alignment doesn't seem to be much of an issue on modern systems. If I write the loop right and use -O3, I end up with SIMD instructions.
Modern compilers are not very good at auto-vectorization. I don't know where you got that from, but it's simply not true.
"Modern compilers are already doing a pretty good job at that." No, they don't.
36:46 "Saxpy is nonsense as well" - pardon me, but SAXPY is at the core of most artificial neural networks: input*weight + bias. Just sayin'.
guy has ego
Yup, not my best moment... in my head I was more thinking "this _particular_ SAXPY thing I wrote", which didn't come out right at all! Thanks for clarifying.
In AI, a is a matrix, not a scalar. Ax+b is not SAXPY.
@@max0x7ba Yes, that is technically correct. Local Response Normalisation (LRN) layers in CNNs, however, use straight up (S)AXPY in their implementation, so I should have been more specific indeed.
@@totalermist May be, but you cannot build a neural network by just using SAXPY. However, you can build a neural net just by using fundamental Ax+b. Your claim that SAXPY is at the core of most neural nets is false.