CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!"

  • Published 11 Jun 2024
  • CppCon.org
    -
    Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/cppcon/cppcon2016
    -
    As C++ evolves, it provides us with better and more powerful tools for optimal performance. But often, knowing the language very well is not enough. It is just as important to know your hardware. Modern computer architectures have many properties that can impact the performance of C++ code, such as cache locality, cache associativity, true and false sharing between cores, memory alignment, the branch predictor, the instruction pipeline, denormals, and SIMD. In this talk, I will give an overview of these properties, using C++ code. I will present a series of code examples, highlighting different effects, and benchmark their performance on different machines with different compilers, sometimes with surprising results. The talk will draw a picture of what every C++ developer needs to know about hardware architecture, provide guidelines on how to write modern C++ code that is cache-friendly, pipeline-friendly, and well-vectorisable, and highlight what to look for when profiling it.
    -
    Timur Doumler
    ROLI Ltd.
    JUCE Senior Software Engineer
    London, UK
    Timur Doumler is Senior Software Developer at London-based technology company ROLI. He is currently working on JUCE, the leading cross-platform framework for creating audio applications used by hundreds of companies in the audio industry. After five years of writing high-performance code in Fortran, C, and C++ for numerical simulations of the cosmic structure formation, Timur became committed to audio and music production software. Before joining ROLI, he worked on various projects at market-leading company Native Instruments, such as KONTAKT, the industry standard sampling platform used by the majority of music producers and composers for film score, games, and contemporary popular music. Timur holds a PhD in astrophysics and is passionate about well-written code, modern C++ techniques, science-fiction, learning languages, and progressive rock music.
    -
    Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments • 118

  • @digitalconsciousness
    @digitalconsciousness 2 years ago +110

    This accessing of L1 cache is the reason game programmers are talking about Data Oriented Design (DOD) in the past few years. It is expensive to send an entire C++ object to L1 cache just to update an x and a y coordinate. You cannot put 100,000 ships on screen quickly. With DOD however, all the x's for all the ships are in a single array. Doing it this way, this array gets sent to the cache, and you rip through all 100,000 x values, then you do the same for the y array. Insanely fast.

    • @jeeperscreeperson8480
      @jeeperscreeperson8480 1 year ago +13

      Game programmers talk about dod/dop because this style of programming reduces software complexity. And the simpler the software, i.e. the less generic it is, the faster it is generally. An array of floats is simpler than a vector of objects. Dereferencing a float pointer is simpler than calling a getter. Loading 4 floats from an array into an xmm register is simpler than gathering those floats from 4 separate objects, etc.

    • @dd-qz2rh
      @dd-qz2rh 10 months ago +15

      @@jeeperscreeperson8480 It seems like you are mixing up software complexity, which is mostly used to describe the complexity of source code as perceived by the humans reading it, with "simple" code that generates fewer (or faster) instructions. DOD code can be much harder to read and write for an average programmer; its complexity actually increases, because a lot of stuff is no longer hidden behind abstractions, and you control and think about much more than simply putting together some logic and abstracting it away via an interface/design pattern/etc. What you call "simple" code is in fact just "fast", "optimized" code, which is what the previous comment already said.
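A minimal sketch of the two layouts this thread is comparing. The names (`ShipAoS`, `ShipsSoA`, `move_*`) are illustrative, not from the talk; the point is that the SoA pass streams one tight array of floats per field instead of dragging whole objects through the cache:

```cpp
#include <vector>

// Array of Structs (AoS): updating only x and y still pulls every
// ship's other fields through the cache along with them.
struct ShipAoS { float x, y, hp, fuel; };

// Struct of Arrays (SoA): all x values are contiguous, so a pass over
// them touches the minimum possible number of cache lines.
struct ShipsSoA {
    std::vector<float> x, y, hp, fuel;
};

void move_aos(std::vector<ShipAoS>& ships, float dx, float dy) {
    for (auto& s : ships) { s.x += dx; s.y += dy; }
}

void move_soa(ShipsSoA& ships, float dx, float dy) {
    for (auto& x : ships.x) x += dx;   // rip through all the x's...
    for (auto& y : ships.y) y += dy;   // ...then all the y's
}
```

Both functions do the same arithmetic; the difference is purely memory layout, which is exactly the DOD argument above.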

  • @teranyan
    @teranyan 2 years ago +91

    I once got rejected at a job interview because I told them exactly the thing you pointed out about lists/trees: that if you must use one, it can at least be accelerated with a custom allocator (preallocating nodes), depending on your use case and whether cache misses are a bottleneck. Apparently their "senior" programmer guy didn't agree with that one.

    • @puppergump4117
      @puppergump4117 2 years ago +2

      Anyone who listens to the people next to them without doing research is dumb.

    • @heavygm9623
      @heavygm9623 2 years ago +7

      lol

    • @ShaunYCheng
      @ShaunYCheng 1 year ago +6

      What company was that?

    • @samhadi7972
      @samhadi7972 9 months ago +4

      Yeah, when you talk like that to your everyday devs they think you are from a different universe or making stuff up lol
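A sketch of the kind of preallocating node allocator the top comment describes: a bump allocator over one contiguous block, so list nodes end up packed together instead of scattered by the general-purpose heap. This is a simplified illustration (fixed capacity, no reuse of freed nodes, no alignment rounding), not the interview answer verbatim:

```cpp
#include <cstddef>
#include <list>
#include <new>
#include <vector>

// Minimal bump allocator: hands out memory sequentially from one
// preallocated buffer. Traversing a list whose nodes came from here is
// far more cache-friendly than chasing heap-scattered nodes.
template <typename T>
struct PoolAlloc {
    using value_type = T;
    std::vector<std::byte>* pool;   // the preallocated block
    std::size_t* offset;            // shared bump pointer

    PoolAlloc(std::vector<std::byte>* p, std::size_t* off)
        : pool(p), offset(off) {}
    template <typename U>
    PoolAlloc(const PoolAlloc<U>& o) : pool(o.pool), offset(o.offset) {}

    T* allocate(std::size_t n) {
        std::size_t bytes = n * sizeof(T);
        if (*offset + bytes > pool->size()) throw std::bad_alloc{};
        T* p = reinterpret_cast<T*>(pool->data() + *offset);
        *offset += bytes;           // NB: a real version rounds up for alignment
        return p;
    }
    void deallocate(T*, std::size_t) {} // bump allocator: freed with the pool
};

template <typename T, typename U>
bool operator==(const PoolAlloc<T>& a, const PoolAlloc<U>& b) { return a.pool == b.pool; }
template <typename T, typename U>
bool operator!=(const PoolAlloc<T>& a, const PoolAlloc<U>& b) { return !(a == b); }
```

Usage: `std::list<int, PoolAlloc<int>> l(PoolAlloc<int>(&pool, &offset));` — every node the list creates now lands in the same buffer.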

  • @andreytimokhin936
    @andreytimokhin936 4 years ago +38

    Well-organized talk, with excellent explanations of the reasons causing performance degradation, using easy-to-understand examples. Would be very helpful even for quite experienced developers.

  • @KilgoreTroutAsf
    @KilgoreTroutAsf 5 years ago +123

    No. 1 tip for high performance: MEASURE EVERYTHING.
    It doesn't matter how strongly you THINK something is better. The technology stack is complex enough to interact in unforeseeable ways.

    • @aorusaki
      @aorusaki 4 years ago +6

      You want both. Especially since some things are hard to debug/measure accurately, being able to deterministically understand SOME part of what's going on will give you an invaluable heuristic that statistics alone can't grant. (E.g., being able to look at the assembly code with the datasheet for your microcontroller will give you very exact performance characteristics of instruction X over Y.)

    • @aorusaki
      @aorusaki 4 years ago +4

      In addition, it's STILL important to understand WHY x or y is slower than the other, so you can work around it and/or get hardware vendors to improve.

    • @seditt5146
      @seditt5146 4 years ago +8

      Dude, you have no idea how right you are. I have been trying my hardest to benchmark multithreaded code and get accurate results, especially comparing it against linear algorithms. Between compiler optimizations, CPU optimizations, and the overall difficulty of benchmarking MT code in the first place, this entire ordeal has become a nightmare. No matter what results I get, and how close they are to what I believe they should be, there is still that little voice in my head telling me I just cannot trust the measurements, since there is so much going on under the hood that I don't know what to believe.

    • @ChrisM541
      @ChrisM541 2 years ago

      You NEED significant expertise in assembly language. No point in having that debugger if you don't FULLY understand that core component...!!!

    • @samhadi7972
      @samhadi7972 9 months ago

      I wasted so many hours of my life optimising based on theory, only for the result to end up being slower… definitely benchmark, and do it often, if you care about the results.

  • @NehadHirmiz
    @NehadHirmiz 7 years ago +43

    Thank you very much. I really like your choice of presenting very intuitive examples.

  • @minastaros
    @minastaros 3 years ago +40

    Quite dry stuff, and not really what I need for my daily work. But gave me a huge amount of new knowledge. Very well presented, really good examples, and in general a clear and entertaining talk. I have really enjoyed watching it!

    • @CppCon
      @CppCon 3 years ago +8

      Glad it was helpful!

  • @user-zp3nd6ht8v
    @user-zp3nd6ht8v 4 years ago +7

    Very intuitive examples! Down-to-earth knowledge!

  • @Altekameraden79
    @Altekameraden79 7 months ago +1

    I am just a mechanical engineer doing vibration and signal analysis, but this presentation was very informative, even for me, having barely started out in C++ a few days ago.

  • @aorusaki
    @aorusaki 4 years ago +8

    These conventions NEVER fail to provide useful knowledge and springboard ideas that help inspire newer developments and technologies!

  • @TrueWodzu
    @TrueWodzu 5 years ago +16

    Thank you. Some stuff I knew, but I'd never heard of denormals. Will have to check them out :)

  • @deckarep
    @deckarep 1 month ago

    I know this talk is quite a few years old, and it has some solid points to consider, but I did want to comment on a few things. One: he compares audio software engineering to game development, saying that while game devs worry about code being performant at 60 fps, audio software development has to worry about 44k sps (samples per second). This is not a fair comparison: yes, a game loop may need to run at 60 fps, but within the context of a single frame many computations are done on the CPU, likely on way more than a few hundred thousand data points.
    The second thing: at one point he says that in general you want your array of structs laid out contiguously for locality, which is true in general, but if you utilize the Struct of Arrays (SoA) paradigm, which I don't think he mentions at all, you may well get even better performance, both because the memory is more tightly packed and because you don't fetch object fields that a given computation doesn't need.

  • @prafulsrivastava8942
    @prafulsrivastava8942 2 years ago +7

    This is exactly what I was looking for. Great work! Thanks for the info!

    • @CppCon
      @CppCon 2 years ago +1

      Glad it was helpful!

  • @renzocoppola4664
    @renzocoppola4664 7 years ago +55

    He knows his stuff.

  • @suryatn
    @suryatn 4 years ago

    Nice explanation with code samples and supporting data analysis.

  • @marec023
    @marec023 4 years ago +4

    What a great talk, thanks!

  • @Peregringlk
    @Peregringlk 4 years ago +9

    15:30 The hash for integers is usually the identity function, which is a no-op.
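This is easy to check, though it is implementation-defined: libstdc++ and libc++ do use the identity for integral types, while MSVC's standard library hashes integers through FNV-1a, so the snippet below reflects the gcc/clang toolchains rather than a guarantee of the standard:

```cpp
#include <cstddef>
#include <functional>

// On libstdc++ and libc++, std::hash for an integral type just casts
// the value to size_t: no mixing, no work at all.
std::size_t hash_int(int v) {
    return std::hash<int>{}(v);
}
```

One consequence worth knowing: with an identity hash, sequential keys land in sequential buckets of an unordered container, which can be either very cache-friendly or a pathological collision pattern depending on your key distribution.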

  • @TheGokhansimsek35
    @TheGokhansimsek35 3 years ago +5

    Excellent talk, full of great information and tips!

    • @CppCon
      @CppCon 3 years ago +2

      Glad it was helpful!

  • @ciCCapROSTi
    @ciCCapROSTi 8 months ago

    Good presentation. I knew about all of it except the FPU stuff, but got way more useful details than I had.

  • @michaelmorris2300
    @michaelmorris2300 7 years ago +6

    Good talk. People need to remember that performance in general-purpose computing will have its limits, where FPGAs could take over. There is only so much performance that can be squeezed out of general-purpose computing, notwithstanding the versatility and quick development of that paradigm. The talk was very good.

  • @dvlduvall
    @dvlduvall 5 years ago +2

    Nailed it, great examples.

  • @sundareswaransenthilvel2759

    Thanks a lot! Nice explanation!

  • @danielnitzan8582
    @danielnitzan8582 3 years ago +1

    Excellent talk!

  • @NekonyaNyaNya1
    @NekonyaNyaNya1 5 years ago +1

    Great stuff, at least for someone without degree in SC, thanks!

  • @praveenalapati5234
    @praveenalapati5234 7 years ago +1

    Good explanation

  • @coder2k
    @coder2k 4 years ago

    Great talk! Thank you! :)

  • @jaime7295
    @jaime7295 10 months ago

    This was really good!!!

  • @ambushedraccoon6408
    @ambushedraccoon6408 11 months ago

    Hello! Great content.
    Is there a link for the video mentioned at 19:31?

  • @kapyonik2424
    @kapyonik2424 4 years ago +10

    52:08: Here I see a different reason than the one mentioned. In the left code there are 4 different values accessed (a[i], b[i], b[i+1] and c[i]), while in the right version only 3 values are accessed (b[i+1], c[i] and a[i+1]). So is it really about vectorization? I don't understand this part; can someone explain please?

    • @BGFutureBG
      @BGFutureBG 4 years ago +4

      The same thing stood out to me initially, but I'm pretty confident it's insignificant here: b[i] and b[i+1] are extremely likely to be on the same cache line.

  • @qm3ster
    @qm3ster 2 years ago +1

    Summary of what is most surprising:
    1. the cache-associativity hole on round strides
    2. some modern CPUs are good at unaligned access (I was sure we were moving away from that instead?)
    3. false sharing
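On false sharing (point 3), a sketch of the standard layout fix. The 64-byte line size is an assumption (typical on x86-64; the portable constant is `std::hardware_destructive_interference_size`), and the struct names are illustrative:

```cpp
#include <atomic>
#include <cstddef>

// Two counters packed together share one cache line: core A's writes
// to `a` keep invalidating core B's cached copy of the line holding
// `b`, even though the two threads never touch the same variable.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padding each counter out to its own (assumed 64-byte) line stops the
// coherence-traffic ping-pong between the cores.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

static_assert(sizeof(Packed) <= 64,  "both counters fit in one line");
static_assert(sizeof(Padded) >= 128, "each counter owns its own line");
```

Same logical data, same code on each thread; only the layout differs, which is what makes false sharing so easy to miss without a profiler.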

  • @dineshdevaraju6698
    @dineshdevaraju6698 6 years ago

    Awesome Thanks...

  • @marktaylor6052
    @marktaylor6052 3 years ago +1

    33:40 - I found this out myself trying to optimise a virtual machine. I thought aligning instructions and arguments on 4- or 8-byte boundaries would help. In the end I saw I was only getting a 1% increase in performance.

  • @azadalishah2966
    @azadalishah2966 3 years ago

    🤔 How will you find the size of a cache line [before using alignas(cachelineSize)]?
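One possible answer, sketched under the assumption of a C++17 toolchain: the standard's compile-time hint, with a common fallback. (At run time on Linux one can also call `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)`.)

```cpp
#include <cstddef>
#include <new>

// Portable-ish cache line size for use with alignas(). C++17's
// std::hardware_destructive_interference_size is the standard answer,
// but some standard libraries don't ship it, so fall back to 64 bytes,
// the usual value on x86-64 and many ARM cores (an assumption).
constexpr std::size_t cache_line_size() {
#ifdef __cpp_lib_hardware_interference_size
    return std::hardware_destructive_interference_size;
#else
    return 64;
#endif
}

// Example use: give each per-core counter its own cache line.
struct alignas(cache_line_size()) PerCoreCounter { long value; };
```

Note that the standard constant is a compile-time hint for the target architecture, not a query of the actual chip you run on; for that you need the OS-level calls.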

  • @Luix
    @Luix 4 years ago

    Very interesting

  • @KeithMakank3
    @KeithMakank3 2 years ago

    What are we supposed to do about problems that specifically require random traversal of arrays?

  • @jackofnotrades15
    @jackofnotrades15 3 years ago +1

    Just curious, isn't this why data-oriented design is preferred? Seems like this gives a good example for data-oriented design. Am I wrong??

  • @sirnawaz
    @sirnawaz 3 years ago +1

    The example at 56:20 is interesting. I didn't quite understand the alignment issues with the offset. Is it because of the different offsets for the two buffers? If I use the same offset, say `+2`, for both, would it still have the alignment issue?

    • @sirnawaz
      @sirnawaz 2 years ago +1

      Watched this talk again a year later. I still have the same doubt. Haha.

    • @ChristianBrugger
      @ChristianBrugger 1 year ago

      The problem is that SIMD instructions only work on aligned data. If both have the same offset, the compiler can special-case the first 2 elements and then take 4 aligned ones at a time.

  • @7Denial7
    @7Denial7 7 months ago

    I think there's a mistake at 27:00. An 8-way associative cache means there must be 8 blocks (cache lines) per set. So if we have a 128 KB 8-way cache with 64 bytes per block, that means 128 * 1024 / (64 * 8) = 256 sets, 512 bytes per set, 8 blocks per set; not 8 sets of 16 KB each like you show in the presentation.
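The comment's arithmetic checks out and is worth writing down, since the resulting stride is exactly the "round number" that triggers the associativity hole discussed in the talk (a sketch; the cache parameters are the commenter's example, not a statement about any particular CPU):

```cpp
#include <cstddef>

// Set count for a set-associative cache:
//   sets = total size / (line size * ways)
constexpr std::size_t cache_sets(std::size_t total_bytes,
                                 std::size_t line_bytes,
                                 std::size_t ways) {
    return total_bytes / (line_bytes * ways);
}

// The commenter's example: 128 KB, 8-way, 64-byte lines.
static_assert(cache_sets(128 * 1024, 64, 8) == 256, "256 sets of 8 lines each");

// Addresses (sets * line size) = 16 KB apart map to the SAME set, so a
// stride of 16 KB can only ever use 8 of the cache's 2048 lines.
static_assert(cache_sets(128 * 1024, 64, 8) * 64 == 16 * 1024, "conflict stride");
```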

  • @RajeshPandianM
    @RajeshPandianM 1 year ago

    At 53:20, is there a typo in the left-side code... a[i] += b[i+1] instead of a[i] += b[i]?

  • @elliott8175
    @elliott8175 3 years ago +2

    To vectorise the loop (at 52:00), why not just do this:
        for (int i = 1; i < 998; ++i)
            a[i] += b[i];
        for (int i = 1; i < 998; ++i)
            b[i+1] += c[i];
    Not only does this get rid of the dependencies, but it requires 2 positions in the cache (for each of the arrays) rather than 3, so it's better overall practice to reduce crowding. Or does this actually screw up the SIMD vectorisation? I don't know much about it.

    • @ChrisM541
      @ChrisM541 2 years ago

      Make certain you confirm, and fully understand, what the compiler is doing by critically examining the assembly version of your routine.

    • @Wayne-wo1wc
      @Wayne-wo1wc 2 years ago

      This could take double the time. A CPU can sometimes fit two instructions into one cycle on the same core, therefore it is better to do it in one loop.

  • @seditt5146
    @seditt5146 4 years ago +2

    At 41:00 he stated no branch predictor can work with random numbers, but I was wondering: if they can detect patterns and the BP is capable of looking ahead, why is it not possible for it to generate the random number in advance and make predictions using that pregenerated random number?

    • @mrlithium69
      @mrlithium69 4 years ago

      Not an expert, but random is usually directly affected by the CPU clock as the source of entropy, so picking a number in advance is not the same as picking it normally. Also, picking extra random numbers and then throwing them away because of invalid branches would have an effect: some would be missing, and the following set of random numbers would not match the original pattern of distribution.

    • @flutterwind7686
      @flutterwind7686 4 years ago

      Depending on the CPU architecture, branch prediction can look ahead and precalculate based on different possibilities only up to a certain point before running out of cache. If it invests entirely in one possibility and that guess is wrong, there is a huge penalty, since it has to start calculating the branch all over again (from the branching point in the code, e.g. the if statement). This limits how far the branch predictor can look ahead when dealing with random numbers (as the number of possibilities quickly inflates), for example in a nested if structure.

  • @bencenagy2542
    @bencenagy2542 1 year ago

    Actually, frame rendering has to provide far more samples than 24-60 per second: you have to render screenwidth*screenheight samples. So for a 1080p resolution that is about 2×10^6 samples per frame.

      • 1 year ago +3

      Maybe, but real-time sound programming is a little different. In a game, for example, you can render the same frame twice and nothing bad will happen. In sound, the membrane of a speaker is a physical object with kinetic energy, and if you send the same signal 2 samples in a row, that means the magnet should apply infinite braking power to stop it for a moment. It's like pulling the emergency brake on a locomotive, the listener will hear an audible click.

  • @masheroz
    @masheroz 1 year ago

    What is the talk he refers to at 49:45?

  • @Shardic
    @Shardic 5 years ago

    I loved this talk. Never knew that a simple contiguous set of variables can be SO MUCH faster than randomly situated ones.

  • @severemaisjuste
    @severemaisjuste 4 years ago +2

    Not considering any oversampling in either domain, isn't it more like
    gfx: 1920*1080*30
    sfx: 44100 * 2
    samples per second?

    • @shinyhappyrem8728
      @shinyhappyrem8728 4 years ago

      Why 30?

    • @jamessandham1941
      @jamessandham1941 4 years ago +1

      Yeah, I don't really know why he made that statement. Kind of needlessly provocative, IMHO. It's even more skewed in games than you indicate, because in games you are not just sending colors to the screen but also dealing with physics systems, multiplayer, tons of custom tools, fast compile times, and a general complexity of gameplay that is limited only by a designer's creativity, etc. Otherwise a good talk, though.

    • @TernaryM01
      @TernaryM01 4 years ago +3

      It's harder to write a program that does some computation in 1/44100 seconds than 44100 such computations in 1 second. It's about bandwidth vs latency. His point is that not only do you need to make computations as efficient as possible, but they also have to have low latency. You don't worry so much about latency in programming a game renderer, and the GPU itself is not optimized for latency, unlike the CPU.
      Of course he is not an idiot who thinks that the computation he's doing in audio is 200 times faster than game engines. Game engines are written in C++ and C, not Python.

    • @BGFutureBG
      @BGFutureBG 4 years ago

      I too think processing 60 frames (or even 250 with modern monitors these days) in a video game is way, way more expensive than sampling 44 kHz audio...

    • @teranyan
      @teranyan 2 years ago +1

      The CPU (which you use to process audio) usually has 8 strong cores while modern GPUs have 3000+ "stupid" cores. It's not the same type of sample and you can't really compare that.

  • @AshishNegi1618
    @AshishNegi1618 4 years ago +4

    If `rand()` is inside the for loop, then the `random loop on 2-d array` would be slow because of the cost of `rand()` as well.
    [Of course, random 2-d access will be slow, but it should not be that slow compared to column-wise access.]
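One way to address this objection, sketched rather than taken from the talk: precompute a shuffled index sequence up front, so the timed loop does nothing but the memory loads (function names here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Precompute the visiting order: a shuffle of 0..n-1 touches every
// element exactly once, in random order, with no RNG cost at measure
// time.
std::vector<std::size_t> shuffled_indices(std::size_t n, unsigned seed) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});   // 0, 1, ..., n-1
    std::shuffle(idx.begin(), idx.end(), std::mt19937{seed});
    return idx;
}

// The loop you actually time: loads only, no rand(), no modulus.
long traverse(const std::vector<long>& data,
              const std::vector<std::size_t>& idx) {
    long total = 0;
    for (std::size_t i : idx)
        total += data[i];
    return total;
}
```

Timing `traverse` with a sequential `idx` versus a shuffled one then isolates the cache effect from the cost of generating the random numbers.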

  • @ivailomanolov6908
    @ivailomanolov6908 3 years ago

    Can someone explain the example at 54:19, please? I have a hard time understanding it.

  • @teekwick
    @teekwick 7 years ago

    Really great talk, though next time maybe try to reduce the number of sub-topics, or ask for more time to present them, since it looked like you were rushing in the second half to get everything out.

  • @DanielJamesCollier
    @DanielJamesCollier 7 years ago +4

    Wouldn't it be much better if you generated the random numbers beforehand?

  • @TheOnlyAndreySotnikov
    @TheOnlyAndreySotnikov 3 years ago +1

    16:03, not true. VTune shows you "these things".

  • @stevenclark2188
    @stevenclark2188 3 years ago

    I suspect that in a case with only a few columns, column-major might even be faster, because it could use a cache line for each column, thus using more of the cache (if you don't line them up so they step on each other's cache slots, of course). Most of this is covered in the computer architecture course of a CS bachelor's.

  • @dascandy
    @dascandy 7 years ago +1

    You should've included the SSD/hard disk in your memory speed/latency discussion.

    • @valetprivet2232
      @valetprivet2232 7 years ago +7

      What for? If you're interacting with the hard disk, you know it will be slow.

    • @Slithermotion
      @Slithermotion 5 years ago +6

      I'm angry too, still wondering if a floppy disk is faster than internal RAM... haha xD
      But I think it's common knowledge at this point; if you are going into such detail with coding performance, then this should be common knowledge.
      If you are watching a video where a guy explains that getting data from L1 or L2 cache is time-critical, you'd better already know that disk storage is hopelessly slow in that dimension.

    • @flutterwind7686
      @flutterwind7686 4 years ago

      That's a rabbit hole; he can only cover the basics in the given time. Hash maps -> self-balancing trees -> B-trees -> esoteric data structures to model databases (improving disk-read performance when searching) take up an entire lecture on their own.

  • @quoc-minhton-that6391
    @quoc-minhton-that6391 5 years ago

    Can someone explain why, at 56:14, calling multiplyAdd with the array pointers offset cancels vectorisation? What data is fetched into the SSE register, and how? Thank you.

    • @CBaggers
      @CBaggers 5 years ago +5

      SIMD expects the data that will be loaded to be aligned a certain way. By offsetting slightly, the compiler could either 1. use unaligned SIMD load instructions if the CPU supports them, or 2. fall back to a non-vectorized loop. On older x86 CPUs unaligned loads were significantly slower, but nowadays it's not so bad. So option 1 may be significantly slower, and 2 (almost) certainly will be.

    • @quoc-minhton-that6391
      @quoc-minhton-that6391 5 years ago +1

      @@CBaggers Thank you!!

  • @childhood1888
    @childhood1888 2 years ago

    Perfect talk, except for the ums.

  • @Vermilicious
    @Vermilicious 2 years ago

    Cache lines, eh. Thanks!

  • @avashurov
    @avashurov 3 years ago +9

    I hope you're not benchmarking memory random-access speed with the rand function and a modulus operator on every access at 20:44 🤦‍♂️

    • @iamlalala1995
      @iamlalala1995 2 years ago

      Lol, yeah, I was thinking that.

    • @expurple
      @expurple 2 years ago

      So, is the correct thing to do to add that rand()%n calculation into the other two loops as well? Or maybe to precompute these random indexes?

      • 2 years ago

      Of course, replacing %m with &(m-1) is much faster if m is a power of 2 and m-1 comes from a const variable, but he might have kept it for readability.

    • @dat_21
      @dat_21 1 year ago

      Yeah. Easily 100 unnecessary cycles just from those two.

  • @elbozo5723
    @elbozo5723 2 years ago +1

    “want fast python? well tough shit”

  • @aritzh
    @aritzh 4 years ago

    In the example at 20:44, wouldn't the rand() be very expensive, since it (at least on linux) uses a system call, with the context switch and everything?

  • @KeithMakank3
    @KeithMakank3 2 years ago

    I'm not sure knowing hardware helps us tremendously; we are bound by data access times whatever we do, and if the impact is significant, it means the hardware is slow. Until we actually design hardware, it's not our problem.

    • @dupersuper1000
      @dupersuper1000 2 years ago +1

      It’s really only “our problem” when high performance is a business requirement. The people who really stress out about this stuff typically work in gaming, high-frequency trading, low-level networking, etc.
      The other thing to point out is that we are hitting the physical limits of processor speed and transmission of data between caches. It’s no longer possible for us to just “wait until hardware gets faster.” Either you figure out how to optimize your code and parallelize what you can on multiple threads, or you just eat dirt until some unforeseeable scientific breakthrough allows us to make faster hardware. Those are your options.

    • @qqqqqqqqqqqqqqq67
      @qqqqqqqqqqqqqqq67 1 year ago +2

      Dude, do you think Slack running slowly on Ryzens is because the hardware is slow?

    • @zhulikkulik
      @zhulikkulik 1 year ago

      You can't make one thing fit all purposes, so it is "our" problem.
      You can't bend physics and force electricity to run faster, but you can bend your approach to data and code and make it easier for the computer to do stuff. If someone managed to send Voyager into space in 1977 and it still operates to this day, we can definitely make a video game/social network app/whatever run on hardware that's a zillion times faster than what the people at NASA had 40+ years ago.

  • @blanamaxima
    @blanamaxima 4 years ago

    Well, it's hard, because some are really bad programmers.

  • @Antagon666
    @Antagon666 1 year ago

    The comparison at the beginning is kind of lame... You don't render just 60 frames per second; those frames often consist of 2 million or more samples. In comparison, 44,000 samples is nearly nothing; even the slowest of CPUs can emulate that in real time nowadays.

  • @wpavada3247
    @wpavada3247 7 years ago +9

    So we have another thing to worry about while coding in C++; I sometimes believe it is getting too overcomplicated!

    • @NikoKauppi
      @NikoKauppi 7 years ago +3

      It applies to both C and C++, and to some extent to other languages too, depending on how they handle things internally. And this is something that has been present for years already. As long as you keep your data aligned and your array accesses linear, you should be good to go. But it's good to know about the possible pitfalls in case you need to squeeze some performance out of one of your programs.

    • @wpavada3247
      @wpavada3247 7 years ago

      Can you offer me a job?!

      • 7 years ago

      Wp Avada, nobody said it was gonna be easy 😀 Just remember that C++ lets you work efficiently; other, higher-level languages can't provide that.

  • @ChrisM541
    @ChrisM541 2 years ago +1

    37:24 It's incredibly shocking how, in the relatively short space of around 30 years, the vast majority of today's programmers, even C++ programmers, have bypassed so much critical expertise in assembly language. It's truly, almost criminally, shocking, especially given that:
    1) assembly language is at the absolute core of C++ debugging, as demonstrated in this video, and
    2) different compilers can frequently produce different machine code, running at different speeds.
    Higher-level languages may well have a more valid excuse, but C/C++ programmers have literally no excuse.

  • @abhinavraghunandankannan3546

    The material covered is decent, but the way things are explained is quite bad! It is safe to say that the people who understand this talk already have a decent idea of the content.

    • @ahmeterdem9312
      @ahmeterdem9312 7 years ago +3

      I was going to disagree with you, but having watched until the end, I now agree. I think the purpose of the talk is more to raise some questions and curiosity about the actual hardware that your C++ code runs on.

    • @Boneamps
      @Boneamps 7 years ago +10

      I actually thought this talk was really good. It reminds you that if you want to do proper development of high-performance applications, you MUST know about computer architecture. And this guy Timur seems to really know his stuff. For some reason, SIMD and VLIW and all the fancy stuff that processors do nowadays are quite ignored in the community. But the bottom line is, yes, this talk is not for explaining things to someone who is just learning. The people who attend this talk are industry experts, who probably studied all of these things but don't have years of hands-on experience in the particular topics they are hearing about in the talks.

  • @rg3412
    @rg3412 6 years ago +2

    The "hmm" counter is exploding. The speaker should take a Toastmasters course; that would make his presentation clearer and more engaging.

  • @mariogalindoq
    @mariogalindoq 5 years ago +4

    Good content, but I feel very distracted by the "hmmm" and "aaaa" before each sentence. He really needs to learn how to speak in public; his speech is so stressful that I can't understand a lot of the sentences. Aaaam.

  • @Superlokkus5
    @Superlokkus5 5 years ago +4

    "You found books from computer science": the fact that he didn't know about, and also didn't say, "computer architecture" suggests he never attempted to learn a bit of formal computer science.
    "C++ books never talk about this stuff" LOL, are you kidding me?
    *Skipping through* Oh wow, branch prediction, data alignment, array access i.e. cache locality, false sharing, I've never heard of any of it... NOT.
    This is the nth average talk with stuff already heard a dozen times.

    • @gamingsportz3390
      @gamingsportz3390 5 years ago +2

      Did all of this in my high-performance class at my university.

    • @aorusaki
      @aorusaki 4 years ago +6

      Well, some of us didn't know, and now we do.