Branchless Programming in C++ - Fedor Pikus - CppCon 2021

  • Added Jan 4, 2022
  • cppcon.org/
    github.com/CppCon/CppCon2021
    ---
    Have you ever written code like this:
    void f(bool b, long x, long& s) { if (b) s += x; }
    Probably you have. Would you like me to tell you how much performance you left on the table? With a small change, that function could be made 2.5 times faster.
    What about this code:
    if (a[i] && b[i]) do_something(); else do_something_else();
    Would you believe me if I told you that, under some not-so-exotic conditions, this line runs four times slower than it could? It’s true, and I’m going to show you when and why.
    This presentation will explain how modern CPUs handle computations to ensure that the hardware is used efficiently (which is how you get high performance from the processor). We will then learn how conditional code disrupts the ideal flow of computations and the countermeasures employed by the CPU designers to retain good performance in the presence of such code. Sometimes these measures are sufficient, often with the help of the compiler. But when they aren’t, it is up to the programmer to recover lost performance by coding with fewer branches.
    ---
    Fedor Pikus
    Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp (a Siemens business). His earlier positions included Senior Software Engineer at Google and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998 when he switched from academic research in computational physics to the software industry. Fedor is a recognized expert on high-performance computing and C++; he has presented his work at CppCon, SD West, and DesignCon, has published in Software Development Journal, and is also an O’Reilly author. His responsibilities as a Chief Scientist include planning the long-term technical direction of Calibre products, directing and training the engineers who work on these products, design and architecture of the software, and research into new design and software technologies. Fedor has over 25 patents and over 100 papers and conference presentations on physics, EDA, software design, and the C++ language.
    ---
    Videos Filmed & Edited by Bash Films: www.BashFilms.com
    YouTube Channel Managed by Digital Medium Ltd events.digital-medium.co.uk
  • Science & Technology

Comments • 176

  • @doc7115
    @doc7115 1 year ago +46

    That "session is over" was definitely not predicted.

  • @denispriyomov6086
    @denispriyomov6086 2 years ago +54

    Very interesting with a bit of a drama at the end: "Your session is over". 😏

  • @douggale5962
    @douggale5962 1 year ago +75

    I suggest you look at `sudo perf top -e branch-misses` - it will just tell you exactly where there are too many mispredicts. Hit enter on the top thing and drill down to what function it is. Build with debug info.

    • @playman350
      @playman350 1 year ago +7

      I didn't even know you could use top with perf! Learn something new everyday.

  • @happycats-go8sv
    @happycats-go8sv 1 year ago +13

    1:03:30 He was ready for some more questions and then it ended suddenly with a well-deserved applause. It's like games where a cut-scene ends a fight with infinitely spawning enemies. "Well"

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 11 months ago +7

      Man that ending is depressing.

    • @garychap8384
      @garychap8384 6 months ago +2

      Impressively, the talk ends in a branch prediction failure.
      Fedor's reaction is amazing : ) From now on, that's exactly what I'll imagine the CPU doing.

  • @stephenjames2951
    @stephenjames2951 2 years ago +26

    Thanks for repeating the questions.

  • @tolkienfan1972
    @tolkienfan1972 2 years ago +59

    With x86 the biggest effect likely/unlikely has is to rearrange the instructions such that the unlikely branch is moved to a different area of the program. This makes the likely path a serial instruction stream, which is good for the instruction cache. Also good for branch prediction in the case there is no branch prediction entry in the table, e.g. the first time through.

  • @Thiago1337
    @Thiago1337 2 years ago +24

    Session is over, thank you!

    • @spacelem
      @spacelem 1 year ago +7

      Could have said "I'm sorry, we've run out of time, and we need to prepare the next session".

  • @binarytv2904
    @binarytv2904 2 years ago +7

    "The closer you sit, better the performance" - I see what you did there :)

  • @masondeross
    @masondeross 1 year ago +4

    I watched this talk sometime back around when it was first uploaded, and I am almost certain I missed that joke at the beginning: "it's a talk on performance; the closer you sit, the better the performance." I am so glad YT recommended it to me again.

  • @Ryan1729
    @Ryan1729 2 years ago +12

    Having looked at the comments before watching the entire talk, I was a little worried the talk would end before the speaker got to close the talk. So I'll mention here that the cut off happened during the question section at the end, after the last slide. Still somewhat abrupt!

  • @serhii-ratz
    @serhii-ratz 1 year ago +19

    This is an absolutely mind-blowing session

  • @Marius-ir1qn
    @Marius-ir1qn 2 years ago +66

    what a way to close the session

    • @GeorgeTsiros
      @GeorgeTsiros 1 year ago +18

      seriously, what's up with that? i've seen surgeries with less strict timetables
      okay, i haven't, but you get the idea

    • @spacelem
      @spacelem 1 year ago +19

      That was amazingly rude. "I'm sorry, we've run out of time and need to prepare the next session" would have been fine.

    • @jake1367
      @jake1367 1 year ago +17

      could not have predicted that

    • @ernststravoblofeld
      @ernststravoblofeld 1 year ago

      The Tyler Durden close.

    • @marcossidoruk8033
      @marcossidoruk8033 Před 9 měsíci +3

      ​@@GeorgeTsirosMost conferences work that way, its the norm, you know it if you have participated of one.

  • @jbar7742
    @jbar7742 2 years ago +90

    great talk, unfortunate end though :D can anyone tell me which CPU it was that had the faulty branch predictor?

    • @vasiliynkudryavtsev
      @vasiliynkudryavtsev 2 years ago +1

      Does that really matter?
      Are you going to write code only for one specific CPU, e.g. ARM?

    • @jbar7742
      @jbar7742 2 years ago +36

      @@vasiliynkudryavtsev no, I just asked out of curiosity

    • @z140140
      @z140140 2 years ago +1

      Any modern CPU will do the same; his code was written specifically to fool modern CPUs

    • @douggale5962
      @douggale5962 1 year ago +3

      If you want to make it as hard as possible for old processors, do taken, taken, not taken, not taken, repeatedly. Every branch will mispredict. They had a two bit counter that incremented if taken, decremented if not taken, and it predicted taken if counter for this branch >= 2. Once it has been saturated one way, it has to mispredict twice in a row to start predicting the other way. Modern processors are not so dumb, they have a bunch of histories, and it selects what history based on whether the last few branches were taken or not (to get correlation).

    • @kaychubies
      @kaychubies 6 months ago +1

      Best I can tell it may have been the Intel Pentium 4 Willamette, also called NetBurst. I can't find anything online that says what the speaker said -- namely that it reported the opposite of what it meant to -- but apparently it had pretty bad branch prediction.
      (I apologize in advance if I've mixed up some terms. CPU architecture isn't my strong suit!)

  • @piotrarturklos
    @piotrarturklos 2 years ago +12

    Excellent talk about branch predictor and the ways to take advantage of it. When I first saw the title though, for some reason I initially thought that the topic is about functional programming.

  • @endian675
    @endian675 2 years ago +16

    Awesome talk! It's rare that I would ever watch an hour-long video on YouTube, but by the time I got to the end of this one I wanted to go back and watch it again.

    • @CppCon
      @CppCon  2 years ago +8

      Glad you enjoyed it!

  • @boguscoder
    @boguscoder 2 years ago +3

    Hand-drawn illustrations are a great touch

  • @JohnDoe-ly4ls
    @JohnDoe-ly4ls 2 years ago +4

    It was a great ride! Thanks Fedor & @CppCon 👍

  • @GeorgeTsiros
    @GeorgeTsiros 1 year ago +3

    Always happy to see Fedor!

  • @bsuperbrain
    @bsuperbrain 1 year ago +1

    The Art of Writing Efficient Programs is a very good introductory book into this performance-focused universe, highly recommended

  • @ugurcan969
    @ugurcan969 2 years ago +5

    Great talk and I have learned a lot of stuff in only 1 hour :)

  • @jimr5855
    @jimr5855 1 year ago

    Really good presentation, thank you!

  • @AhmedSam
    @AhmedSam 2 years ago +8

    Loved it. Thanks Fedor!

    • @CppCon
      @CppCon  2 years ago +1

      Glad you enjoyed it!

  • @KeyserTheRedBeard
    @KeyserTheRedBeard 1 year ago +3

    Cool upload, CppCon. I smashed that thumbs up on your video. Keep up the excellent work.

  • @krellin
    @krellin 1 month ago

    These concepts apply not only to C++ but to many other languages; very good presentation

  • @younghsiang2509
    @younghsiang2509 2 years ago +5

    The session is over, thank you......

  • @huzifa30
    @huzifa30 2 years ago +3

    Thank you very much, exciting talk

  • @DankAirsoft
    @DankAirsoft 2 years ago +10

    Great summary of the subject, I learnt a lot

    • @CppCon
      @CppCon  2 years ago +3

      Glad to hear it!

  • @u263a3
    @u263a3 2 years ago +9

    The session had ended 🤣

  • @PedroOliveira-sl6nw
    @PedroOliveira-sl6nw 2 years ago +38

    Excellent talk. Very sad to see it cut short but it is what it is. I have the book but haven't started yet. I would actually like to know/test if this can be done in GPUs too.

    • @jorcyd
      @jorcyd 2 years ago +9

      Indeed, a lot of GPU-based implementations of math functions (such as abs) are intrinsically branch-free, as divergence between threads in a GPU tends to degrade performance.

    • @DagarCoH
      @DagarCoH 1 year ago

      It can, and it is done. Many aspects to talk about when optimizing for GPUs

    • @SianaGearz
      @SianaGearz 1 year ago +1

      No need pretty much? All shader instances within an execution unit will cause all branches for which the prerequisites are fulfilled in any of the instances to be taken (effectively by all these instances), so while at source level you branch, the underlying implementation is effectively branch-free by hardware and system design. Furthermore the cost of branching, when it rarely does happen (aggregated), is pretty much zero, as during the resulting latency window, other work is performed, another context is ready to go and switch in by then. GPUs are INSANELY good at latency hiding and maintaining high nominal utilisation. Branching at source level is actually recommended, for the case that none of the instances in an execution group take a given branch, then the whole execution group can skip it, else the whole group effectively takes it.
      In the end you don't really have much control over it and often times you have to even pay the associated extra cost of branch-free execution (those other branches being executed as well) in spite of nominally branching.
      It all even goes back to the very first shader model which had no branching provision at all, at machine language level there were only selective mov instructions, no jump instructions; while shader compiler allowed and recognised branches at source level and converted them correspondingly. Loops had to have a trivially provable max iteration count and would be fully unrolled by the compiler, naturally.

  • @innovationscode9909
    @innovationscode9909 2 years ago +2

    Marvellous. Thanks. Deep thinking...

    • @CppCon
      @CppCon  2 years ago

      You're most welcome

  • @leshommesdupilly
    @leshommesdupilly 1 year ago +1

    Cpu and compiler engineers are mad geniuses lol

  • @douggale5962
    @douggale5962 1 year ago +1

    Nice to see a talk like this when there are so many people scoffing at branch free, because they saw one example where a perfectly predictable branch skipped a few ops and ended up slightly faster.

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      How many times in your life did you need those 0.3ns back? :-)

  • @serhii-ratz
    @serhii-ratz 1 year ago +1

    “Predictor is good in prediction. We are not“ - nice :)

  • @SerFluffikins
    @SerFluffikins 2 years ago +1

    In examples 1 and 2: isn't there an additional data dependency because both the 'if' and the 'else' branch write to a1? It seems a2 remains at 0 in those examples.

  • @BenjaminBuch
    @BenjaminBuch 1 year ago

    Great talk, thank you!

  • @christophclear1438
    @christophclear1438 2 years ago +23

    Great talk!
    I would have liked to ask if the array-of-two-bool-index micro-optimization for removing a branch is generally preferable to what I would have done more naively: multiplying the two values by 1 and 0 (by treating the bool as an int) and then adding or OR-ing them.

    • @EgD996
      @EgD996 2 years ago +21

      I guess the answer could be: measure it

  • @bobby9568
    @bobby9568 2 years ago +3

    Really well presented!

    • @CppCon
      @CppCon  2 years ago

      Thank you kindly!

  • @toufikoran8416
    @toufikoran8416 2 years ago +5

    very informative
    thanks

    • @CppCon
      @CppCon  2 years ago +1

      Glad it was helpful!

  • @MMIIRRKKOO
    @MMIIRRKKOO 2 years ago +2

    Great talk!

  • @OmarChida
    @OmarChida 2 years ago +14

    Great talk, quality content from Fedor as always! I will definitely buy his book, as I'm really interested in these types of optimisations.
    Also, can someone explain why the optimisation with function pointers doesn't work when functions are inlined?

    • @rauldragu9447
      @rauldragu9447 2 years ago +9

      Functions wouldn't work *ever* . A function call is a jump, so basically a branch. You could get the pointer branch-lessly, but then you would have to stall until the actual instructions that make up the function are loaded. It's basically like loading an array. No matter how fast you can get the pointer to that array, when you dereference it to use it, you still have to wait for the array to get cached. Same with functions. And if you just had if (cond) foo(); else bar(); you would always have a fast branch and a very slow branch, but if you introduce indirection with the function pointer you always have a pretty slow branch by having to cache the function instructions every single time, only after you know where they are stored. If the condition was sufficiently unpredictable, then it may be worth it, but it seems unlikely.
      And for the inlined function it's even worse because an inlined function doesn't need to build a new stack frame since all the instructions are right there, inlined. Maybe even the caching of instructions is faster this way. If you use pointers you are forcing the compiler not to inline since it needs to be able to point to anything.

    • @qwertyuiop-cu2ve
      @qwertyuiop-cu2ve 8 months ago +1

      @@rauldragu9447 I think what may work is call both functions, store the results in the array, then use the condition to index into the result of the correct function. But if the function is more expensive than a branch misprediction on deciding which function to call, then you should just use the branch. Like Fedor Pikus said, measure! If performance doesn't matter enough for you to make a benchmark and measure, then you shouldn't do any special tricks either.

  • @uy-ge3dm
    @uy-ge3dm 2 years ago +14

    Surprised that the humble s += b*x could be so much more efficient. I've been using that one for years

    • @jbar7742
      @jbar7742 2 years ago +4

      Though I would have really expected the compiler to pick up on this

    • @47Mortuus
      @47Mortuus 2 years ago +12

      What's even better is a negation of either 1 or 0 with a bitwise AND (-1 is all one-bits). 2 clock cycles of latency (NEG; AND) vs 3-4 (multiply):
      s += -b & x;

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 11 months ago

      @@47Mortuus I assume that one only makes sense if you are adding integers.

  • @Minty_Meeo
    @Minty_Meeo 1 year ago +5

    In PowerPC, the branch conditional instruction has the prediction baked into it. I've seen, for example, MetroWerks compiler output where its static predictions were very primitive (99% of branches forward were unlikely, 99% of branches backward were likely). I've yet to use the C++20 attributes, but [[likely]] and [[unlikely]] probably give you manual control of the prediction bit for those branches, which is neat.
    Debug assertions and nullptr checks were totally the first thing I thought of when I learned about these attributes. Even if the compiler is smart enough to recognize a nullptr check and mark it as unlikely (I'm sure it is), it is nice to be able to self-document it with an attribute.

    • @marcossidoruk8033
      @marcossidoruk8033 9 months ago

      Yes, the compiler will always assign a lower probability to a null pointer check indeed.
      And no, never use likely or unlikely without profiling and looking at the assembly. The C++ standard is unhelpfully vague: it says likely/unlikely means "arbitrarily likely", and in clang and gcc that works out to 80%/20% respectively. Because of things such as null pointer checks and early returns, the compiler can assign a branch a probability lower than 20%, but since unlikely always means 20%, marking it as unlikely can actually make it more likely.
      The standard should have a way to tell the compiler exactly how likely a branch is ([[likely 5%]] or something like that), but it doesn't, and thus unlikely may not do what you expect it to do.

    • @mytech6779
      @mytech6779 6 months ago

      @@marcossidoruk8033 That is much more of an implementation failure than a standard failure.

  • @BowMcGee
    @BowMcGee 2 years ago +21

    Super interesting topic. Great presentation. I wonder where the most gain is with current software (current hardware is criminally underused due to bad software). Is it branch prediction, is it parallelism, is it cache-oriented code? Or is it something else?

    • @dmitrykargin4060
      @dmitrykargin4060 2 years ago +10

      Software is vastly different. We develop solvers for different forms of sparse linear equations. This particular case of algorithms is often memory-bound. CPU spends 100 cycles for next portion of data to be loaded from RAM and about 10 cycles to process.
      So branch misprediction is not a problem in our case. Cache misses and memory speed are.

    • @xeridea
      @xeridea 2 years ago +3

      Depends on the code. General software probably branch prediction. Video, audio, graphics, or other things easily parallelizable probably parallelism and cache. Using SIMD instructions can easily provide an absolutely massive performance gain (perhaps 5-8x) where it makes sense. Multithreading can help with a lot of apps, not just those capable of benefiting from SIMD, though still not always a perfect or easy solution. Cache optimization can benefit, there are some general practices to prevent obvious slowdowns (such as how you iterate multi dimensional arrays), but can be a bit tricky otherwise.

    • @iuppiterzeus9663
      @iuppiterzeus9663 2 years ago +3

      I'd say memory layout and parallelization

    • @bva0
      @bva0 2 years ago

      @@dmitrykargin4060 I find this quite interesting. Do you work with direct solvers or iterative solvers?

    • @SianaGearz
      @SianaGearz 2 years ago

      I would say based on relative performance of different processors with different implementations of branch prediction, different number of cores, different SIMD configurations, different cache capacities: all of the above.
      But branch prediction has a knock on effect on everything, like your high memory latency through the cache hierarchy doesn't really matter so much if you don't do useless requests.
      On the other hand, bad memory locality just has such a devastating impact on performance that it's really unfortunate to ignore.

  • @nitsanbh
    @nitsanbh 1 month ago

    18:30 register aliasing blew my mind
    I had no idea eax is a logical register!

  • @omarkhan6217
    @omarkhan6217 2 years ago +3

    give him another hour

  • @robertjmccabe
    @robertjmccabe 2 years ago +1

    Very creative talk

  • @MarcoBergamin
    @MarcoBergamin 1 year ago

    Very interesting, thanks

    • @CppCon
      @CppCon  1 year ago

      Glad you think so!

  • @fzz5964
    @fzz5964 2 years ago +1

    cool!

  • @ciCCapROSTi
    @ciCCapROSTi 1 year ago +1

    Really great topic and good info from Fedor, but his style takes some getting used to.

  • @jieliu756
    @jieliu756 1 year ago

    Thank you, CppCon. Are the presentation slides available anywhere?

    • @CppCon
      @CppCon  1 year ago

      Thank you for your comment. We are investigating this and hope to resolve the issue shortly.

    • @AlFredo-sx2yy
      @AlFredo-sx2yy 9 months ago

      @@CppCon idk what issue you're even trying to resolve lmao. They asked for the slides, this aint a bug report :s gotta love automated bot comments...

  • @47Mortuus
    @47Mortuus 2 years ago +10

    The part about the compiler built-in likely/unlikely hints is completely off.
    It is just meta information for the resulting ORDERING OF INSTRUCTIONS in the machine code. This is due to the programmer wanting to avoid (instruction) cache misses, since the branch marked as more likely will immediately follow the conditional check.
    Furthermore, the unlikely branches may be put at the very end of a particular procedure (function) or loop, with an unconditional jump back to where the branch ends. It may even prevent function inlining in order to improve performance that way (yes, sometimes forcing not to inline improves performance).

  • @vladimir0rus
    @vladimir0rus 2 years ago +6

    "Your session is over" :(

  • @leonid998
    @leonid998 1 month ago

    18:57 Should it be register renaming?

  • @bishop3000home
    @bishop3000home 2 years ago +10

    Serious question: if we repeat all the time 'measure before changing', and compilers and processors may do the work better, then why do we think that after we've changed the code once, it stays the fastest code? We made the code less readable, removed a branch, and added more work.
    What if a new processor or a new compiler optimization comes along and the original code would be better?

    • @SianaGearz
      @SianaGearz 2 years ago +4

      You cannot strictly rely on the code being perpetually faster indeed, but what choice do you have? You can't optimise for the unknown, and pessimising software in the hope that an architecture comes along someday that magically makes your slow code fast obviously doesn't help either.
      One somewhat warranted assumption you can make is that processors will be optimised towards current, existing code; that they will be similar to today's processors just better and faster, that they will implement similar techniques as today but make them more robust. That processors when they get released and benchmarked by the press, will have to compete on old code, which will be a mix of code profile-optimised for last-gen processors and canonical style code.
      Another related assumption is that even if the processors open up a different way to do things, old way will not be slower in absolute terms, just in relative terms. So say if your software has minimum specification being an AMD Bulldozer, and you optimise towards that, you have eliminated the performance weakness on the weakest hardware that you target, and that's enough, you don't actually care that the software fails to perform much quicker on a nominally faster processor, you have created a software that is palatable for the entire userbase. I have optimised towards Pentium 4 in the past (well over 10 years ago) knowing that it's the chip family most likely to struggle among all processors people are likely to use, and it didn't make nearly as much of a difference either way on anything like an Athlon XP or a Pentium M or a Core.
      Obviously neither of these assumptions is absolutely true but they are useful as guiding principles in absence of better information.
      Of course if you have the freedom - and there exists code which actually does that, though it's exceptionally rare - you can have several different implementations in your software, and profile them when you detect system configuration change or on startup. Of course it has its own hazards as you introduce an indirection at entry into that code, so better be sure it's worth it. Less risky is having the benchmark of optimised vs. canonical code automated in your test suite outside the shipped software.
      I do think performance benchmark should be part of your CI procedure, because if there are performance regressions - usually an undesired side effect of development such as feature work or bugfixing rather than any sort of platform changes - you'll know about them. These performance regressions are a day to day reality rather than a hypothetical differently behaving platform to come along someday, and your software will generally gradually turn into a disaster if you ignore them.

    • @Adowrath
      @Adowrath 2 years ago +2

      That's why you don't just measure once, change, and then leave it. You continually measure. If you upgrade your compiler and there's a 5% performance drop, you measure, find the (potentially new) bottleneck, and change.
      Don't forget that also 99% of the time you shouldn't need to think about this in the first place.

    • @mytech6779
      @mytech6779 6 months ago

      There are many cases where the hardware target is well known and exact. For example an embedded device, a controller for some factory, or a transportation system (e.g. flight management systems cannot make hardware version changes without major testing); or at the big end, when you are optimizing for a supercomputer, a mainframe, or a large server-farm deployment (matching hardware), the electrical costs alone can more than offset the cost of targeted optimization and a custom compile.
      (These cases spend tens of thousands of USD per month on electricity. A 1% efficiency gain over a year is worth a week of developer time.)
      In a few cases the hardware may be purchased after software profiling and optimization results obtained on a selection of sample machines.

  • @LorenzBucher
    @LorenzBucher 2 years ago +3

    Interesting and informative talk! I like the hands-on, example-driven approach.
    What I don’t like is the constant interruptions (esp. ~ mins 30-40) from the audience questions. These are very hard to follow as a remote viewer and disrupt the flow.

    • @vasiapatov4544
      @vasiapatov4544 2 years ago +4

      I agree that it reduces the quality of our experience as remote viewers, but you have to remember that these conferences are optimized for the experience of the in-person attendees (as they should be, it's an organic, and expensive event)

  • @skybuck2000
    @skybuck2000 2 years ago

    Why does the function pointer trick not work? Perhaps an expensive memory lookup for the function?

    • @GordonAitchJay
      @GordonAitchJay 2 years ago +2

      Fedor explains why in the last 3 paragraphs of chapter 3: CPU Architecture, Resources, and Performance - Branchless computing. In short, function pointer calls prevent inlining, which is a more beneficial optimisation than avoiding mispredicted branches.

  • @VitisCZ
    @VitisCZ 4 months ago

    31:21 anyone knows what CPU he is talking about? I'm quite curious

  • @multiHappyHacker
    @multiHappyHacker 2 years ago

    I so wanted the bool conditional to be the solution at: @44:12

  • @HFsrj
    @HFsrj 2 years ago +18

    In the first example, how do you get only 10% misprediction if the predicate is random?

    • @theeinhaender5132
      @theeinhaender5132 2 years ago +2

      There are multiple branches. For example, there could be four branches that can be predicted with ~100% accuracy, and the fifth and final branch is predicted with 50% accuracy. Or just one branch that’s 100% accurate and it just gets evaluated 4x more often than the 50% one.

    • @giacomo.delazzari
      @giacomo.delazzari 2 years ago +4

      Perf says that 10% of the branches are mispredicted. However the "interesting" branch was just one of the many branches in the program. There's quite a lot of other code involved, think about the benchmarking framework being used.. For every benchmark instance it might be running quite a bit of logic. Also the for loop termination branch is present. Also, the arrays are filled up before doing the experiment itself, and this also involves some branches (for loop and stuff). All of these branches are well predicted, so the "interesting" ones that are mispredicted are just a part of all the branches of the program.

    • @meekrab9027
      @meekrab9027 2 years ago

      It is all branches for the entire program, which includes a lot more than just the code being benchmarked.

    • @tomasdzetkulic9871
      @tomasdzetkulic9871 2 years ago

      there are other branches in the code. There are branches in the loop around the `if` statement. There are branches in the benchmark framework etc.

    • @froody7
      @froody7 2 years ago

      10% of all branches executed were mispredicted, but the branch in question is only one of several executed in the loop, so it might be 50% mispredicted but when aggregated with a bunch of predicted branches at other sites the final rate could be 10%.

  • @shilangyu
    @shilangyu 1 year ago

    Why does code like `rand() & 0x1` generate branches that can be mispredicted? Doesn't this piece of code produce a branchless stream of instructions (with some bitwise-AND)?

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 11 months ago

      I'm not an expert but it depends on the implementation of "rand()" and it may not be inlined by the compiler

  • @tomaszstanislawski457
    @tomaszstanislawski457 2 years ago +3

    It can't optimize `c[i] = rand() >= 0` because function `rand()` is treated as "deterministic" random generator. The internal state of RNG must be advanced even though the returned value is discarded. The best one can get is `rand(); c[i] = 1;`

    • @somehrjdbddhsjsbnsndndk751
      @somehrjdbddhsjsbnsndndk751 2 years ago

      Deterministic? I don't get it, and I don't understand why.

    • @gianni50725
      @gianni50725 2 years ago

      @@somehrjdbddhsjsbnsndndk751 Computers, at least without special hardware, are incapable of making truly random numbers. So rand() with the same given seed will give you the same sequence of "random" numbers every time you call it.
      If the compiler optimized out the rand() call, then it would be violating one of the rules of the compiler since it would be affecting the output of the program.

    • @niles_5003
      @niles_5003 1 year ago

      @@somehrjdbddhsjsbnsndndk751 Tomasz was saying that `rand()` has side effects, so it can't be eliminated entirely. `rand()` has an internal state that holds the last randomly generated number, used to generate the next one. If you didn't call rand() at all, that internal state would not be advanced properly, and that could result in observable differences between optimized and unoptimized code, namely that the sequence of random numbers generated for a given seed would be applied to different vectors in his program.
      Now, of course, the programmer might not care about that and would like this kind of optimization to occur, but the compiler has no way of asking or knowing that the programmer does not care.
      Regardless, I would argue that a compiler warning would be better in this context than an optimization. If I had some condition buried in my code that was always true, then that condition is probably written incorrectly, or I could delete it myself. There are some tools which will warn when a condition is always true, but often the tool does not have a type that can describe the real outputs. I.e. if something returns an "int" or a "number" rather than a "positive integer", the tool has no way to determine that the returned value will always be greater than or equal to 0.

  • @Revoker1221
    @Revoker1221 2 years ago +7

    Another talk related to this speculative-branching topic is one given by Chandler Carruth:
    czcams.com/video/2EWejmkKlxs/video.html
    I'd recommend watching it first if some of the foundational material at the beginning of this talk seemed hard to understand. It's similar to this one, except with a focus on loops rather than if-else branches.

  • @KX36
    @KX36 1 year ago

    My low-mid level (second from the bottom) interview started with 70% leadership questions, followed by 30% extremely basic technical ones. I didn't expect leadership questions at such a low level, and I messed up that interview badly, but fortunately they hired all four people who made it to interview, so it's all good!

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      You know why, right? Because an extremely technical person has a tendency to get lost in minutiae and might spend hours, days, and sometimes months thinking about how to save a few nanoseconds in a program that needs 100% uptime instead. Like this guy.

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 11 months ago

      @@lepidoptera9337 lol whatever makes you feel better.

  • @chriswysocki8816
    @chriswysocki8816 2 years ago +1

    1. This video was made in 2021, presumably with a not-ancient CPU in the test system. I've heard so much about modern CPUs having redundant pipelines that keep evaluating both sides of a branch (while still keeping the branch-prediction circuitry, in case the pipelines are "shorter" than the branch code paths). If that's true, why doesn't it make the handling of the compiler-generated assembly much more efficient? A 10% (minority) misprediction rate should still lead to high efficiency.
    2. Why isn't the C compiler taking advantage of SIMD in the fastest branchless C version? Isn't there some compiler option you could turn on to get code that performs closer to the optimal assembly implementation?
    ... Are compiler implementors and CPU designers lying to us, or are they optimizing for unrepresentative/narrow test cases?

    • @simivb
      @simivb 1 year ago +1

      1. Evaluating both sides of a branch is going to perform worse than branch prediction for the vast majority of branches. As Fedor said in the talk, a 1% mispredict rate is already a really bad case, so if you always take both paths, you do unnecessary work for 99% of all branches: 1% of your code becomes faster and 99% of it becomes slower. That is not good enough as a general solution. I also don't know which CPUs you are talking about.
      2. You can tell your compiler to autovectorize using flags, but I believe it's not done by default. One reason is that the available SIMD instruction sets differ between processors, so the compiler avoids generating code your CPU may not even be able to run; you have to tell it which SIMD ISA you are targeting. It also absolutely can't do as good a job as optimal assembly, because 1) it doesn't have as much information as you do, so it can't make the same optimizations: the guarantees the code makes differ from the guarantees the algorithm makes, and the strongest optimizations often change how the program runs internally, which the compiler cannot do, since it only applies transformations that keep the observable program behaviour identical. And 2) SIMD code needs certain preconditions to excel, which are often not present in normal code, so the "optimal assembly" might involve preparing and transforming the data in a different way, with that code scattered across multiple functions. A programmer can make these changes because they can reason that the end result is identical, but it might be way too many instructions for the optimizer to reason about.
      There really is no reason to attribute malice to compiler or CPU vendors here.

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      If you depend on a machine that needs to be 99% efficient to keep up with your problem, then you have a poor hardware design to begin with, one that will fail the next time your requirements change. In most applications the CPU is waiting most of the time, so why in the world do you care? Gaming didn't get faster because anyone optimized the hell out of branches. It got faster because they are selling GPUs.

  • @AbstractObserver
    @AbstractObserver 2 years ago +4

    How did they manage to have an in-person convention in 2021?

  • @jean-luclachance7242
    @jean-luclachance7242 1 year ago

    Why can't compilers do this type of optimization?

  • @PrzemyslawSliwinski
    @PrzemyslawSliwinski 2 years ago +2

    53:10 - can this technique be used to write effective Meltdown/Spectre-proof code?

  • @user-nm7sp5xj7q
    @user-nm7sp5xj7q 2 years ago +1

    2:58 beginning

  • @PiotrKundu
    @PiotrKundu 1 month ago

    1:03:28 EOF

  • @edgararakelyan9326
    @edgararakelyan9326 1 year ago +3

    Awkward ending to say the least

  • @none_of_your_business
    @none_of_your_business 2 years ago +1

    bool(x) + bool(y) is a bit scary; booleans shouldn't have arithmetic addition defined, only boolean operations.

  • @skybuck2000
    @skybuck2000 2 years ago +2

    Do your cats own laptops ? LOL.

  • @mytech6779
    @mytech6779 6 months ago

    Just ran perf stat on a Python 3 hello world and got 6% branch misses; pypy3 had 3%, but way more total branches...
    With g++ or clang++ at -O0 or -O3 I still get 2.5% misses (though a lot fewer total branches: 800k vs 30M) using iostream with endl,
    and the same branches and misses for printf().
    Also, 3M instructions for printing hello world?
    Hmmm... 950k instructions and 220k branches for ` int main(){ return 0;} ` must be overhead from perf or the OS; the assembly is basically zero eax then ret.

    • @mytech6779
      @mytech6779 6 months ago

      Also, I just tested some assembly loops. I'd say around 10G instructions or 1G branches per run is a good level to wash out the bulk of the overhead. Maybe bump that an order of magnitude for fine tuning (but then you're also going to want some control over CPU core-thread scheduling to prevent context switching from molesting your cache).

  • @XKS99
    @XKS99 1 year ago +1

    It’s crazy the quality of people Russia has lost

  • @lepidoptera9337
    @lepidoptera9337 1 year ago

    You know that you have no real problems in your life if you are focused on the branch prediction pipeline of your CPU. ;-)

  • @736939
    @736939 2 years ago

    How does one find a job as a C++ developer? I'm not joking; I'm a .NET developer, and I'm asking a serious question. Companies (as usual) want to see professional C++ developers fresh out of university, which is impossible.

    • @BarafuAlbino
      @BarafuAlbino 1 year ago +2

      Ignore all conditions written in the vacancy description. Most companies don't care what they have written there.

  • @kilasuelika6955
    @kilasuelika6955 2 years ago +8

    Let me make a summary: one can use a function table to replace branches.

    • @greob
      @greob 2 years ago +6

      58:15 didn't he say that doesn't work though?

    • @iUniversEi
      @iUniversEi 2 years ago +3

      That would be a really bad summary, because this is something that does NOT work, as presented in the talk.

    • @kebien6020
      @kebien6020 2 years ago +4

      He specifically mentioned this as something that almost never works 58:19

    • @Derheight
      @Derheight 2 years ago +2

      I think a function table is a kind of branching mechanism because the jump address depends on a runtime value which must be predicted to avoid flushing the pipeline.

    • @MMIIRRKKOO
      @MMIIRRKKOO 2 years ago +2

      A table of function results forces you to evaluate everything. He specifically said that a table of function pointers is usually bad, and that if those functions could have been inlined, it's usually terrible.

  • @paulfunigga
    @paulfunigga 1 year ago +1

    You write this kind of code, and then some other person looks at it and thinks, "Was this guy on drugs when he wrote this?" I think all this makes C++ a horrid language.