A Deep Dive Into Dispatching Techniques in C++ - Jonathan Müller - CppNow 2023

  • Added 23. 07. 2023
  • www.cppnow.org​
    / cppnow
    ---
    Dispatching Techniques in C++ - Jonathan Müller - CppNow 2023
    Slides: github.com/boostcon
    ---
    At the core of an interpreter is a loop that iterates over instructions and executes them in order.
    This requires dispatching: based on the current instruction it needs to select different code.
    A fast interpreter requires a fast instruction dispatcher, but so does everything else that needs to switch over a fixed set of different options.
    This talk investigates various dispatching techniques, from simple switch statements to jump tables.
    We'll look at performance analysis tools, benchmarks, and lots and lots of assembly code, in order to learn ways to trick the compiler into generating the assembly code that we actually want.
    Even if you don't need to actually write an interpreter or other dispatcher, you will learn a lot about optimization.
    ---
    Jonathan Müller
    Jonathan is a library developer at think-cell. In his spare time, he works on various C++ open source libraries for memory allocation, cache-friendly containers, or parsing. He also blogs at foonathan.net and is a member of the C++ standardization committee.
    ---
    Video Sponsors: think-cell and Bloomberg Engineering
    Audience Audio Sponsors: Innoplex and Maryland Research Institute
    ---
    Videos Filmed & Edited By Bash Films: bashfilms.com/
    CZcams Channel Managed & Optimized By Digital Medium Ltd: events.digital-medium.co.uk
    ---
    CppNow 2024
    www.cppnow.org​
    / cppnow
    ---
    #boost #cpp #cppprogramming
  • Science & Technology

Comments • 10

  • @boogly3716 • 9 months ago

    That's a very interesting topic, about which I didn't have even the slightest clue!
    Before watching I didn't expect to see such low-level stuff, but I'll surely have the following embroidered in my mind:
    1. The CPU can, and most of the time does, do instruction pipelining. Thanks to this, the CPU can also increment the PC (program counter) in parallel with executing the current instruction.
    2. The CPU can only do the above if it knows which instructions it has to execute, so it does branch prediction. As shown in the video, this technique is beneficial when the CPU is right about the predicted branches, and harmful otherwise. So the usual switch statement is not good enough, because it funnels everything through a single indirect branch, which can only capture the patterns common to all opcodes. And here threaded dispatch comes into play, which provides the branch predictor with more information about which instructions follow which.
    3. Also, compilers are usually aware of tail calls and can use them when all the conditions for this are met. The main ones (noted in the video) are that the callee has the same signature as the caller and that neither of them has destructors to run.
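The threaded dispatch the comment describes can be sketched with the GCC/Clang "labels as values" extension (computed goto). This is not code from the talk, just a minimal illustration with a hypothetical three-opcode bytecode (PUSH/ADD/HALT): every handler ends with its own indirect jump, so the branch predictor gets a separate prediction site per handler instead of the single shared one a switch loop produces.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bytecode: PUSH <imm8>, ADD, HALT.
enum Op : uint8_t { PUSH, ADD, HALT };

// Threaded dispatch via the GCC/Clang computed-goto extension.
int run_threaded(const uint8_t* pc) {
    // Table of label addresses, indexed by opcode.
    static void* table[] = {&&op_push, &&op_add, &&op_halt};
    int stack[64];
    int sp = 0;
    goto *table[*pc++];  // dispatch the first instruction
op_push:
    stack[sp++] = *pc++;   // push the immediate operand
    goto *table[*pc++];    // each handler dispatches the next op itself
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto *table[*pc++];
op_halt:
    return stack[sp - 1];  // result is the top of the stack
}
```

Because each `goto *table[...]` is a distinct indirect branch, the predictor can learn per-opcode patterns such as "ADD usually follows PUSH".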

  • @reductor_ • 9 months ago +1

    The load for the next instruction could potentially happen at the start of the previous one (except for a non-conditional jump), which might help target prediction, as the load has a higher chance of having completed once the jump is ready to execute. Would be interesting to experiment with.
    I'm surprised the compiler isn't already moving that load for the jump target up higher; it might think there is a chance of aliasing. I haven't looked deeply at the code.
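The hoisting the comment proposes can be written by hand. A sketch (not from the talk, using a hypothetical two-opcode bytecode): the dispatch loop loads the *next* opcode before executing the current handler body, so the load's latency overlaps with the op's execution. Note the bytecode needs one padding byte after HALT, because the early load reads one instruction ahead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bytecode: INC bumps an accumulator, HALT returns it.
enum Op : uint8_t { INC, HALT };

int run_hoisted(const uint8_t* pc) {
    int acc = 0;
    uint8_t op = *pc++;
    for (;;) {
        uint8_t next = *pc++;  // load for the NEXT iteration, issued early
        switch (op) {
        case INC:  ++acc; break;
        case HALT: return acc;
        }
        op = next;  // the next dispatch no longer waits on a fresh load
    }
}
```

Whether this helps in practice depends on the CPU and on whether the compiler had already scheduled the load early on its own, which is exactly what the comment suggests measuring.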

  • @u9vata • 8 months ago

    I am wondering... is it possible to make all/most opcodes branchless? I am thinking it might be possible to come up with a bytecode whose interpreter can be written mostly branchless...
    Also a bit sad that regular if-statement logic was not measured... that is the non-binary-search, non-jump-table variant, in case all ops fit in the I-cache.
    Also, what I would do is branch first based on whether the instruction just increments the PC or is a jump instruction, because that is very easy to predict: you can win by having ++pc always in the I-cache and pipelineable this way, while branching becomes the "slow path". It might hurt performance for very tight loops in the interpreted bytecode, but help regular loops.

  • @AK-vx4dy • 8 months ago

    I still wonder whether jumps between direct code don't give the best chance to the branch predictor, but I don't know the internals of today's CPUs.
    It may pollute the I-cache, but it gives more separated jump sites, which could be predicted separately.
    On old ARM you could use conditionals in instructions instead of jumps, but in AArch64 they abandoned this feature, if I remember correctly.
    I think x64 also has a conditional mov now. If the code is as small as in this example, then maybe it's better to execute all the code without branches and only store the result(s) conditionally.
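The "execute everything, store conditionally" idea from this comment can be sketched in plain C++: compute both sides unconditionally and select the result with a ternary, which mainstream compilers typically lower to a conditional move (CMOVcc on x86-64, CSEL on AArch64) at -O1 and above. This is an illustration of the pattern, not code from the talk.

```cpp
#include <cassert>

// Branchless absolute value: the negation is computed whether or not it
// is needed, and the ternary selects the result. Compiled with
// optimizations, this typically becomes a conditional move, so the
// branch predictor is not involved at all.
int abs_branchless(int x) {
    int negated = -x;            // done unconditionally
    return x < 0 ? negated : x;  // conditional select, not a jump
}
```

The trade-off is exactly what the comment says: both sides always execute, so this pays off only when the selected work is cheap.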

  • @userrand4900 • 9 months ago

    Does anyone understand how direct-threaded dispatch is implemented using tail calls? Specifically, how does the instruction pointer store the function address itself? Does that not require that the instruction stream be generated in the binary itself? Or is there a post-processing step that overwrites all the opcodes with the actual function addresses?

    • @foonathan • 9 months ago +1

      > Specifically, how does the instruction pointer store the function address itself,
      Yes.
      You either need to generate the bytecode in the same binary directly, or you have a post-processing step. Either one works. Note that the post-processing step just walks over the bytecode once on startup, so it has no performance impact on the dispatch itself.
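The post-processing step described in this answer can be sketched as follows. This is not the talk's code; it is a minimal direct-threading illustration using the GCC/Clang computed-goto extension and a hypothetical two-opcode bytecode: a single pass at startup rewrites each opcode into its handler's address, and the dispatch loop then jumps straight through the stored pointers with no table lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bytecode: INC bumps an accumulator, HALT returns it.
enum Op : uintptr_t { INC, HALT };

int run_direct_threaded(std::vector<uintptr_t> code) {
    void* handlers[] = {&&op_inc, &&op_halt};
    // Post-processing: one walk over the bytecode on startup,
    // overwriting each opcode with its handler's address.
    for (uintptr_t& cell : code)
        cell = reinterpret_cast<uintptr_t>(handlers[cell]);

    int acc = 0;
    const uintptr_t* ip = code.data();
    // Dispatch: the instruction stream now contains the addresses
    // themselves, so no lookup table is consulted at runtime.
    goto *reinterpret_cast<void*>(*ip++);
op_inc:
    ++acc;
    goto *reinterpret_cast<void*>(*ip++);
op_halt:
    return acc;
}
```

The rewrite is O(program size) and happens once, which is why it does not affect per-instruction dispatch cost.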

    • @userrand4900 • 9 months ago

      Thanks @foonathan

    • @ZeroPlayerGame • 6 months ago

      I'll add some context: direct threading was invented for FORTH, for which a lot of implementations are "live-compiler" systems, i.e. the compiler is incremental and knows where each function/basic op lives in memory; when you define your new function, it goes into the same image in the same address space. FORTH doesn't really do intermediate bytecode, and the artifact of code generation is a single big image blob.

  • @NickEnchev • 8 months ago

    TL;DR: 44:41 he uses Arch.

  • @kwinzman • 7 months ago

    If there are some architectures that don't support tail-call optimization, and that's why GCC is refusing to add an annotation like the musttail attribute, then why not just have the compiler refuse to compile code with those annotations on those architectures?! That sounds like an extremely weak argument for not implementing it.
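For context, tail-call dispatch with the attribute the comment refers to can be sketched like this. This is an illustration with a hypothetical two-opcode bytecode, not the talk's code: all handlers share one signature, each tail-calls the next handler from a table, and Clang's `[[clang::musttail]]` turns the tail call from an optimization into a guarantee. The macro guard reflects the situation discussed: without the attribute (e.g. on compilers that lack it), the code still compiles, but the tail call is merely an optimization the compiler may or may not perform.

```cpp
#include <cassert>

#if defined(__clang__)
#  define MUSTTAIL [[clang::musttail]]  // guaranteed tail call on Clang
#else
#  define MUSTTAIL  // attribute unavailable: tail call only as an optimization
#endif

// Hypothetical bytecode: OP_INC bumps an accumulator, OP_HALT returns it.
enum : unsigned char { OP_INC, OP_HALT };

// All handlers share one signature, a precondition for musttail.
using Handler = int (*)(const unsigned char* pc, int acc);
extern const Handler table[];

int op_inc(const unsigned char* pc, int acc) {
    // Tail-call the next instruction's handler; no stack growth on Clang.
    MUSTTAIL return table[*pc](pc + 1, acc + 1);
}
int op_halt(const unsigned char*, int acc) { return acc; }

const Handler table[] = {op_inc, op_halt};

int run(const unsigned char* pc) {
    return table[*pc](pc + 1, 0);
}
```

GCC 15 has since gained a `musttail` attribute as well; on older GCC, the guard above simply drops the guarantee.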