CppCon 2016: Chandler Carruth “High Performance Code 201: Hybrid Data Structures”

  • Added 30. 09. 2016
  • CppCon.org
    -
    Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/cppcon/cppcon2016
    -
    Modern programs’ performance characteristics are often dictated by their data: the cache locality of data access, the size of the working set, and the overhead of memory allocation. Unfortunately, the standard C++ library data structures range from adequate to terrible at controlling these aspects, and they don’t provide any of the core mechanisms needed for extremely efficient data structure design.
    This talk will present the core concepts of designing high performance data structures in C++. It is based on years of experience in the LLVM compiler as well as several other large code bases. From these principles, the talk will propose a suite of data structures that provide performance without loss of generality or functionality. As much as this talk will present specific data structure designs, its primary intent will be to give an understanding of what makes these structures have greater performance than more naive approaches.
    -
    Chandler Carruth
    Google
    C++ Lead
    San Francisco Bay Area
    Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
    -
    Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments • 71

  • @MarcusAseth • 6 years ago +111

    man, Chandler makes the camera man work really hard for his money xD

  • @kamilziemian995 • a year ago +10

    Chandler Carruth's talks are great. I try to watch every one of them available on YT.

  • @teekwick • 7 years ago +18

    Chandler is my favorite speaker after Herb, good talk!

    • @seditt5146 • 5 years ago +1

      I agree. Another one whose talks are all worth catching is Ansel Sermersheim; he makes hard-to-understand subjects highly consumable. Timur Doumler also has good talks, but he is not the best public speaker. Not the worst either; he is very good, just not at Herb's or Chandler's level.

  • @barrettellisadair • 7 years ago +7

    I always enjoy Chandler's talks - this one has great video/audio quality as well.
    I would like to emphasize the fact that this pointer-packing stuff is implementation-defined, and generally pointless outside of a performance bottleneck. So, some of the code you don't see in the slides likely contains a lot of platform-specific configuration macros (but I haven't looked at these parts of the Clang code base yet).
    TL;DW: Cater to the cache, pack your pointers.

    • @MrAbrazildo • 7 years ago

      I suppose you are talking about the 4-bit packing stuff, rather than the array of unique pointers for large objects, right?

    • @10se1ucgo • 7 years ago

      Agreed, he's definitely my favorite presenter.

  • @soonts • 7 years ago +9

    @20:00 Microsoft has had a very similar implementation in their ATL since 2003. It is called CAtlPlex and is used internally by many ATL containers.

    • @BlairdBlaird • 2 years ago +1

      Apple also uses that at the language level (of both ObjC and Swift IIRC). In fact on 64b they use not only the low bits but also the high bits, because all extant 64b implementations only have 48b address space at the moment (and x86-64 is architecturally limited to 52 bits).

  • @davidforgeas2235 • 7 years ago +4

    Is there a significant performance benefit to using the hybrid 32-bit-pointer, 64-bit-register ABI, namely the x32 ABI? I don't mean for the generated code, but specifically for the compiler itself.

    • @SianaGearz • 7 years ago +4

      So you get an expanded register file with 8 extra registers, but register scarcity has not been a major issue since about the Pentium Pro, since the top of the stack is actually aliased to an unknown number of hidden registers and the corresponding operations get removed at the opcode decoding stage. You'd get a slight bottleneck at opcode decoding, but it was pretty much removed in the later iterations of the Pentium III.
      There also doesn't seem to be any difference whatsoever in compiler performance between pure 32-bit and 64-bit ABIs, at least from my experience with GCC. I don't know whether that extends to Clang, but I strongly suspect so.

    • @dat_21 • 6 years ago

      Can you elaborate on "top of stack" aliasing? I don't find any mentions of it in intel optimization guide or Agner Fog's guides. Maybe you are talking about FP register stack?

  • @TheEmT33 • 5 years ago +2

    i love this guy

  • @insideloop22 • 7 years ago +5

    As usual, it is a joy to listen to Chandler's talks. I am working on an open-source library of data structures for high performance computing, and one of the first things I had to do was rebuild a custom std::vector. One of the many reasons is that std::vector v(n); fills the vector with zeros, which trashes the cache and prevents you from using the "first touch" policy, which is extremely important for NUMA architectures such as dual-socket nodes. Alignment is also an issue for efficient vectorization in HPC. These things could be "solved" with allocators, but I just hate them, as I don't want them to change the type of my containers.
    I have a few questions though:
    - Do you have an llvm::Vector, or do you use llvm::SmallVector as a replacement for std::vector?
    - Do you also use llvm::Vector as a replacement for std::array?
    - There is a BumpPtrAllocator, but it does not seem possible to use it with SmallVector. Do you have any design ideas for allocators? I just can't get them to work the way I want. I hate both the way they are done in the C++ standard library and the Bloomberg implementation.

    • @ChandlerCarruth • 7 years ago +8

      We use SmallVector pretty much everywhere. On very rare occasions we'll want what std::vector provides, but it's unusual.
      std::array is a totally different beast -- it is a statically sized array which has essentially no overlap with the vector use cases. I suspect there are some places where we could use std::array and currently use llvm::SmallVector, but I think they're rare.

    • @ChandlerCarruth • 7 years ago +2

      And regarding allocators -- for LLVM, the pattern we fall into more often is needing to control allocation of the *objects* with allocators, not the buffers in the data structures. For those buffers, tcmalloc, jemalloc, or any of the other optimized free-list based malloc libraries will do a fine job.

    • @henrikvallgren1943 • 7 years ago

      I tried to get attention to this issue a while back:
      groups.google.com/a/isocpp.org/forum/#!msg/std-proposals/b946VongJjY/RYYj--v63qsJ

    • @insideloop22 • 7 years ago +5

      There are too many things in the standard library that are broken for me, and I am glad I decided to write my own containers. It provides me so many useful things. As I work mainly on numerical simulation, all I really need is a good std::vector implementation. Having my own allows me to:
      - Have an il::Vector v(n) that does not fill the vector with 0 (same with float, int, etc.)
      - Have an il::Vector v(n) that gets initialized to NaN in debug mode (very useful if you want to track uninitialized values)
      - Add my own constructors, such as il::Vector v(n, il::align, 64), if I want to align the buffer to a cache line
      - Easily count the number of copy-constructor calls in il::Vector (if I want to make sure that the move semantics are properly done)
      - Have il::Vector::size() return a signed integer (I hate unsigned integers and all the bugs they lead to)
      - Write an il::SmallVector as Chandler has pointed out
      - so many useful things...
      I like the C++ language, but I just have so many troubles with the standard library. I am glad to hear that LLVM worked on their own containers. I can point people who say I am stupid not to use the standard library to this kind of talk.

    • @MrAbrazildo • 7 years ago +1

      So, did you already write your own, or just want to?

  • @tribalfromthetrapdontrappi3030

    Nice!

  • @enhex • 7 years ago +2

    12:00 Can alias templates be used to avoid those mistakes?

    • @ChandlerCarruth • 7 years ago +3

      Yes, but at the expense of having to come up with a new name for each different small-size-optimized container in a function's API boundary. =/ Doesn't seem to scale very well.

    • @xfreeman86 • 5 years ago

      You can use the same technique you use to remove the size parameter from the type of SmallVectorImpl. Allocators are passed by reference, so no slicing worries, and you can use the pimpl idiom for polymorphic copying. My jury is still out on whether it's better, but it is possible.

  • @Cons-Cat • 2 years ago +2

    I'm quite sure that the parameterized-allocators problem is easily solvable with C++20's `concept` feature now.

    • @ivanzhechev4215 • 11 months ago +1

      Slowly learning C++20, but how would you achieve that? The only way I can think of is rejecting incompatible allocators and giving a compiler error, which doesn't really achieve the result of having the SmallVector on API boundaries.

  • @JohnDlugosz • 4 years ago +2

    6:17 Slide 6 -- How does the SmallVectorImpl::push_back know whether to free the old buffer? Without knowing the original N, it can't tell if it's been reallocated to the heap already, or was still using the small buffer.

    • @pazdziochowaty • 4 years ago +3

      It can check whether Begin points right after the SmallVectorImpl object; this is where the derived class keeps the buffer.

  • @max0x7ba • 7 years ago

    Bug in slide 21: `(Size >= (End - CurPtr))` should be `(Size > (End - CurPtr))`.

    • @stdafx • 5 years ago +2

      In the original source code it looks a bit different: "if (Adjustment + SizeToAllocate

  • @spudtaters8419 • 7 years ago +3

    The SmallVector slide at 5:26 is confusing me because of omissions: 1) Can all small-size vectors be upgraded to heap storage on overflow? 2) Is SmallVectorImpl what you use for function params? I assume so, it just sounds like a weird / wordier name.

    • @insideloop22 • 7 years ago +1

      1) Yes, with SmallVector v; they can call v.resize(10);
      2) The name is a bit weird, but I don't know which one they could have chosen

    • @xfreeman86 • 5 years ago +2

      For 2), the names should have been switched. Then you allocate SmallVectorImpl but pass around SmallVector.

    • @sebastianmestre8971 • 5 years ago

      @@xfreeman86 what about allocating SmallVector and passing around BasicSmallVector

  • @SolomonUcko • 3 years ago +1

    Is this the previous talk? czcams.com/video/fHNmRkzxHWs/video.html ("Efficiency with Algorithms, Performance with Data Structures")

  • @wintermute701 • 7 years ago +5

    The talk from 2 years ago he mentions in the beginning: czcams.com/video/fHNmRkzxHWs/video.html

  • @Shockszzbyyous • 7 years ago

    OK, I see. So the idea behind SmallVector is that you can allocate some space beforehand, instead of allocating when pushing something onto the vector?

    • @tomcheng3903 • 6 years ago +7

      No, the point is that you allocate it on the stack, not the heap. If it's too large for the stack, then it'll fall back to the heap.
      You can allocate space in a vector beforehand with vector::reserve().

  • @onosendaiAAB • 7 years ago +3

    Hi Chandler,
    I may be misinterpreting you, but at czcams.com/video/vElZc6zSIXM/video.html
    are you implying that modern CPUs do indirect memory prefetching?
    e.g. are you implying that CPUs will work out you are iterating through an array of pointers, read one or more indices ahead in that array, and then prefetch the memory pointed to by that pointer?

    • @deadalnix • 7 years ago

      Modern Intel CPUs do it to some extent. I'm not aware of any other architecture doing it.

    • @onosendaiAAB • 7 years ago

      Interesting, do you have any further information or references about that?

    • @onosendaiAAB • 7 years ago

      I've just done some tests on my Ivy Bridge, and I'm pretty sure it doesn't do indirect memory prefetching.

    • @deadalnix • 7 years ago

      Note that prefetching on pointer load is bound to be slow no matter what. If you want to measure whether the CPU does prefetching or not, make sure you use the same algorithm twice: one version doing prefetching explicitly using compiler intrinsics, and the other letting the CPU do its thing.
      If you see no significant perf difference, your CPU does prefetching. If the one with explicit prefetching is faster, then your CPU doesn't do prefetching.
      I wouldn't be able to tell you whether your specific CPU does it. Some Intel CPUs do, but I'm not sure which ones.

    • @noahgoldstein3951 • 2 years ago +1

      @onosendaiAAB I think you're right there is no indirect prefetching. Possibly he means that the execution of later iteration loads can occur speculatively and in parallel to the first load because there is no dependency.

  • @MrAbrazildo • 7 years ago

    I don't get the point of the SmallVector class having a fixed size, whereas the goal of containers is to change it. I can only guess that 'N' is the maximum size the "simulated container" will ever get, right? But how would you know that?

    • @ChandlerCarruth • 7 years ago +2

      It's not a fixed size -- it's an initial *allocation* size. The container still is dynamically sized like std::vector.

    • @MrAbrazildo • 7 years ago

      +Chandler Carruth I saw the push_back function; however, it will produce just 1 element. And even if it creates another buffer, how would Begin and End work with 2 or more arrays, without extra code to handle the situation?
      I guess you are talking about something low-level, which would allow you to delete a compile-time array and reallocate it, unlike what C/C++ allows you to do, right?

    • @ChandlerCarruth • 7 years ago +2

      No, this is totally built with what C++ already allows you to do.
      The begin and end pointers start off pointing at the inline small buffer, and if extra space is needed it gets allocated and the begin and end pointers are moved to point at the new buffer.
      You can find all the code for the real implementation here: github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/ADT/SmallVector.h

    • @MrAbrazildo • 7 years ago

      Chandler Carruth So, the penalty is the risk of having an unused array after the memory grows. Good idea, though. The pointers could also return to the array if memory is freed.
      How much faster is an array than std::list? For GCC, some years ago, I measured that an array simulating std::map is 6-7 times faster.

    • @von_nobody • 7 years ago

      btw, how about intrusive lists? Or when you say "list" do you mean `List` instead of `List`?
      www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists
      These have very good properties if lists are very dynamic and objects can move between lists. Of course there is one big drawback: an object can belong to only one list at a time.

  • @MrAbrazildo • 7 years ago +6

    _Value.template_, I don't even know what this means.

    • @barrettellisadair • 7 years ago +6

      It is a disambiguation for accessing a templated member of an object that is itself a dependent type.

    • @xfreeman86 • 5 years ago +3

      It's hard to parse as a human, and you just happened to get it wrong is all. Read it as (Value).(template is)(). It's for when you call a template member function with type parameters that cannot be deduced from its formal parameters.

    • @NXTangl • 5 years ago +1

      @@xfreeman86 ...wait, you're Gordon Freeman's brother, aren't you?

  • @NoNameAtAll2 • 3 years ago +2

    black magic

  • @MrAbrazildo • 7 years ago +1

    > 5:00, in the derived class, you could have used the base constructor instead (C++11 and above): _using SmallVectorImpl::SmallVectorImpl;_.
    > 8:36, at least for GCC, _std::pair_ halves performance compared to 2 separate values. It was shocking to me (absurd geek moment)! However, I guess that an std::array, with first() and second() functions, fixes this issue.
    _Edit: nope. This array is as slow as std::pair._

    • @EvanED • 7 years ago

      Re. pair, what are you comparing? Separate correlated arrays for keys and values (e.g. 'KeyType keys[N]; ValueType values[N];) vs an array of pairs?

    • @MrAbrazildo • 7 years ago

      +EvanED I'm comparing:
      1) int m; int n;
      2) std::pair p;
      3) class Pair {
      std::array a;
      public:
      inline const int &first () const {return a[0];}
      inline int &first () {return a[0];}
      The same for second, returning a[1];
      };
      Case 1 is more than 100% faster than the other ones.
      I also tried to make a 4th case: packing a pair inside a size_t variable, which turned out to be more than 100% slower than case 1 too.

    • @Sopel997 • 7 years ago

      The thing is, an array is homogeneous.

  • @alexeiz • 7 years ago +1

    Contrary to what Chandler says, bit packing is slow. You won't find it used very often in modern code. Bit packing makes sense only if you save a substantial amount of data (megabytes).

    • @MattGodbolt • 7 years ago +20

      Do you have any qualification for that? On the machines I use (Haswell server class), bit packing is almost free, taking around 2/3rds of a CPU cycle to shift and mask.

    • @DeGuerre • 7 years ago +13

      Bit packing almost always wins if it means your working set fits in cache, or for anything that is to be transmitted over a network.

    • @deadalnix • 7 years ago +9

      Bit packing: 1/2 cycles.
      L1: 3-4 cycles. If your data fit in L1, then bit packing is probably not worth it. L1 is 32k.
      L2: 10-12 cycles. If your data fit in L2, it's probably a good idea, but I'm not sure you'll win a lot. L2 is 256k.
      If you manipulate anything bigger than this, bit packing is DEFINITELY worth it. Most programs are here.
      Caveat: bit packing makes your code bigger, so if you are icache bound (you have a large executable with a fairly flat profile), then you may want to reconsider.

    • @SerBallister • 7 years ago

      Depends on your use case: if you are primarily setting/clearing bits, then it will need more instructions compared to, say, using an array of bytes; but if you are testing them, particularly multiple bits at the same time, then you make savings.

    • @nameguy101 • 7 years ago +21

      Found the Javascript coder.

  • @jonesconrad1 • 5 years ago +1

    i just got lost on xkcd.com for 10 minutes

  • @rubberduck2078 • 2 years ago

    >llvm
    >Fast