CppCon 2016: Fedor Pikus “The speed of concurrency (is lock-free faster?)”

  • Published Oct 4, 2016
  • CppCon.org
    -
    Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/cppcon/cppcon2016
    -
    This talk takes the “ultimately practical” approach to concurrent programming, with a focus on lock-free programs: after all, in reality such programs are almost always written in the hopes of getting better performance. We are going to measure the performance of individual concurrent primitives and their effect on the overall performance of the whole program.
    The goal of the talk is two-fold. On one hand, I will show a set of tools and practices that can be used to get quantitative measurements of the performance of different implementations under various load conditions. Mastering these techniques will allow the attendees to choose their concurrent algorithms and implementations based on solid data instead of guesswork or “common knowledge” (which is often wrong or outdated). On the other hand, even with the focus on real-life applications, we can learn a few things about the fundamental nature of concurrent programs. This understanding proves especially useful when dealing with “common knowledge” and “simple logic”. For example, it is “common knowledge” that lock-free programs are faster than lock-based ones (not always). It is also “simple logic” that the hardware must protect shared memory access in a multi-core system, so ultimately locking is always present (sometimes true, sometimes true but misleading, and sometimes false). It is both “common knowledge” and “simple logic” that a wait-free program does not wait (but if your definition of wait is “will I have to wait for the results after I finish my coffee?”, then it definitely does).
    We will explore practical examples of (mostly) lock-free data structures, with actual implementations and performance measurements. Even if the specific limitations and simplifying assumptions used in this talk do not apply to your problem, the key point to take away is how to find such assumptions and take advantage of them in your specific application: after all, in reality it’s almost always about performance.
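
    As a taste of the measurement approach, here is a minimal harness sketch (illustrative only, not the talk's actual benchmark; thread and iteration counts are arbitrary placeholders) that times the same counter increment done under a mutex and as a relaxed atomic, across several thread counts:

        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <mutex>
        #include <thread>
        #include <vector>

        // Run `work` `iters` times on each of `nthreads` threads; return elapsed seconds.
        template <typename Work>
        double run(unsigned nthreads, std::size_t iters, Work work) {
            std::vector<std::thread> threads;
            auto t0 = std::chrono::steady_clock::now();
            for (unsigned t = 0; t < nthreads; ++t)
                threads.emplace_back([&] { for (std::size_t i = 0; i < iters; ++i) work(); });
            for (auto& th : threads) th.join();
            return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        }

        int main() {
            std::atomic<long> a{0};
            long n = 0;
            std::mutex m;
            for (unsigned t : {1u, 2u, 4u, 8u}) {
                double ta = run(t, 1000000, [&] { a.fetch_add(1, std::memory_order_relaxed); });
                double tm = run(t, 1000000, [&] { std::lock_guard<std::mutex> g(m); ++n; });
                std::printf("%u threads: atomic %.3fs, mutex %.3fs\n", t, ta, tm);
            }
        }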
    -
    Fedor Pikus
    Mentor Graphics
    Chief Engineering Scientist
    Portland, Oregon Area
    Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp. His earlier positions included Senior Software Engineer at Google and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998 when he made the switch from academic research in computational physics to the software industry. His responsibilities as Chief Scientist include planning the long-term technical direction of the Calibre products, directing and training the engineers who work on these products, design and architecture of the software, and research into new design and software technologies. Fedor has over 25 patents and over 90 papers and conference presentations on physics, EDA, software design, and the C++ language.
    -
    Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments • 10

  • @olaliljedahl9214 • 5 years ago • +10

    Write contention is the primary scalability limit for write-invalidate cache coherency protocols. Each thread that attempts to update a cache line must fetch the line (in exclusive or modified state) and invalidate all other copies. Lock-based and lock-free designs that have the same memory write behavior (from the perspective of the CC protocol) will also scale and perform similarly. There are pitfalls to watch out for though, e.g. CAS may request the line in exclusive state even if the comparison fails and the update is not performed ("false write contention"?). Similarly, certain spin lock implementations (using exchange instead of compare_exchange) can invalidate the line with the lock in the lock owner's cache even if the lock is already busy. CPU microarchitecture, interconnect (inter-CPU) latencies and thread count/contention level also influence the actual impact on performance and scalability.
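
    A sketch of the spin-lock distinction mentioned in this comment (my own minimal code, not the commenter's): a test-and-test-and-set lock spins on a plain load, so waiting threads keep the cache line in shared state instead of repeatedly invalidating the owner's copy with exchange.

        #include <atomic>

        class ttas_spinlock {
            std::atomic<bool> locked{false};
        public:
            void lock() {
                for (;;) {
                    // Attempt the write only when the lock looks free.
                    if (!locked.exchange(true, std::memory_order_acquire)) return;
                    // Read-only spin: no invalidation traffic while the lock is busy.
                    while (locked.load(std::memory_order_relaxed)) { /* spin */ }
                }
            }
            void unlock() { locked.store(false, std::memory_order_release); }
        };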

  • @RealVercas • 7 years ago • +13

    I was really hoping you'd test ticket locks as well. They are much more efficient than binary spinlocks in kernels (where the lock holder cannot be interrupted).
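
    For reference, a minimal ticket-lock sketch in the usual two-counter formulation (an illustration, not code from the talk): each thread takes a ticket and spins until its number is served, which gives FIFO fairness that a binary spinlock lacks.

        #include <atomic>

        class ticket_lock {
            std::atomic<unsigned> next{0};    // next ticket to hand out
            std::atomic<unsigned> serving{0}; // ticket currently being served
        public:
            void lock() {
                unsigned my = next.fetch_add(1, std::memory_order_relaxed);
                while (serving.load(std::memory_order_acquire) != my) { /* spin */ }
            }
            void unlock() {
                // Only the lock holder writes serving, so a plain increment is safe.
                serving.store(serving.load(std::memory_order_relaxed) + 1,
                              std::memory_order_release);
            }
        };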

  • @user-yc5fq9bv3u • 2 years ago • +6

    18:11 I do not believe that spinlocks are power-efficient at all. If an atomic access takes more CPU time but does not waste power, it is probably the more efficient option even though it takes longer.
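
    One common mitigation for the power cost of busy-waiting, sketched here for an x86 target (the helper name cpu_relax is my own): put a pause hint inside the spin loop so the core draws less power and ties up fewer pipeline resources while waiting.

        #include <atomic>
        #if defined(__x86_64__) || defined(_M_X64)
        #include <immintrin.h> // _mm_pause
        #endif

        inline void cpu_relax() {
        #if defined(__x86_64__) || defined(_M_X64)
            _mm_pause(); // hint: reduces power and pipeline pressure while spinning
        #else
            // no-op fallback; on ARM one would use a "yield" hint instead
        #endif
        }

        // Read-only spin with a relax hint in the loop body.
        void spin_until_unlocked(std::atomic<bool>& locked) {
            while (locked.load(std::memory_order_relaxed)) cpu_relax();
        }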

  • @hanyanglee9018 • 2 years ago • +2

    I'm curious: if a singly linked list or queue is used only for task dispatch, each child thread just takes its one node and leaves. The threads only have to modify the head, and afterwards each consumes its own node privately. That approach is much simpler.
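
    A sketch of the scheme this comment describes (names and payload are mine): the list is fully built before the workers start, so each worker needs only one successful compare-exchange on the head to detach its node, and no ABA problem arises because nothing is pushed concurrently.

        #include <atomic>

        struct Node {
            int task;    // placeholder payload
            Node* next;
        };

        std::atomic<Node*> head{nullptr}; // list is fully built before workers start

        // Detach one node from the front; returns nullptr when the list is empty.
        Node* take_one() {
            Node* n = head.load(std::memory_order_acquire);
            while (n && !head.compare_exchange_weak(n, n->next,
                                                    std::memory_order_acquire,
                                                    std::memory_order_acquire))
                ; // on failure, compare_exchange_weak reloads n and we retry
            return n;
        }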

  • @MohammadHusain • 7 years ago • +2

    Great talk!

  • @MatthijsvanDuin • 5 years ago • +5

    On plenty of architectures (e.g. ARM) incrementing an std::atomic is not wait-free

    • @olaliljedahl9214 • 5 years ago • +4

      ARMv8.1-a introduces atomic instructions such as LDADD and STADD. There could still be delays due to contention and I don't know to what extent fairness is guaranteed.
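
      To illustrate this thread (the function below is mine; the instruction names are from the ARM architecture): the same fetch_add compiles to a load-linked/store-conditional retry loop on ARMv8.0, which is lock-free but not wait-free since the store-conditional can keep failing under contention, and to a single LDADD-family instruction when LSE atomics are available (e.g. -march=armv8.1-a).

          #include <atomic>

          std::atomic<long> counter{0};

          long bump() {
              // ARMv8.0:   loop { LDAXR; ADD; STLXR; branch back on failure }
              // ARMv8.1-a: a single LDADDAL instruction, no retry loop
              return counter.fetch_add(1, std::memory_order_seq_cst);
          }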

  • @XiongZou • 4 years ago • +1

    Does anybody know the URL of Dan's talk that Fedor mentioned at the beginning of his talk?

  • @MrAlireza07 • 1 day ago

    It's a shame how badly this presentation is prepared. Hopefully nobody starts learning concurrency from here. Don't get me wrong, I like most of Fedor's talks, but in this one I'm not sure what the point is of showing performance results without providing any explanation.