Evolution of a Median Algorithm in C++ - Pete Isensee - CppCon 2023

Sdílet
Vložit
  • čas přidán 27. 02. 2024
  • cppcon.org/
    ---
    Evolution of a Median Algorithm in C++ - Pete Isensee - CppCon 2023
    github.com/CppCon/CppCon2023
    You won’t find std::median in <algorithm>. Should it be? Come on a journey to explore various implementations of median(), starting from specific use cases to progressively more generic versions. We’ll tune for performance and integrate more advanced features like execution policies and concepts. We’ll discuss the pros and cons of various approaches, and you can make your own decision about whether this algorithm and related statistical functions belong in the Standard Library.
    ---
    Pete Isensee
    Pete Isensee has been in the digital entertainment business since 1994. He's programmed video games, shipped three generations of Xbox consoles at Microsoft, created VR experiences at HBO, published dozens of technical articles, and taught C++ tips & tricks to game developers worldwide. Pete is currently a software engineering director at Reality Labs Research, helping unlock the potential of virtual and augmented reality. He enjoys jazz, wines and backpacking.
    ---
    Videos Filmed & Edited by Bash Films: www.BashFilms.com
    CZcams Channel Managed by Digital Medium Ltd: events.digital-medium.co.uk
    ---
    Registration for CppCon: cppcon.org/registration/
    #cppcon #cppprogramming #cpp
  • Věda a technologie

Komentáře • 33

  • @StefaNoneD
    @StefaNoneD Před 2 měsíci +5

    This talk was surprisingly good!

  • @user-ce9jx1bq5x
    @user-ce9jx1bq5x Před 3 měsíci +4

    I really love this presentation’s format! It feels very accessible.

  • @guiorgy
    @guiorgy Před 3 měsíci +4

    Great talk, accessible to all levels of programmers

  • @CharlesHogg
    @CharlesHogg Před 3 měsíci +1

    What a beautiful talk, and so well delivered too! One change I'd suggest is to keep the return type the same.
    When we exit the domain of the input types, the code gets harder to reason about. There will always be integral values that can't be represented in a floating point of the same size. If we started with integers, then let's stick with them! And if that means the median of 4 and 5 is 4, then so be it.

  • @ankos314
    @ankos314 Před 3 měsíci +2

    Such a clear and helpful talk!

  • @rickjames1220
    @rickjames1220 Před 2 měsíci +1

    Great talk! 🙂

  • @hymen0callis
    @hymen0callis Před 3 měsíci +2

    I really liked this speaker. Relatively simple topic, but presented in an entertaining and informative fashion. I took inspiration from this.

  • @debajyotimajumder472
    @debajyotimajumder472 Před 2 měsíci

    Great talk. Loved the talk

  • @kodirovsshik
    @kodirovsshik Před 2 měsíci

    Awesome talk!

  • @johnnycashcow1130
    @johnnycashcow1130 Před 3 měsíci

    Awesome talk! Wish there were more talks like tgis

  • @bruderdasisteinschwerermangel
    @bruderdasisteinschwerermangel Před 3 měsíci +1

    very interesting talk, and I yeah I think a percentile/median function in the standard library would be very nice to have.
    But to close off the story he started... if a beginner straight out of university wanted to write a template like this, I'd probably recommend them to stick to the 5 lines of code from the beginning. Generic algorithms are great but you shouldn't write a generic algorithm if you don't actually have a use for it.

  • @onebronx
    @onebronx Před 2 měsíci

    Very nice and concise talk, with a smooth progression to the point. Love it.
    Wouldn't be better to pass a "midpoint policy" as a template parameter of the function, with the std::midpoint as a default policy?

  • @CharlesHogg
    @CharlesHogg Před 3 měsíci +1

    Good point about the permutation being both surprising, and necessary for performance. But, good news: copying is also O(N)! So you can take the container by value, and still have an O(N) algorithm without mutating the user's data.

    • @AlfredoCorrea
      @AlfredoCorrea Před 3 měsíci +1

      I think that if the element type is arithmetic (midpoint is well defined) one can devise an algorithm that is N log N in the average case, by doing bisection on guesses for the median, without modifying the container or using extra space.
      The problem then would be how to call these algorithms when you have a non-modifying version and a modifying versions, or if just using overloads with the same name. Maybe the presented algorithm should be called median_range and the non modifying version should be called arithmetic_median_by_bisection, since they return different things.

    • @isodoubIet
      @isodoubIet Před 11 dny

      Taking the range by value might still mutate the underlying container if it's a borrowed range.

  • @AlfredoCorrea
    @AlfredoCorrea Před 3 měsíci

    I sounds to me that, for consistency, the max_element (even case) should be brought next to the mid element by an extra swap or by reverse first element. This in turn can simplify the result type since both iterators will be next to each other. The cost of the extra swap could be amortized by subsequent median calls on half ranges (quartiles) or by a total sort if it is necessary in an eventual next step.

  • @adityatrivediii
    @adityatrivediii Před 3 měsíci

    nice talk!

  • @Antonio-yy2ec
    @Antonio-yy2ec Před 3 měsíci

    Pure gold

  • @aaardvaaark
    @aaardvaaark Před 3 měsíci

    Loved that. Was disappointed there was no time for questions at the end as they would've been interesting to hear. Near the start of the video, I paused it and googled algorithm for median, and there was mention of a "median of medians" algo being the optimum one. I'm left wondering if it would be faster than (and just as flexible as) the one presented.

    • @sinom
      @sinom Před 2 měsíci +1

      nth element is a generalization of the median of medians! So if you give it a mid point it will be exactly the same as median of medians, while also supporting other split points with other inputs.

  • @fareloz
    @fareloz Před 3 měsíci

    14:45 But what about std::next, std::prev for iterators? In the code at this point we use operator+() instead which is not generic

    • @Roibarkan
      @Roibarkan Před 3 měsíci +4

      The presented algorithm calls nth_element, and thus requires that the range is a std::random_range. The std::random_access_iterator concept (which lies at the heart of std::ranges::random_access_range) is fine with using operator+() as means of increment. So, I believe there’s no value in using std::next() (nor std::ranges::next()) - the more generic and works for ‘lesser’ iterator categories (e.g. input, output, etc.) in this scenario.

    • @fareloz
      @fareloz Před 3 měsíci

      ​@@Roibarkan Agreed!

  • @borone8573
    @borone8573 Před 3 měsíci

    I’m concerned about passing range as universal reference. It means that range must be copied or moved, doesn’t it? And it seems like we don’t copy anything, so const& won’t compile

    • @Roibarkan
      @Roibarkan Před 3 měsíci

      I believe the universal reference is always a reference, and won’t cost an extra copy. It’s not ideal because it will bind to a temporary container, and cause the algorithm to return dangling iterators into the container having been destructed. I think this issue is also relevant to most std::ranges algorithms (such as nth_element) so we’re in good company. I believe Nico Josuttis discusses this lifetime issue in his great back-to-basics talk: czcams.com/video/26aW6aBVpk0/video.html

    • @AlfredoCorrea
      @AlfredoCorrea Před 3 měsíci

      @Roibarkan It is not a matter of “ideal” or not, Universal references are necessary because they can take subranges that are temporaries but do not own the data or even have well defined iterators even after their destruction.
      @borone8573 And no, they can bind to an existing object without copying or moving

  • @axelriet
    @axelriet Před 3 měsíci +2

    You just “stole” one of my favorite interview questions 😊 Besides that the one answer that everyone is looking for is that of an indexed skiplist. Bonus point if it’s a *circular* indexed skiplist so you can compute the *moving* median, trimeans etc in O(logN) for any window size. I have a working implementation in C# that is fast enough to handle 15,000 book updates per second over 1,500 currency pairs from Binance (did this for an elusive trading bot that I never productized)

  • @fareloz
    @fareloz Před 3 měsíci +4

    13:45 So the basic naïve algorithm works 1s on 10M data and is run once a day. Do we really need to squish out every millisecond out of it? This is an example of when generic algorithm is worse solution, because it requires much more time to develop (hitting a lot of snags and investigation time), much more expertise and the outcome is much more cumbersome to read and maintain. Sometimes small unoptimized ad-hoc function is much better if it does the job for the task required.

    • @Roibarkan
      @Roibarkan Před 3 měsíci +11

      I think Pete’s goal wasn’t to show how a new-hire should develop on their first task at a company, but rather how a library such as the STL could solve an entire category of tasks (the median algorithm) for the entire c++ community

    • @doroke
      @doroke Před 3 měsíci +2

      I have profiled library code from signal analysis that spent 99% of its time calculating medians. It is absolutely important to give developers of all skill levels access to the lowest complexity implementations that exist!

    • @BilingualHobo
      @BilingualHobo Před 2 měsíci +1

      I recently had to rewrite a median filter algorithm that was taking 30ms to process ~10M pixels. The slowdown was the time it took to sort a 9 element kernel. And next I need to update it to get

    • @sinom
      @sinom Před 2 měsíci +1

      As he also said in this talk, in some cases the naive algorithm was also just incorrect or had UB. If you're never gonna call it with an empty vector, only want to pass in vectors and not any other containers, and don't care about it being correct in the even case then sure, the original is fine. But if you care about any of those then you should at least use part of the generalization.
      Also if you can use a better standard algorithm (like in this case nth element with median of medians instead of sort) I don't see too much of a reason not to.

    • @TheOnlyAndreySotnikov
      @TheOnlyAndreySotnikov Před 2 měsíci

      The generic version should be something that flows under the fingertips of an experienced developer without thinking twice. It requires much more time to develop, only for a newbie. This is why newbies have to learn. The same is true for readability. If you think it's challenging to read, you should probably learn. Your complaint is akin to: "Let's not use integrals in calculus; they are too difficult to read for me."