Google Patches Linux kernel with 40% TCP performance

  • Published 6 Jun 2024
  • Google submitted a patch to Linux kernel 6.8 that improves TCP performance by up to 40%. This is done by rearranging the TCP structures for better CPU cache-line utilization; I explore it here.
    0:00 Intro
    0:30 Google improves Linux Kernel TCP by 40%
    1:40 How CPU Cache Line Works
    6:45 Reviewing the Google Patch
    www.phoronix.com/news/Linux-6...
    lore.kernel.org/netdev/202311...
    Discovering Backend Bottlenecks: Unlocking Peak Performance
    performance.husseinnasser.com
    Fundamentals of Backend Engineering Design patterns udemy course (link redirects to udemy with coupon)
    backend.husseinnasser.com
    Fundamentals of Networking for Effective Backends udemy course (link redirects to udemy with coupon)
    network.husseinnasser.com
    Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
    database.husseinnasser.com
    Follow me on Medium
    / membership
    Introduction to NGINX (link redirects to udemy with coupon)
    nginx.husseinnasser.com
    Python on the Backend (link redirects to udemy with coupon)
    python.husseinnasser.com
    Become a Member on YouTube
    / @hnasr
    Buy me a coffee if you liked this
    www.buymeacoffee.com/hnasr
    Arabic Software Engineering Channel
    / @husseinnasser
    🔥 Members Only Content
    • Members-only videos
    🏭 Backend Engineering Videos in Order
    backend.husseinnasser.com
    💾 Database Engineering Videos
    • Database Engineering
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    Gears and tools used on the Channel (affiliates)
    🖼️ Slides and Thumbnail Design
    Canva
    partner.canva.com/c/2766475/6...
    Stay Awesome,
    Hussein
  • Science & Technology

Comments • 434

  • @hnasr
    @hnasr  3 months ago +25

    Get my Udemy Fundamentals of Backend Engineering course: backend.win

    • @Rovsau
      @Rovsau 3 months ago

      Kudos for a good explanation. Today, I learned!

    • @santapreese
      @santapreese 3 months ago

      hello! thanks for the insightful video :) is it possible for you to publish your content on LinkedIn Learning? some organizations have a subscription for that instead of Udemy.

    • @TheSulross
      @TheSulross 2 months ago

      Another tactic that could potentially be used by the Linux kernel to speed up the network stack would be to pin one or more CPU cores to exclusive use for packet pipeline processing. Pinning would mean taking said core away from the kernel scheduler as a resource to be managed during context switching of thread execution.
      This is a very well established tactic used by the Intel Data Plane Dev Kit (DPDK) and is referred to as Polling Mode Driver (PMD) approach. Such pinned cores execute code at 100% CPU cycle time, never block on anything - nor are they preempted by the kernel. In DPDK terminology, a pinned CPU core is referred to as an lcore and there is a DPDK API to execute C functions on said lcores. (Modern CPUs have so many cores that it warrants pinning some cores for specialized processing.)
      So I have a DPDK program where I can configure how many lcores to dedicate to an lcore worker pool. I'm using C++17 for this program. Various pipeline operations get established as constructed callable objects, which means that if I have a callable object stack variable called action, I can just invoke it as action(), or just treat it as a piece of aggregate data.
      Said callable object supports C++ move semantics. If the program is configured to execute the entire application on just a single lcore, then it just immediately invokes the callable object in that lcore's context. If there are multiple lcores, then the action is C++ moved into a lock-free work queue. All of the lcores in the worker pool are in an endless loop of retrieving a callable object from the work queue and then just invoking it. Each lcore is assigned a unique token. The callable objects are published using this set of tokens - the publisher just ring-cycles through the set of tokens as it publishes work items. The consuming lcores retrieve work items as identified by their respective token. This averts contention between lcores rendezvousing on the work queue and provides fair, round-robin, load-balanced processing.
      This kind of network programming greatly exceeds what can be accomplished by writing conventional networking applications that rely only on the kernel network stack and conventional multi-threaded concurrent programming. And such applications run entirely in user space - never touching the Linux network stack. So the high performance data plane pipelines never block, never transition to kernel mode, and never get preempted by the kernel scheduler.
      And naturally data structures are tuned around the cache line size.
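
      For context, a minimal C sketch of the lcore-launch pattern described above (a sketch assuming DPDK 20.11+ naming; the lock-free work-queue plumbing is omitted):

      #include <rte_eal.h>
      #include <rte_lcore.h>
      #include <stdio.h>

      /* body executed on each pinned worker lcore: it spins at 100% CPU,
         never blocking and never being preempted by the kernel scheduler */
      static int lcore_worker(void *arg)
      {
          (void)arg;
          printf("worker running on lcore %u\n", rte_lcore_id());
          /* the endless retrieve-and-invoke loop over the work queue goes here */
          return 0;
      }

      int main(int argc, char **argv)
      {
          if (rte_eal_init(argc, argv) < 0)   /* pins lcores per the -l/--lcores args */
              return 1;

          unsigned int id;
          RTE_LCORE_FOREACH_WORKER(id)        /* every lcore except the main one */
              rte_eal_remote_launch(lcore_worker, NULL, id);

          rte_eal_mp_wait_lcore();            /* wait for all workers to return */
          return 0;
      }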

    • @treyquattro
      @treyquattro 2 months ago

      ah, clickbait

  • @ThePapanoob
    @ThePapanoob 3 months ago +288

    There are multiple things that Coco did with "just moving around variables" that helped performance that you didn't even realize. :D
    When a cache miss happens, the CPU fetches a whole cache line (and the page translation lands in the TLB). Reorganizing the struct so that neighbouring variables are contextually dependent on each other means fewer round trips to main memory. But the other thing Coco did was organize the variables so the compiler didn't pad them as much. Compilers align each field to its natural boundary, so if you put a bunch of smaller variables (u8) between bigger ones (u32) you get a lot of padding, resulting in a much bigger struct. That in turn means more chances of the data you need not fitting in one cache line, resulting in more memory reads.
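
    To illustrate the padding point, a minimal C sketch (hypothetical field names; exact sizes depend on the ABI, but 16 vs 12 bytes is typical on x86-64):

    #include <stdint.h>
    #include <stdio.h>

    struct interleaved {      /* small fields between big ones */
        uint8_t  a;           /* 1 byte + 3 bytes padding */
        uint32_t b;
        uint8_t  c;           /* 1 byte + 3 bytes padding */
        uint32_t d;
    };

    struct grouped {          /* same fields, big ones first */
        uint32_t b;
        uint32_t d;
        uint8_t  a;
        uint8_t  c;           /* only 2 bytes of tail padding */
    };

    int main(void) {
        printf("%zu vs %zu\n", sizeof(struct interleaved), sizeof(struct grouped));
        return 0;
    }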

    • @lab-at-home
      @lab-at-home 3 months ago +17

      Ram bus go vrum vrum?

    • @petter9078
      @petter9078 3 months ago +5

      How do you even approach such a thing? How do you analyse it 🤔

    • @ThePapanoob
      @ThePapanoob 3 months ago

      @@petter9078 trial & error… 😅 and loads of hours in ghidra & the godbolt compiler explorer

    • @davidmartensson273
      @davidmartensson273 3 months ago +20

      @@petter9078 If you know about the problem, just looking at a struct will be enough to see possible "holes".
      I never did anything on this level, but back in the MS-DOS days you reorganized the order drivers were loaded in, because the OS started at the top and fetched the next free area the driver would fit into, but it never backtracked to check areas already skipped. So by rearranging the driver order you could get it to put a smaller driver into a space that otherwise got left empty.
      That way you could minimize the amount of memory below the 640 kB limit that the actual programs had access to, and that could be the difference between being able to run some games or not.
      Similar concept.

    • @mss664
      @mss664 3 months ago +6

      @@petter9078 Processors expose performance counters. It ain't perfect, but you can see roughly where the cache misses are.
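
      A minimal sketch of reading one such counter on Linux through the perf_event_open syscall (the pattern from the syscall's man page, not anything specific to this patch; error handling omitted):

      #include <linux/perf_event.h>
      #include <sys/syscall.h>
      #include <sys/ioctl.h>
      #include <unistd.h>
      #include <string.h>
      #include <stdio.h>

      int main(void) {
          struct perf_event_attr attr;
          memset(&attr, 0, sizeof(attr));
          attr.type = PERF_TYPE_HARDWARE;
          attr.size = sizeof(attr);
          attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* last-level cache misses */
          attr.disabled = 1;
          attr.exclude_kernel = 1;

          /* no glibc wrapper exists, so call the syscall directly */
          int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

          ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
          /* ... the workload under test goes here ... */
          ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

          long long misses;
          read(fd, &misses, sizeof(misses));          /* one 8-byte counter value */
          printf("cache misses: %lld\n", misses);
          close(fd);
          return 0;
      }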

  • @solarsunny
    @solarsunny 3 months ago +450

    Corollary: some wrong variable placements will give you 30% performance degradation.

    • @kartik4792
      @kartik4792 3 months ago +36

      It's not about variable placement but organising data in memory.

    • @earthling_parth
      @earthling_parth 3 months ago +21

      @@kartik4792 wasn't the variable placement what determined how the data was organized in memory?

    • @bladman9700
      @bladman9700 3 months ago +39

      @@kartik4792 yes. Organizing variables is how you organize the memory. Genius.

    • @absalomdraconis
      @absalomdraconis 3 months ago +10

      ​@@kartik4792 : Trying to distinguish variable placement and memory placement as different things will sometimes cause massive misunderstandings. Your pedantic attempt at such a distinction is bad and you should stop.

    • @elpapichulo4046
      @elpapichulo4046 3 months ago +4

      @@absalomdraconis I don't think it's that pedantic. IIRC the compiler can decide where on the stack a variable will be placed, so if two variables are declared after each other there might be no guarantee that they are also laid out next to each other in memory.

  • @tibbydudeza
    @tibbydudeza 3 months ago +354

    A standard C optimization technique back in the old days when you cared about performance - aligning your data structs via padding and bit fields to match how the CPU fetches data from memory and fills its caches.
    Programmers nowadays don't care about it.

    • @austinscott4695
      @austinscott4695 3 months ago +10

      wow, where can I read more about this?

    • @ErazerPT
      @ErazerPT 3 months ago +96

      It's not that you don't care, it's that for the most part you don't have access to it. If you're not writing ASM, C, C++ or similar, chances are you DON'T have low-level access to ensure optimizations actually work.

    • @zoenagy9458
      @zoenagy9458 3 months ago +21

      Rule 1: never optimize

    • @tibbydudeza
      @tibbydudeza 3 months ago

      It has been many decades since I did it.
      This was usually when you dealt with CPUs that wanted word-aligned data structs and could not handle unaligned data fetches, like the Motorola 68000, which would raise an exception.
      You would pad by hand in your structs and use unions to overlay them onto 32-bit or 16-bit words.
      Later your C compiler would automatically pad your C structs/unions to make sure they fell on word-aligned boundaries.
      The Intel memory controller was just smarter about it and did more than one memory fetch instead, which impacted performance.
      @@austinscott4695
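
      Roughly what that old-school style looked like - a hypothetical sketch, not actual 68000-era code:

      #include <stdint.h>

      struct packet {
          uint32_t addr;
          uint16_t flags;
          uint16_t _pad;        /* explicit padding keeps the size a multiple of 4 */
      };

      union packet_overlay {    /* union overlay for word-aligned raw access */
          struct packet p;
          uint16_t words[4];
          uint8_t  bytes[8];
      };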

    • @tibbydudeza
      @tibbydudeza 3 months ago

      Not many of us write kernel drivers or OS code.
      The game industry cares about performance - have you ever tried writing code for a PS3 with its oddball architecture?
      Believe me, you're going to be staring at profilers, at how fast the GPU is being fed with data, at whether the CPU is being kept busy, and at how much memory is being consumed.
      @@zoenagy9458

  • @youpapai
    @youpapai 3 months ago +130

    I miss systems programming; this was standard when writing drivers/kernel modules etc.
    High-level programming, fast CPUs and large caches have spoiled everyone.

    • @ssl3546
      @ssl3546 3 months ago +27

      More like the Linux kernel is so dang big now that optimization projects like this get overlooked. When Linux was running on 8MB systems (with X!) much more attention was paid to small details.

    • @avalagum7957
      @avalagum7957 3 months ago +1

      Can the compiler help anything here?

    • @jfbeam
      @jfbeam 3 months ago +6

      @@ssl3546 There shouldn't be any "optimization projects". If people paid attention to the efficiency of their code _when they first f'ing write it,_ no one would ever need to come back later to rewrite it. But that takes intelligence and time; things modern programmers don't have. (Make things that can be sold, and get them to market as fast as possible. Quality isn't necessary.)

    • @honkhonk8009
      @honkhonk8009 3 months ago

      It's partially because everyone thinks it's like JavaScript or Python, and doesn't realize that C is quite simple and straightforward.

    • @cheebadigga4092
      @cheebadigga4092 3 months ago +1

      It's always features over optimizations; the latter can come later at any point in time, just like this patch.

  • @johnswanson217
    @johnswanson217 3 months ago +96

    Chill guys... It's not about superiority, it's about productivity. Google does this because it can reduce their server cost significantly, not just because they're capable of doing so. There are tons of micro-optimizations left to be done out there, but nobody cares because they don't make money. It's that simple. Follow money, not your ego.

    • @kounelios13
      @kounelios13 3 months ago +8

      No one said otherwise. Of course, at the end of the day this PR benefits Google in more than one way.

    • @hanifa205
      @hanifa205 3 months ago +4

      I agree with you. People will pay attention to details when they can see $$ in their eyes.

    • @chetan_naik
      @chetan_naik 3 months ago +2

      Is the improvement for both ends of the connection? Meaning on both client and server?

    • @TimeLemur6
      @TimeLemur6 3 months ago

      Without looking into it beyond this video, I'd say it'll benefit both. That said, this type of optimization is dependent on scale, so the client will likely see negligible benefit unless they have a large number of connections open simultaneously.

    • @kaoboom
      @kaoboom 3 months ago

      @@chetan_naik The improvement is measured in CPU seconds spent divided by the duration of a TCP round trip, so there's no real data in the patch notes about whether this helped with the latency portion.
      Basically, it depends on whether the bottleneck is the CPU side or the NIC side.

  • @kelvinleung3956
    @kelvinleung3956 3 months ago +8

    Love that you share the latest changes and improvements in the industry with easy-to-understand explanations. Love the passion, and thank you for sharing! Could you please cover more OS/memory/CPU-level aspects of how code is executed in future videos?

  • @shapelessed
    @shapelessed 3 months ago +103

    So basically all they did was move things around to maximise the amount of data inside the structs that, accessed sequentially, would fit inside a 64-byte chunk of memory...
    Let's say I need to handle a request, and I need the variables X and Y, but there are 80 bytes of other stuff between them. Now the CPU needs to fetch those two variables in two separate requests. If X and Y were next to each other, they would most likely fit inside one 64B block of memory, which would then get cached. When we're finished with X, we don't need to re-request Y separately, because it has already been cached from that 64B chunk we requested previously. We just cut our memory access time by 50%.
    I didn't think about these sorts of things originally, but then I decided I'm a masochist and wrote a virtual filesystem with native AES encryption. I really learned a lot about memory when designing it, especially since AES works on 64-bit/8B chunks, so everything, every sector and block, had to be structured perfectly...
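
    In C, the X/Y example above looks roughly like this (a sketch with hypothetical names; 64-byte cache lines assumed):

    #include <stdint.h>

    struct conn_cold_between {
        uint64_t x;            /* hot: offset 0, cache line 0 */
        uint8_t  other[80];    /* 80 bytes of rarely-used fields in between */
        uint64_t y;            /* hot: offset 88, cache line 1 -> a second fetch */
    };

    struct conn_hot_first {
        uint64_t x;            /* hot fields adjacent: both in cache line 0 */
        uint64_t y;
        uint8_t  other[80];    /* cold fields moved after */
    };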

    • @amalragc
      @amalragc 3 months ago +2

      You explained it very clearly and in a very simple way. Thank you.

    • @rupertsmixtapes812
      @rupertsmixtapes812 3 months ago +1

      I thought there was only AES-128 and AES-256, with 128 bits being a block size of 16 bytes, not 8 bytes..?

    • @giannismentz3570
      @giannismentz3570 3 months ago +1

      Thanks for the explanation. By your "all they did", I assume this was simple to do or see? Then why hasn't this been done ages ago? A 40% improvement is no joke.

    • @shapelessed
      @shapelessed 3 months ago +2

      @@rupertsmixtapes812 Yeah, sorry, I forgot whether it was 8 or 16 bytes because it was like 2 years ago and I haven't really needed to dive into the details of any encryption algorithm since then, thanks.

    • @shapelessed
      @shapelessed 3 months ago +1

      @@giannismentz3570 I clicked the video because I was interested. I had heard about CPU caching and memory, and then after the word "structs" I suddenly realised: "Hell, they only describe the type and order of data inside memory, I bet they just moved things around so that the most-needed props are closer together and faster to read."
      And as described in the video, most people these days use, or start out their journey using, high-level languages that don't give a damn about the order of data, much like JavaScript and TypeScript's structs/interfaces, which only serve you at compile time. ThePrimeTime recently posted a video about how junior devs are "noobs", where they discussed how people have stopped learning to truly understand what the code is actually doing on their machines, mostly because computers are so fast that people started focusing more on writing good-looking code that is quick to understand and work with, rather than squeezing out performance they will probably never need.
      The abundance of computing resources made these hyper-optimisations uneconomical; it's more profitable to roll out features than to refine existing ones.
      When you go to a store, do you spend your precious time finding a plastic bag somewhere in your home and bother bringing it with you, or do you just buy a new one every time? They are cheap and abundant...

  • @sammyj29
    @sammyj29 3 months ago +20

    It's just the way struct packs variables into memory. Going back to the basics!

    • @heyhoe168
      @heyhoe168 3 months ago +1

      Yeah, but one should always check against the current architecture specification. What worked for the i386 might be wrong for the i686.

  • @susmitvengurlekar
    @susmitvengurlekar 3 months ago +5

    Superb. It's just like reading a page when it comes to databases, which is why a clustered table has its benefits. It's all about optimising for the next access.

  • @Twangaming
    @Twangaming 2 months ago

    Thanks for covering this!

  • @michaelelkin9542
    @michaelelkin9542 2 months ago

    That is a very good description. Surprising that something so simple makes such a big difference. I would have expected this to be a compiler optimization, and I hope it will be in the future.

  • @ahmetyasarozer
    @ahmetyasarozer 3 months ago +8

    Thanks Hussein ❤

  • @_sudipidus_
    @_sudipidus_ 3 months ago +3

    This is fascinating. Recently I was exploring how Golang pads struct variables to align them to the 32/64-bit word length so that memory reads align.
    It never occurred to me that it could bring this huge an improvement. I thought it was a micro-optimization, but I guess it builds up when it repeats (the networking stack and TCP connections definitely operate at huge volume, and it's good to see the micro-optimization getting multiplied into a significant improvement).

  • @BR-lx7py
    @BR-lx7py 3 months ago

    Love this! Best part is that it's trivially correct.

  • @mattmmilli8287
    @mattmmilli8287 3 months ago

    Very nice topic and presentation

  • @darlokt51
    @darlokt51 3 months ago +44

    AFAIK this is, as the data shows, an AMD-specific optimization. Intel uses a highly advanced prefetcher compared to AMD, which unlike normal hardware prefetchers works not only on memory address accesses but also on the data values themselves. Because of this, a physical reordering of the data in memory does not significantly impact the accuracy of Intel's prefetcher.

    • @BattousaiHBr
      @BattousaiHBr 3 months ago +6

      interesting. so that means there could be many other areas of the kernel that could be heavily improved on AMD whereas intel is mostly unaffected.

    • @darlokt51
      @darlokt51 3 months ago

      @@BattousaiHBr Yes, AMD's processors are overall a bit rough around the edges from a hardware perspective and brute-force oriented. One can optimize programs for the quirks of individual architectures (as is done in supercomputing, where you know your hardware and can easily optimize for it), but this generally comes with a regression for other architectures, which you want to avoid. This change, for example, is simply better in general for the access patterns seen in production; even Intel profits, if less, while AMD gains more due to the weaknesses in its architecture. But because of this, the better solution is smarter prefetchers and branch predictors, to improve performance in general, because we all write shitty code and it's better if you don't lose 40% performance to bad code or hardware architecture. Not every program can be optimized to this level; hell, most programming languages don't even offer this low-level memory management. To make all code faster, the platform it runs on must be smarter.

    • @ismbks
      @ismbks 3 months ago

      where did you read about this

    • @darlokt51
      @darlokt51 3 months ago +18

      @@ismbks It's related to my work. You can find information about the different prefetchers in CPUs in security guides from the manufacturer. Prefetchers and branch predictors can, if wrongly implemented, lead to security vulnerabilities; information about them is mostly only published when a security vulnerability is found, as they are responsible for a lot of a current processor's architectural performance gains.

    • @tacokoneko
      @tacokoneko 3 months ago +3

      @@darlokt51 are you prohibited from revealing more information by your anonymous employer

  • @marcosfonseca2249
    @marcosfonseca2249 3 months ago +10

    Well, some compilers do organize variable order in structs, but it is mostly based on alignment, to get efficient memory access. In this case, AOT compilers would have trouble because this kind of optimization relies on runtime aspects of the problem. Consequently, optimizing this at the compiler level becomes more difficult, given that those compilers are generic.

    • @rogo7330
      @rogo7330 3 months ago +2

      It would break everything if code compiled with those sorts of optimizations called other functions that take that struct. It should work for structs that are only used by static functions in one file's scope, or with something like -fwhole-program (which basically makes everything static and exposes nothing outside the ELF file), but I never checked that.

    • @phillipsusi1791
      @phillipsusi1791 2 months ago

      In C, the compiler is required to place the variables in memory in the same order they were declared in the struct. It is allowed to pad them for alignment, but not to change the order.
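
      This is easy to verify with offsetof - a tiny sketch:

      #include <stdio.h>
      #include <stddef.h>

      struct s { char a; int b; char c; };

      int main(void) {
          /* offsets increase in declaration order; the compiler may pad
             between members, but may not reorder them */
          printf("a=%zu b=%zu c=%zu size=%zu\n",
                 offsetof(struct s, a), offsetof(struct s, b),
                 offsetof(struct s, c), sizeof(struct s));   /* typically 0 4 8 12 */
          return 0;
      }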

  • @sweetshit4931
    @sweetshit4931 3 months ago

    Really insightful.

  • @Idlecodex
    @Idlecodex 3 months ago +6

    Hey Hussein, this is the classic case of keeping memory compact according to its temporal access. Maybe Intel's branch prediction was already doing the job of keeping the cache hot, and that's why there was not much gain there. Another possibility is that the compiled code differs on the two architectures. Anyhow, I always hear that optimization is architecture-dependent; this seems to be the case.

  • @TheEndorfinized
    @TheEndorfinized 2 months ago

    Dig your observations. Absolutely... everyone should look out for the things we overlook.

  • @meqdaddev4341
    @meqdaddev4341 3 months ago

    Impressive!
    How small, tiny things can have such big impacts!

  • @Sil3ntOne
    @Sil3ntOne 3 months ago +2

    Subscribed 🎉

  • @honkhonk8009
    @honkhonk8009 3 months ago +3

    What's great about C is the fact that it's so deterministic and simple.
    Structs? Oh, they're just basically arrays lol.
    Then it starts making sense how to optimize it for the cache.

  • @zoenagy9458
    @zoenagy9458 3 months ago

    excellent content!

  • @AmjadKhateeb
    @AmjadKhateeb 3 months ago +3

    Nice explanation. Regarding your question at the end about letting the compiler auto-optimize the order: I don't think it will work. The struct is tied to many things that depend on the order, like serialization and saving data to binary files. So if you serialize or save to a file in some order, then you have to deserialize in the same order.

    • @cameramaker
      @cameramaker 3 months ago +1

      I think the best would be to have an allocation hint - to cacheline boundary (64B currently) for every malloc/kalloc call (or just align everything that way, lol)
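
      In userspace C11 that hint already exists - a minimal sketch (64-byte lines assumed):

      #include <stdlib.h>
      #include <stdio.h>

      int main(void) {
          /* C11 aligned_alloc: the size must be a multiple of the alignment */
          void *p = aligned_alloc(64, 256);
          printf("%p\n", p);    /* the low 6 bits of the address are zero */
          free(p);
          return 0;
      }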

  • @devenparab
    @devenparab 1 month ago

    Such a beautiful explanation, Nasser; it feels like listening to a song.

  • @berndeckenfels
    @berndeckenfels 2 months ago +1

    It's not only about moving things next to each other, but also about moving things which are updated by different CPUs onto different cache lines, so they don't need to be invalidated on other sockets. This used to be done more regularly (besides having per-CPU structs).
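
    A minimal C11 sketch of that idea (the kernel uses annotations such as ____cacheline_aligned_in_smp for the same purpose; field names here are hypothetical):

    #include <stdalign.h>
    #include <stdint.h>

    /* counters written by different CPUs are forced onto different 64-byte
       lines, so one CPU's writes don't invalidate the other CPU's line */
    struct stats {
        alignas(64) uint64_t rx_packets;   /* written by the receive path */
        alignas(64) uint64_t tx_packets;   /* written by the transmit path */
    };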

  • @tesla1772
    @tesla1772 3 months ago

    Can't wait for the OS course

  • @cheebadigga4092
    @cheebadigga4092 3 months ago

    Wow!! I have never thought once about how the alignment of C struct fields interacts directly with memory pages and L3 cache. Simple but massively effective. Props to Google for that one!!

  • @thinkingcitizen
    @thinkingcitizen 2 months ago

    I've done your Udemy course, I've watched 40-50 of your videos, I just want you to know I frigging love you

  • @BlurryBit
    @BlurryBit 3 months ago +2

    ok wow 40%, you have my interest!!! 😮

  • @matthewsheeran
    @matthewsheeran 3 months ago +3

    The cache line seems entirely analogous to pages in the kernel page cache: huge performance increases are par for the course. For higher-level programmers, things like iterators and lazy loading make similarly huge performance improvements, not to mention threading and multi-processing. I am guessing that you may have hit the nail on the head at the end: variable location may be compiler-dependent, but putting variables in a struct may make the compiler follow that exact order of definition.

  • @morthim
    @morthim 2 months ago +3

    'they just moved some variables around'
    Literally the job of a programmer.

  • @Felipe-bi3mk
    @Felipe-bi3mk 2 months ago

    Hey Hussein, would you make a video about how it works when we have an index on column X and we make a query that has "table.X IN (1, 2, 3, ..., 300)" with a big list of parameters inside the IN clause? I know it uses the index, but it seems very slow.

  • @ac130kz
    @ac130kz 3 months ago

    You can also lower the connection close timeout for essentially free performance.

  • @ahmdelemam
    @ahmdelemam 3 months ago

    Beautiful as always, Hussein ❤

  • @rafagd
    @rafagd 3 months ago +11

    The compiler can't auto-reorder variables because it would break ABI compatibility with anything compiled against your library.

    • @anandsharma7430
      @anandsharma7430 3 months ago +1

      This aspect should be given more importance in the discussion. I assumed that compilers already did these kinds of optimizations. There's also the issue of whether the network hardware stack is so uniform (around the world) that these optimizations work on the majority of routers / switches / other network server hardware - enough to make it into the mainline kernel.

    • @rafagd
      @rafagd 3 months ago +3

      @@anandsharma7430 There's a very interesting document in the KDE dev guidelines, and one of the biggest no-nos is that you can't change the order of, or even add/remove, non-static data members in a class that is in use (regardless of visibility), because it will break binary compatibility.

    • @laden6675
      @laden6675 1 month ago

      C is so old. Insane to think the performance improvements Linux could get if they fully switched to Rust, since it does reordering and this stuff is internal.

    • @rafagd
      @rafagd 1 month ago

      @@laden6675 That is not how it works. While the C language is old, its compilers are not. They are, in fact, built on the same LLVM backend that powers Rust, and the language is simple enough that loads of optimizations can be applied to it. C is still around for a reason.
      Of course, Rust CAN produce faster programs than C, but the inverse is often the case.
      Rust's edge here is that it's harder to leave a security hole in it than in C, and in most cases that comes at zero performance cost. Harder, but not impossible.
      Finally, Rust's ABI is famously not stable, so we'll still have to use the C ABI to interop with dynamic libs for a while. While the Linux kernel is monolithic, it still needs to expose system calls that will be called from C programs, so they would need to disallow the variable reordering anyway.

    • @laden6675
      @laden6675 1 month ago

      @@rafagd C compilers need to be standards-compliant, and changing the order of fields from the declared one is non-standard.
      Even if they could, because of how C programs are structured - a program is built from a collection of independently compiled object files that need a stable ABI - there is much less opportunity for optimization than in Rust.
      The vast majority of structs and functions are NOT exposed to userspace, so this is a strawman.
      > C is still around for a reason.
      Inertia. It will take time, but over time there will be less and less C code in the software we use.

  • @notgauravmehta
    @notgauravmehta 3 months ago +4

    Cool video - I don't think it would make sense for the compiler to reorder the members of a struct. In some cases, I've cast an entire struct to a byte stream and deserialized it somewhere else. If the compiler reordered the members of the struct, my deserialization logic would fail on the other side, which would be unexpected behavior.

    • @PaulSpades
      @PaulSpades 2 months ago

      Completely true. A struct needs to be binary predictable, which is why this works.

  • @NaokiWatanabe
    @NaokiWatanabe 3 months ago

    L3 cache latency on an EPYC CPU is in the tens of nanoseconds range (19 to 31ns).

  • @rayc1557
    @rayc1557 2 months ago +1

    In the 1980s all through the 90s, we did all our programming with this level of attention. Before then, I programmed mainly in assembly language - which required extreme attention to every instruction in the listing. BTW: the fact that Intel processors saw only a 1 to 5% increase could be attributed to Intel possibly having a better caching structure and algorithm to start with.

  • @shadyabhi
    @shadyabhi 3 months ago

    How do I identify these data reorganization possibilities? Any resources?

  • @maxwellsmart3156
    @maxwellsmart3156 3 months ago

    Benefits of runtime code analysis using a profiler. You can also improve performance by moving code blocks around. Most of the time it isn't about how fast the code is but how fast one can produce the code. There have been many hardware architectural and performance improvements since 1991, and most of the time code just lies dormant and gets reused.

  • @shadyabhi
    @shadyabhi 3 months ago

    What resource are you currently using to learn more about cache lines and data driven design?

  • @anandnaik9802
    @anandnaik9802 2 months ago

    Can you please tell how it was discovered that such an optimization could be made? Are there performance-monitoring libraries for this specific purpose? Also, since this is an optimization problem, what is the best approach to finding the optimum - i.e., was this solution arrived at by human knowledge along with some data analysis, or was the problem solved by some reinforcement-learning approach?

  • @myne00
    @myne00 2 months ago +1

    It's kinda crazy that optimisations this big are still possible after 30 years.

  • @tenthlegionstudios1343
    @tenthlegionstudios1343 3 months ago

    Awesome video. For more on this topic of memory alignment and cache lines, I highly recommend Andrew Kelley's talk from the Handmade Cities conference. This is also loosely known as data-oriented design. The gains he got in the Zig compiler with similar optimizations were insane, and he has real examples from the code. Thanks for the vid Hussein!

  • @hackaholic01
    @hackaholic01 3 months ago

    I know something about spatial locality (when an address is accessed, nearby addresses are likely to be accessed soon). It was not obvious that a struct should hold related variables adjacent to each other. How could they miss this in the first place?

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 2 months ago +1

    Rust automatically optimizes field order, so your struct stays readable but doesn't require excessive padding.

  • @rollinkendal8130
    @rollinkendal8130 3 months ago

    awesome!

  • @xcoder1122
    @xcoder1122 3 months ago

    C compilers could in theory optimize structs themselves by re-ordering their members. The problem is that the struct layout is not stored in the compiled code, so if a binary and multiple libraries share the same struct type and are built with different compilers, or different versions of the same compiler, the order may differ and everything would break. That's why compilers won't change the order. Compiled code also relies on the order chosen at compile time, so every time the order changes, all code that reads from or writes to the struct must be re-compiled. Optimizing the order of structs is something programmers must do by hand in C, and they can only do it if no external code relies on a given order (as is the case with the Linux kernel, where non-kernel code doesn't know anything about those in-kernel structures).
    In languages other than C, compilers may re-order the members of data structures themselves, but that is only possible if the compiled code stores the order the compiler used, and if all compiled code is written in such a way that it can adapt to any order chosen - and any order change in the future - without requiring re-compilation. The downside of such a feature is that all structure access becomes slower and the code gets bigger and more complicated. Keep in mind that C is simple, low-level and supposed to be fast.

  • @olafbuitelaar
    @olafbuitelaar 3 months ago

    The Intel bench seems to have a 200 Gb/s NIC vs. only 100 on the AMD?

  • @straightup7up
    @straightup7up 2 months ago

    Significant performance improvements with simple realignment.
    I wonder if changing tab stops from 4 to 2 will have similar effects.

  • @embeddedbastler6406
    @embeddedbastler6406 3 months ago +3

    One has to wonder which performance improvements might be possible by adopting a data-oriented design, i.e., switching from the traditional Array-of-Structs (AoS) to the more cache-friendly Struct-of-Arrays (SoA) memory layout - see the sketch after this thread.

    • @PR-cj8pd
      @PR-cj8pd 3 months ago

      Is this theoretical or does anything like that already exist?

    • @embeddedbastler6406
      @embeddedbastler6406 3 months ago +2

      @@PR-cj8pd As an example, data-oriented design is becoming more common in modern game engines in the form of Entity Component Systems (ECS). For compute-intensive workloads such as physics simulations, this approach promises a considerable speedup.

    • @smlgd
      @smlgd 2 months ago +2

      @@PR-cj8pd Computer graphics code uses this a lot, from renderers to game engines; often the whole software architecture is based on that concept.
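
      A minimal C sketch of the AoS-to-SoA switch mentioned at the top of this thread (hypothetical fields):

      /* Array-of-Structs: one particle's fields sit together in memory */
      struct particle { float x, y, z, mass; };
      struct particle aos[1024];

      /* Struct-of-Arrays: each field is its own contiguous array, so a loop
         touching only x streams through cache lines with no wasted bytes */
      struct particles {
          float x[1024], y[1024], z[1024], mass[1024];
      } soa;

      float sum_x_aos(void) {
          float s = 0;
          for (int i = 0; i < 1024; i++)
              s += aos[i].x;     /* 12 of every 16 fetched bytes go unused */
          return s;
      }

      float sum_x_soa(void) {
          float s = 0;
          for (int i = 0; i < 1024; i++)
              s += soa.x[i];     /* every fetched byte is used */
          return s;
      }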

  • @Skungalunga
    @Skungalunga 3 months ago +2

    I'd like to know the relative TCP/IP performance of Intel and AMD before the optimization was implemented. Is it that AMD was faster and the gap grew, that performance was the same and AMD gained an advantage, or that Intel was faster in the first place? Long ago Intel used a cache-inclusive approach while AMD's cache levels were exclusive and so had higher latency. I'm not sure if that is still the case, but it might have some bearing on how to optimize the code.

  • @yashkhd1100
    @yashkhd1100 3 months ago +22

    I can't believe the kernel TCP/IP stack was written without padding in mind. Years ago when I used to write game engines, the first thing you did with various data structures was align them with padding. There are tons of pragmas on all major C/C++ compilers to support this. This perf boost is good... but it's like someone forgot to change the engine oil in a car, and when they finally change it after years they feel good about it and the car runs smoother too.

    • @kanishkgoel2673
      @kanishkgoel2673 3 months ago +9

      This one didn't have a lot to do with padding, more with spatial locality: things that are used together should be kept close together.

    • @rogo7330
      @rogo7330 3 months ago +2

      The compiler already pads variables for you: all longs will be at addresses aligned to long's size, all ints aligned to the size of int, all struct x aligned to the alignment of struct x, etc. This has nothing to do with padding; it's about the fact that the structure is much bigger than the CPU can fetch at once, so it goes to memory more often. Misalignment is when a long straddles a boundary, so that the CPU must fetch two words, take only the last half of the first and the first half of the second, and glue them together. That's not the problem here.

  • @astronemir
    @astronemir 3 months ago +1

    This is fucking amazing actually

  • @NatesRandomVideo
    @NatesRandomVideo 3 months ago

    Great discussion. Now they need to add the same optimizations for the other architectures! Grin…

  • @houdaifabouamine
    @houdaifabouamine 3 months ago +2

    Thank you,
    I hope that you share more content like this; a paid series is also a good idea.

  • @dirkschannel5817
    @dirkschannel5817 3 months ago

    Cool 🙂 Are there any tests comparing how the network stack of Linux now compares to FreeBSD's? These patches make Linux more viable for more use cases.

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q 2 months ago

    Organizing structs chronologically does have a technical reason. If you access a pointer to a new struct with code which expects the old struct definition, then it still works. It's binary backwards compatible. It doesn't work for arrays but it does work for individual structs. This behaviour can be very useful in systems that need to persist and go through many iterations like a kernel.
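
    A small C sketch of why append-only ("chronological") layout preserves that compatibility (hypothetical fields):

    #include <stdint.h>

    /* code compiled against v1 reads id and flags at the same offsets in a
       v2 object, because the new field was appended rather than inserted */
    struct conn_v1 { uint32_t id; uint16_t flags; };
    struct conn_v2 { uint32_t id; uint16_t flags; uint16_t retries; };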

  • @lalitkale
    @lalitkale 3 months ago +2

    To be precise - it's a networking subsystem optimization.

  • @saidbakr
    @saidbakr 4 days ago

    So, if my Linux kernel is 6.8, will the ping response time to my router be shorter by 40%? Or does my router need to run 6.8 too?

  • @netman87
    @netman87 3 months ago

    Does this matter with virbr between virtual servers?

  • @darknetworld
    @darknetworld 2 months ago

    I wonder if the A.I would try to reduce lag?

  • @srivarshan7925
    @srivarshan7925 3 months ago

    I am thinking: can't we implement this kind of optimization in the compiler itself? It is just an algorithmic problem of ordering the elements in a struct to reduce padding.

  • @korgmangeek
    @korgmangeek 3 months ago

    Should we call it defragmentation of memory? Or maybe defragmentation of cache memory?

  • @HA7DN
    @HA7DN 3 months ago

    The question to ask is: does it break any usermode ABI?

  • @BenUK1
    @BenUK1 2 months ago

    Very interesting. It also made me think about the fact that there is a conflict between how we humans would logically group variables together for readability and how the data should be organised in memory for optimal cache utilisation, and the fact that there's no programming language that I'm aware of that explicitly deals with these 2 conflicting requirements.
    Maybe we need a language where you can explicitly separate these 2 requirements, for example:
    struct ColouredTriangle
    {
        unsigned int X1 : GroupA;
        unsigned int Y1 : GroupA;
        unsigned char Red1 : GroupB;
        unsigned char Green1 : GroupB;
        unsigned char Blue1 : GroupB;
        unsigned int X2 : GroupA;
        unsigned int Y2 : GroupA;
        unsigned char Red2 : GroupB;
        unsigned char Green2 : GroupB;
        unsigned char Blue2 : GroupB;
        unsigned int X3 : GroupA;
        unsigned int Y3 : GroupA;
        unsigned char Red3 : GroupB;
        unsigned char Green3 : GroupB;
        unsigned char Blue3 : GroupB;
    }
    Here we have the source laid out in a logical order for humans to read, but the "Group" suffix tells the compiler to actually group them in memory differently to how they are laid out in the source.
    That was just a noddy example of one way to do it, but I'm sure there would be better ways to separate these 2 concerns in an actual language.

  • @FredPhillips32169
    @FredPhillips32169 3 months ago

    IW10 was a huge improvement and it took forever to catch on.

  • @973sandman
    @973sandman 3 months ago

    If zero packets are received you have zero cache misses; the compiler can't know the access pattern at compile time. Kudos to Google's people for identifying the variables to be grouped - I wonder if there's a runtime profiler for Linux to help with that.

  • @za_ozero
    @za_ozero 3 months ago

    My only question is whether it can invalidate networking drivers and such.

  • @alessi4249
    @alessi4249 2 months ago

    Also goes to show how optimised it already was, that this level of optimisation has that much of an effect.

  • @Murph9000
    @Murph9000 2 months ago

    Compilers can't reorder traditional C structs, as the struct is essentially a recipe for encoding and decoding a binary blob of data. This is relied on for things like accessing protocol headers, where the data order is externally defined. Additionally, the struct will be defined in one place (usually a header file), but may be used in many source files; the compiler doesn't know all of the places it is used when compiling individual source files.

  • @ehudgavron9086
    @ehudgavron9086 3 months ago

    Hussein, thank you for the in-depth analysis. Also h/t to Michael@Phoronix because he's on top of everything.
    Is this good - yes, in terms of results.
    No, in terms of layer violations.
    So, in simple terms (for the readers): if we choose to improve the end result (a good thing) but to do so we must violate the layers of the protocol, the software, the system, and the CPU, the underlying assumptions that work well today become accepted and may not work further down the road.
    Simply put: TCP only requires a usable IP underlayer. IP only requires a usable physical (PHY) layer. The physical layer only requires a usable OS and drivers (Winderz, Linux, BSD, MacOS, etc.). That final part is the hardware, which in this case is unchanged, but future versions may have future hardware that doesn't play as nicely with those same OSs and drivers.
    When you violate the layering, you make assumptions about lower or upper layers that are likely true and useful TODAY for YOUR setup. They may not apply tomorrow or for everyone's setup. It's unreasonable to expect future users to benchmark whether the changes are good or bad for their setups... and so we avoid layer violations.
    As an experimenter -- good job Google.
    As a consultant -- good job Hussein and good job Michael Larabel @ Phoronix.
    As someone who likes to leave a job done for good and never be called back... not so much.
    Ehud
    Tucson
    Arizona
    US

  • @akshay-kumar-007
    @akshay-kumar-007 3 months ago

    Is this patch specific to how the compiler behaves? Let's say the Linux kernel uses the GNU C compiler, which doesn't re-arrange the variables in a struct, and memory for the fields of a struct is assigned sequentially in the order they're declared.
    Now let's say we had Clang as the compiler instead, and maybe Clang does some optimization causing the struct variables not to be in the user-specified order.
    It's been really long since I programmed in C, so I have no clue tbh. Maybe some C expert can answer.

    • @akshay-kumar-007
      @akshay-kumar-007 3 months ago

      Sorry, I saw the last few seconds of the video and Hussein also has a hunch about it, but isn't sure.

    • @davidmartensson273
      @davidmartensson273 3 months ago +2

      Rearranging the parts of a struct runs the risk that some code uses a pointer and an offset to fetch just one part of the struct; that would fail if you rearrange things without updating any code doing such strange things.
      And one of the reasons C and C++ are fast is that developers CAN do strange things if they find they improve performance.
      So a compiler would have to do a massive amount of analysis to determine whether a specific struct is safe to rearrange.

  • @lhxperimental
    @lhxperimental 3 months ago

    The compiler can't decide on its own to change the order of struct members. Structs are often serialized as-is, so the order of bytes on the wire would change if the members were reordered.

  • @AlejandroRodriguez-wt2mk
    @AlejandroRodriguez-wt2mk 3 months ago +1

    The order of things actually matters!!

  • @RustedCroaker
    @RustedCroaker 2 months ago

    The better question is:
    How did relocating variables in a structure to please the cache make Intel 3.31% slower?

  • @Zelidar
    @Zelidar 2 months ago

    As we all know, the Linux kernel was first released in 1991 by Linus Torvalds. Since then, Linus and thousands of other sharp-minded software engineers have relentlessly continued to develop and refine it. It is pleasing to see that such significant improvements can still be made 32 years later; it is also a bit surprising, I must say.

  • @jonathancrowder3424
    @jonathancrowder3424 3 months ago +1

    Really got to hit that line if you want the speed

  • @frusso78
    @frusso78 2 months ago

    Would this type of optimisation also apply to other natively compiled languages such as Rust or Swift?

    • @PaulSpades
      @PaulSpades 2 months ago

      No. Neither guarantees data structure layout, except for very basic data structures; C does.

    • @brdrnda3805
      @brdrnda3805 2 months ago

      @@PaulSpades Rust does if you define your struct with #[repr(C)], which was introduced to do what C does.

  • @pablocoyote9647
    @pablocoyote9647 3 months ago

    No Intel IPv4, or did I miss it?

  • @offBeatRock777
    @offBeatRock777 3 months ago

    Can this be backported to 5.x please 😅

  • @henrymach
    @henrymach 2 months ago

    Interesting. The obvious followup question is: can it be done to everything else? And are they doing it already?

    • @PaulSpades
      @PaulSpades 2 months ago

      1. Yes. 2. I'd guess no. It's probably not worth the Google employee's time to touch any other kernel code outside the networking stack. Who is "they"?

  • @Devasia.Thomas
    @Devasia.Thomas 3 months ago

    It'd be nice to know how this affects older or current CPUs with less cache. Datacenters churn their infra every few years, so it definitely helps them for sure.

    • @edwardmacnab354
      @edwardmacnab354 3 months ago

      just use a different kernel

    • @Devasia.Thomas
      @Devasia.Thomas 3 months ago

      @@edwardmacnab354 That's always an option if you have a problem with it. I was just curious how it affects older stuff :)

  • @konfushon
    @konfushon 1 month ago +1

    Not to be a party pooper, but the order of variables in Ethereum smart contracts matters a lot, like a lot! This is so we save gas. So seeing this tweaking of variable order in Linux honestly makes me happy.
    Hey, it seems engineering concepts are the same no matter where you work.

  • @jotec1290
    @jotec1290 2 months ago +1

    The better question would be: how many megawatt-hours of energy did Google save with this patch?

  • @Ray14324
    @Ray14324 3 months ago

    When you add something to a struct, its usage is generally more limited, as it's usually part of a new feature or mechanism. At that point you don't know how often that part will be used, and over the course of years its use may change. So the one who added the line can't possibly account for cache misses on each structure change, especially in big code bases. Expecting a developer to take that into account is unjustified. Every couple of years it's justified to measure cache hits as a separate optimization task. Those things usually come later in the game.

  • @nicholasfigueiredo3171
    @nicholasfigueiredo3171 3 months ago +1

    The compiler can't organize your struct for efficiency, since the access pattern is only known at runtime. You have to do it manually after running the code at least once. It is also not a good idea for the compiler to re-organize your struct, since some optimizations rely on it being laid out as declared; you need a specific reason to change it, so you do it manually afterwards in cases like this.

    • @davidmartensson273
      @davidmartensson273 3 months ago +2

      It's also about memory layout: in C a struct is contiguous memory, and you can, for example, cast one struct into another and read the same data in a different way. The compiler cannot know whether any user of the struct does anything like that, and therefore cannot change the order without risking breaking changes.

  • @jozsab1
    @jozsab1 3 months ago +1

    256 MB cache... back in the day I had 56 MB RAM (yes, my dream was to add another 8 MB) under DOS :D

    • @phillipsusi1791
      @phillipsusi1791 2 months ago +1

      That's a really weird configuration. 3 sticks of ram, 32 + 16 + 8?

  • @nghiaminh7704
    @nghiaminh7704 2 months ago +2

    game dev calls this data-oriented design

  • @pdlloyd
    @pdlloyd 3 months ago

    Why reorganize the variables in the kernel source code rather than optimize placement at compile time in the compiler? Even though it's a significant improvement, it now creates a dependence on the entire hardware system doing things a certain way and not changing. Linux is best supported when building with GCC, but does this speedup hold for Clang and other compilers? With different build-time flags? With different processor architectures? It's now making style a part of syntax in a way that is not immediately obvious just reading the code.

    • @davidmartensson273
      @davidmartensson273 3 months ago +2

      As already stated in many comments, structs are often serialized and deserialized as byte streams, sometimes in different parts of the code; reordering would need to know of every such case.
      And sure, in this case it might not be relevant, but the compiler does not know how this struct is different from another struct, so you would have to build specific optimizations for just the kernel, and by then, optimizing the kernel code by hand is just easier and safer ;)

  • @jeffcauhape6880
    @jeffcauhape6880 2 months ago

    It's funny that no one thought of this before.

  • @RH-jj7ck
    @RH-jj7ck 3 months ago

    Does this same thing work for ARM-based processors?

    • @smlgd
      @smlgd 2 months ago

      Yes, for any processor that has a cache, really. So pretty much every Arm CPU core except the Cortex-M4 and below.

  • @AndersonPEM
    @AndersonPEM 2 months ago

    Reorder a struct
    GETS 40 PERCENT IMPROVEMENT
    Now that's a level of optimization I can only dream of achieving :)

  • @hadekhae.f.5847
    @hadekhae.f.5847 3 months ago

    I live at the mercy of these engineers, hope they stay fit and sharp 🥹

  • @zoso25
    @zoso25 2 months ago

    I'm curious as to why a packed struct couldn't help here.

  • @amilamad
    @amilamad 3 months ago

    Compilers can use profile-guided optimization.
    I am surprised the Linux kernel had this kind of unoptimized code.

  • @ZipplyZane
    @ZipplyZane 3 months ago +3

    The title is a bit confusing. "40% performance" implies that it is only 40% of the previous performance--i.e. lower.
    Maybe add the word "improvement" to the end of the title.