Google Patches Linux Kernel for a 40% TCP Performance Improvement
- Published Jun 6, 2024
- Google submitted a patch to Linux kernel 6.8 that improves TCP performance by up to 40%. The gain comes from rearranging the TCP structures for better CPU cache-line utilization. I explore the patch in this video.
0:00 Intro
0:30 Google improves Linux Kernel TCP by 40%
1:40 How CPU Cache Line Works
6:45 Reviewing the Google Patch
www.phoronix.com/news/Linux-6...
lore.kernel.org/netdev/202311...
Discovering Backend Bottlenecks: Unlocking Peak Performance
performance.husseinnasser.com
Fundamentals of Backend Engineering Design patterns udemy course (link redirects to udemy with coupon)
backend.husseinnasser.com
Fundamentals of Networking for Effective Backends udemy course (link redirects to udemy with coupon)
network.husseinnasser.com
Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
database.husseinnasser.com
Follow me on Medium
/ membership
Introduction to NGINX (link redirects to udemy with coupon)
nginx.husseinnasser.com
Python on the Backend (link redirects to udemy with coupon)
python.husseinnasser.com
Become a Member on YouTube
/ @hnasr
Buy me a coffee if you liked this
www.buymeacoffee.com/hnasr
Arabic Software Engineering Channel
/ @husseinnasser
🔥 Members Only Content
• Members-only videos
🏭 Backend Engineering Videos in Order
backend.husseinnasser.com
💾 Database Engineering Videos
• Database Engineering
🎙️Listen to the Backend Engineering Podcast
husseinnasser.com/podcast
Gears and tools used on the Channel (affiliates)
🖼️ Slides and Thumbnail Design
Canva
partner.canva.com/c/2766475/6...
Stay Awesome,
Hussein - Science & Technology
Get my Udemy Fundamentals of Backend Engineering course: backend.win
Kudos for a good explanation. Today, I learned!
hello! thanks for the insightful video:) is it possible for you to publish your content on linkedin learning? some organizations have a subscription for that instead of udemy.
Another tactic that could potentially be used by the Linux kernel to speed up the network stack would be to pin one or more CPU cores to exclusive use for packet pipeline processing. Pinning would mean taking said core away from the kernel scheduler as a resource to be managed during context switching of thread execution.
This is a very well established tactic used by the Intel Data Plane Dev Kit (DPDK) and is referred to as Polling Mode Driver (PMD) approach. Such pinned cores execute code at 100% CPU cycle time, never block on anything - nor are they preempted by the kernel. In DPDK terminology, a pinned CPU core is referred to as an lcore and there is a DPDK API to execute C functions on said lcores. (Modern CPUs have so many cores that it warrants pinning some cores for specialized processing.)
So I have a DPDK program where I can configure how many lcores to dedicate to a lcore worker pool. Am using C++17 for this program. Various pipeline operations get established as constructed callable objects, which means that if I have a callable object stack variable called action, I can just invoke it as action(), or just treat it as a piece of aggregate data.
Said callable object supports C++ move semantics. If am configured to be executing the entire application on just a single lcore, then it just immediately invokes the callable object on that lcore context. If there are multiple lcores, then the action is C++ moved into a lock-free work queue. All of the lcores in the worker pool are in an endless loop of retrieving a callable object from the work queue and then just invoking it. Each lcore is assigned a unique token. The callable objects are published using this set of tokens - the publisher just ring cycles through the set of tokens as it publishes work items. The consuming lcores retrieve work items as identified by their respective token. This averts contention between lcores rendezvousing on the work queue and provides a fair, round-robin, load-balanced processing.
This kind of network programming greatly exceeds what can be accomplished by writing conventional networking applications that rely only on the kernel network stack and conventional multi-threaded concurrent programming. And such applications run entirely in user space - never touching the Linux network stack. So the high performance data plane pipelines never block, never transition to kernel mode, and never get preempted by the kernel scheduler.
And naturally data structures are tuned around the cache line size.
ah, clickbait
There are multiple things that Coco did with "just moving around variables" that helped with the performance that you didn't even realize. :D
When data is fetched and a cache miss happens, the CPU fetches a whole cache line (and on a TLB miss, the page-table translation gets cached in the TLB). Reorganizing the struct so that contextually dependent variables are neighbours means fewer round trips to main memory. But the other thing Coco did was organize the variables in a way that the compiler didn't pad them as much, because compilers align variables to their natural size. So if you put a bunch of smaller variables (u8) between bigger ones (u32), you get a lot of padding, resulting in a much bigger struct. That in turn means more chances that the data you need doesn't fit in one cache line, resulting in more memory reads.
Ram bus go vrum vrum?
How do you even approach such a thing. How do you analyse it 🤔
@@petter9078 trial & error… 😅 and loads of hours in ghidra & the godbolt compiler explorer
@@petter9078 If you know about the problem, just looking at a struct will be enough to see possible "holes".
I never did anything on this level, but back in the MS-DOS days you reorganized the order drivers were loaded in, because the OS started at the top and put each driver into the next free area it would fit into, but it never backtracked to check areas it had already skipped. So by rearranging the driver order you could get it to put a smaller driver into a space that would otherwise be left empty.
That way you could minimize the amount of memory below the 640 KB limit that the actual programs had access to, and that could be the difference between being able to run some games or not.
Similar concept.
@@petter9078 Processors expose performance counters. It ain't perfect, but you can see roughly where the cache misses are.
Corollary: some wrong variable placements will give you 30% performance degradation.
It's not about variable placement but organising data in memory.
@@kartik4792 wasn't the variable placement causing how the data was organized in memory?
@@kartik4792yes. organizing variable is how you organizing the memory. Genius.
@@kartik4792 : Trying to distinguish variable placement and memory placement as different things will sometimes cause massive misunderstandings. Your pedantic attempt at such a distinction is bad and you should stop.
@@absalomdraconis I don't think it's that pedantic. IIRC the compiler can decide where on the stack a variable will be placed, so if two variables are declared one after the other, there might be no guarantee that they are also laid out next to each other in memory.
A standard C optimization technique back in the old days when you cared about performance: aligning your data structs via padding (using bit fields) to match how the CPU fetches data from memory and fills its caches.
Programmers nowadays don't care about it.
wow, where can I read more about this?
It's not that you don't care, it's that for the most part you don't have access to it. If you're not doing ASM, C, C++, similar, chances are you DON'T have low level access to ensure optimizations actually work.
Rule 1: never optimize
It has been many decades since I did it.
This was usually when you dealt with CPUs that wanted word-aligned data structs and could not handle unaligned data fetches, like the Motorola 68000, which would give you an interrupt.
You would do it by hand in your structs and use unions to overlay them onto 32-bit or 16-bit words.
Later your C compiler would automatically pad your C structs/unions to make sure they fell on word-aligned boundaries.
The Intel memory controller was just smarter about it and did more than one memory fetch, impacting performance.
@@austinscott4695
Not many of us write kernel drivers or OS code.
The game industry cares about performance - have you ever tried writing code for a PS3 with its oddball architecture?
Believe me, you're going to be staring at profilers, at how fast the GPU is being fed with data, how busy the CPU is being kept, and how much memory is being consumed.
@@zoenagy9458
I miss system programming, this was standard when writing drivers/kernel modules etc.
High level programming, fast CPUs and large caches has spoiled everyone.
More like the Linux kernel is so dang big now that optimization projects like this get overlooked. When Linux was running on 8MB systems (with X!) much more attention was paid to small details.
Can the compiler help anything here?
@@ssl3546 There shouldn't be any "optimization projects". If people paid attention to the efficiency of their code _when they first f'ing write it,_ no one would ever need to come back later to rewrite it. But that takes intelligence and time; things modern programmers don't have. (Make things that can be sold, and get them to market as fast as possible. Quality isn't necessary.)
It's partially because everyone thinks it's like JavaScript or Python, and doesn't realize that C is quite simple and straightforward.
it's always features over optimizations, the latter can come later at any point in time, just like this patch
Chill guys... It's not about superiority, it's about productivity. Google does this because they can reduce their server costs significantly, not just because they're capable of doing so. There are tons of micro-optimizations left to be done out there, but nobody cares because it doesn't make money. It's that simple. Follow the money, not your ego.
No one said otherwise. Of course, at the end of the day this PR benefits Google in more than one way.
I agree with you. People will pay attention to details when they can see $$ in their eyes.
Is the improvement for both ends of the connection? Meaning on both client and server?
Without looking into it beyond this video, I'd say it'll benefit both. That said, this type of optimization is dependent on scale, so the client will likely see negligible benefit unless they have a large number of connections open simultaneously.
@@chetan_naik The improvement is measured in CPU seconds spent divided by the duration of a TCP round trip. So there's no real data in the patch notes as to whether this helped with the latency portion.
Basically, it depends on whether the bottle neck is the CPU side or the NIC side.
Love that you share the latest changes and improvements in the industry with easy-to-understand explanations. Love the passion, and thank you for sharing! Could you please cover more OS/memory/CPU-level details of how code is executed in future videos?
So basically all that they did is moved things around to maximise the amount of data inside the structs that put sequentially would fit inside a 64-byte chunk of memory...
Let's say I need to handle a request, and I need the variables X and Y, but there are 80 bytes of other stuff between them. Now the CPU needs to request those two variables in two separate requests. If X and Y were next to each other, they would most likely fit inside that 64B block of memory, which would then get cached. When we're finished with X, we don't need to re-request Y separately, because it has already been cached from that 64B chunk we requested previously. We just cut our memory access time by 50%.
I didn't think about these sorts of things originally, but then I decided I'm a masochist and wrote a virtual filesystem with native AES encryption. I really learned a lot about memory when designing it, especially since AES works on 64-bit/8B chunks, so everything, every sector and block had to be structured perfectly...
You explained it very clearly and in a very simple way. Thankyou.
I thought there was only AES-128 and AES-256, with 128 bits being a block size of 16 bytes, not 8 bytes..?
Thanks for the explanation. By your "all they did", I assume this was simple to do or see? Then why hasn't this been done ages ago? A 40% improvement is no joke.
@@rupertsmixtapes812 Yeah, sorry forgot whether it was 8 or 16 bytes because it was like 2 years ago and haven't really needed to dive down into the details of any encryption algorithm since then, thanks.
@@giannismentz3570 I clicked the video because I was interested, I heard about CPU caching, memory, then after the word "structs" I suddenly realised "Hell, they only just describe the type and order of data inside memory, I bet they just moved things around so that the most needed props are closer together and faster to read."
And as described in the video, most people these days use, or start their journey with, high-level languages that don't give a damn about the order of data, much like JavaScript and TypeScript's structs/interfaces, which only serve you at compile time. ThePrimeTime recently posted a video about how junior devs are "noobs", where they discussed how people have stopped learning what their code is actually doing on their machines, mostly because hardware is so fast that people focus more on writing good-looking code that is quick to understand and work with, rather than squeezing out performance they will probably never need.
The abundance of computing resources made these hyper-optimisations uneconomical, it's more profitable to roll out features than to refine existing ones.
When you go to a store, do you waste your precious time to find a plastic bag somewhere in your home and bother bringing it with you, or do you just buy a new one every time? They are cheap and abundant...
It's just the way struct packs variables into memory. Going back to the basics!
Yeah, but one should always check with current architecture specification. What worked for i386 might be wrong for i686.
Superb, its just like reading a page when it comes to database. Which is why clustered table has its benefits. Its all about optimising for the next action.
Thanks for covering this!
That is a very good description. Surprising that something so simple makes such a big difference. I would have expected this to be a compiler optimization, and I hope it will be in the future.
Thanks Hussein ❤
this is fascinating.. recently I was exploring how golang does padding on struct variables to align those variables to fit the 32/64 bit word length so that memory reads align
never occurred to me it would bring this huge improvement.. I thought it was a microoptimization but I guess it builds up when it repeats (networking stack, TCP connections definitely happen in a huge volume and it's good to see the micro optimization getting multiplied to attain a significant improvement)
Love this! Best part is that it's trivially correct.
Very nice topic and presentation
AFAIK this is, as the data shows, an AMD-specific optimization. Intel uses a (compared to AMD) highly advanced prefetcher that, unlike normal hardware prefetchers, works not only on memory address accesses but also on the data values themselves. Because of this, a physical reordering of the data in memory does not significantly impact the accuracy of Intel's prefetcher.
interesting. so that means there could be many other areas of the kernel that could be heavily improved on AMD whereas intel is mostly unaffected.
@@BattousaiHBr Yes, AMD's processors are overall a bit rough around the edges from a hardware perspective and brute-force oriented. One can optimize programs for the quirks of a single architecture (as is done in supercomputing, where you know your hardware and can easily optimize for it), but this generally comes with a regression on other architectures, which you want to avoid. This change, for example, is simply better in general for the access patterns seen in production: even Intel profits, if less, and AMD profits more because of its architectural problems. But for this reason, the better solution is smarter prefetchers and branch predictors that improve performance in general, because we all write shitty code, and it's better not to lose 40% performance to bad code or a bad fit with the hardware architecture. Not every program can be optimized to this level; hell, most programming languages don't even offer this low-level memory management. To make all code faster, the platform it runs on must be smarter.
where did you read about this
@@ismbks Its related to my work, you can find information about the different prefetchers in CPUs, in security guides by the manufacturer. Prefetchers and branch predictors can lead, if wrongly implemented, to security vulnerabilities, information about prefetchers or branch predictors is mostly only given if a security vulnerability is found, as they are responsible for a lot of current processors performance gains in architecture.
@@darlokt51 are you prohibited from revealing more information by your anonymous employer
Well some compilers organize variable order in structs, but is mostly based on alignment to get efficient memory access. But in this case, aot compilers would have trouble because this kind of optimization relies on the runtime aspects of the problem. Consequently, optimizing this at the compiler level becomes more difficult, given that those compilers are generic
It would break everything if code built with those sorts of optimizations called other functions that take that struct. It should work for structs that are only used by static functions within one file's scope, or with something like -fwhole-program (which basically makes everything static and exposes nothing outside the ELF file), but I never checked that.
In C, the compiler is required to place the variables in memory in the same order they were declared in the struct. It is allowed to pad them for alignment, but not to change the order.
Really insightful.
Hey Hussein, this is the classic case of keeping memory compact according to its temporal access. Maybe Intel's prefetching was already doing the job of keeping the cache hot, and that's why there was not much gain there. Another possibility is that the compiled code differs between the two architectures. Anyhow, I always hear that optimization is architecture-dependent; this seems to be the case.
dig your observations. Absolutely ..everyone should look out for things we overlook.
Impressive!
How small and tiny things can have such big impacts!
Subscribed 🎉
What's great about C is the fact that it's so deterministic and simple.
Structs? Oh, they're just basically arrays lol.
Then it starts making sense how to optimize them for the cache
Excellent content!
Nice explanation. On your question at the end about the compiler auto-optimizing the order: I don't think it would work. The struct order is relied on by many things, like serialization and saving data to binary files. If you serialize or save to a file in some order, then you have to deserialize in the same order.
I think the best would be to have an allocation hint - to cacheline boundary (64B currently) for every malloc/kalloc call (or just align everything that way, lol)
Such a beautiful explanation Nasser, feels like listening to song
It's not only about moving things next to each other, but also about putting things that are updated by different CPUs on different cache lines, so they don't need to be invalidated on other sockets. This used to be done more regularly (besides having per-CPU structs).
Cant wait for the OS course
Wow!! I have never thought once about how the alignment of C struct fields interacts directly with memory pages and L3 cache. Simple but massively effective. Props to Google for that one!!
I've done your Udemy course, I've watching 40-50 of your videos, I just want you to know I frigging love you
ok wow 40%, you have my interest!!! 😮
The cache line seems entirely analogous to pages in the kernel page cache; huge performance increases are par for the course there too. For higher-level programmers, things like iterators and lazy loading make similarly huge performance improvements, not to mention threading and multi-processing. I am guessing that you may have hit the nail on the head at the end: variable location may be compiler-dependent, but putting variables in a struct makes the compiler follow that exact order of definition.
'they just moved some variables around'
literally the job of a programmer.
Hey Hussein, would you make a video about how does it work when we have an index in X column, and we make a query that has a "table.X IN (1, 2, 3, ..., 300) " but there is a big list of parameters inside the In clause? I know it uses the index but seems very slow
you can also lower connection close timeout for essentially free performance
جميل يا حسين كالعادة ❤
The compiler can't auto reorder variables because it breaks ABI compatibility with stuff compiled against your lib
This aspect should be given more importance in the discussion. I assumed that compilers already did these kinds of optimizations. Also, the issue of whether the network hardware stack is so uniform (around the world) that these optimizations work on a majority of routers / switches / other network server hardware - enough to make it into the main kernel.
@@anandsharma7430 There's a very interesting document in the KDE dev guidelines, and one of the biggest no-no's is that you can't change the order or even add/remove non-static data members to a class that is in use (regardless of visibility), because it will force a break on the binary compatibility.
C is so old. Insane to think the performance improvements Linux could get if they fully switched to Rust, since it does reordering and this stuff is internal.
@@laden6675 That is not how it works. While the C language is old, its compilers are not. They are, in fact, built upon the same LLVM backend that powers Rust. And the language is simple enough that loads of optimizations can be applied to it. C is still around for a reason.
Of course, Rust CAN produce faster programs than C, but the inverse is often the case.
Rust edge here is that it's harder to leave a security hole there than it is in C, and that in most cases, that comes at zero performance cost. Harder, but not impossible.
Finally, Rust's ABI is famously not stable so we'll still have to use C ABI to interop with dynamic libs for a while. While the Linux kernel is monolithic, it still needs to expose system calls, that will need to be called from a C program. So they would need to disallow the variable reorder anyway.
@@rafagd C compilers need to be standard compliant, and changing the order of fields from the one that's declared is non-standard.
Even if they could, because of how C programs are structured where a program is built from a collection of object files that need to have stable ABI which are independently compiled, there is much less opportunity for optimization than in Rust.
The vast majority of structs and functions are NOT exposed to the userspace, so this is a strawman.
> C is still around for a reason.
Inertia. It will take time, but over time there will be less and less C code in the software we use.
cool video - I dont think it would make sense for the compiler to reorder the members of the struct. In some cases, I've casted the entire struct to a bytestream, and deserialized it somewhere else. In this case, if its reordering the members of the struct, my deserialization logic would fail on the other side, which would be unexpected behavior.
Completely true. A struct needs to be binary predictable, which is why this works.
L3 cache latency on an EPYC CPU is in the tens of nanoseconds range (19 to 31ns).
In the 1980s and all through the '90s, we did all our programming with this level of attention. Prior to then, I programmed mainly in assembly language, which required extreme attention to every instruction in the listing. BTW: the fact that Intel processors saw only a 1 to 5% increase could be attributed to Intel possibly having a better caching structure and algorithm to start with.
How do I identify these data reorganization possibilities? Any resources?
Benefits of code runtime analysis using a profiler. You can also benefit code performance by moving around code blocks. Most of the time it isn't about how fast the code is but how fast one can produce the code. There have been many hardware architectural and performance improvements since 1991 and most of the time code just lays dormant and gets reused.
What resource are you currently using to learn more about cache lines and data driven design?
Can you please tell how it was discovered that such an optimization can be made? Do there exist any performance monitoring libraries for this specific purpose? Also, since this is an optimization problem, what is the best approach to find the optimum i.e. was this solution arrived at by human knowledge along with some data analysis or was this problem solved by some Reinforcement Learning approach?
It's kinda crazy that optimisations this big are still possible after 30 years.
Awesome video. For more on this topic of memory alignment and cache lines, I highly suggest Andrew Kelleys video from Handmade cities conference. This is also known as Data Oriented Design - loosely. The gains he got out of his compilation for zig with similar optimizations was insane. And he has real examples from the code. Thanks for the vid Hussein!
I know a bit about spatial locality (when an address is accessed, nearby addresses are likely to be accessed soon). Given that, it should have been obvious that structs should hold related variables adjacent to each other. How could they miss this in the first place?
Rust automatically optimizes for argument position so your struct is still readable but doesnt require excessive padding.
also, rfc #1358
awesome!
C compilers could in theory optimize structs themselves by re-ordering their members, the problem with that is, that the struct layout is not stored in the compiled code, so if a binary and multiple libraries share the same struct type and are built with different compilers or different versions of the same compiler, the order may differ and then everything would break, that's why compilers won't change the order. Also compiled code relies on the order given at compile time, so every time the order changes, all code that reads/writes from/to the struct, must be re-compiled. Optimizing the order of structs is something programmers must do by hand in C and can only do if no external code relies on a given order (as is the case with the Linux kernel, where non-kernel code doesn't know anything about those in-kernel structures).
In languages other than C, compilers may re-order members of data structures themselves but that is only possible if the compiled code stores the order that was used by the compiler and if all compiled code is written in such a way, that it can adapt to any order chosen and any order change in the future without requiring re-compilation. The downside of such a feature is that all structure access will be slower and the code will be bigger and more complicated. Keep in mind that C is simple, low level and supposed to be fast.
The Intel bench seems to have a 200 Gb/s NIC vs. only 100 on the AMD?
Significant performance improvements with simple realignment.
I wonder if changing tab stops from 4 to 2 will have similar effects.
One has to wonder which performance improvements might be possible by implementing a data orientied design, i.e., the switch from the traditional Array-of-Structs (AoS) to the more cache-friendly Struct-of-Arrays (SoA) memory layout.
Is this theoretical or does anything like that already exist?
@@PR-cj8pd As an example, data oriented design is becoming more common in modern game engines in the form of Entity Component Systems (ECS). For compute intensive workloads such as physics simulations, this approach promises a considerable speedup.
@@PR-cj8pd computer graphics stuff use this a lot, from renderers to game engines, often the whole software architecture is based off that concept
I'd like to know what the relative performance of TCP/IP with respect to Intel and AMD before the optimization was implemented. Is it that AMD was faster and the gap grew, the performance was the same and AMD gained an advantage or Intel was faster in the first place. Long ago Intel used a cache inclusive approach and AMD caches levels were exclusive and so had a higher latency. I'm not sure if that is still the case but it might have some bearing on how to optimize the code.
I can't believe it... the kernel TCP/IP stack was written without padding in mind. Years ago when I used to write game engines, the first thing you did with any data structure was to align it with padding. There are tons of pragmas on all major C/C++ compilers to support this. This perf boost is good... but it's like someone forgot to change the engine oil in a car, and when they finally change it after years they feel good about how smoothly the car runs.
This one didn't have a lot to do with padding, more with spatial locality: things that need to be used together should be kept close together.
The compiler already pads variables for you: all longs will be at addresses aligned to long's size, all ints aligned to the size of int, all struct x aligned to the alignment of struct x, etc. This has nothing to do with padding; it's that the struct is much bigger than the CPU can fetch at once, so it goes to memory more often. Padding is about a long being misaligned with its size, in which case the CPU must fetch two words, take only the last half of the first and the first half of the second, and glue them together. That's not the problem here.
This is fucking amazing actually
Great discussion. Now they need to add the same optimizations for the other architectures! Grin…
Thank you,
I hope that you share more content like this; a paid series is also a good idea.
Cool 🙂 Are there any tests comparing how the network stack of Linux now compares to FreeBSD? These patches make Linux more viable for more use cases.
Organizing structs chronologically does have a technical reason. If you access a pointer to a new struct with code which expects the old struct definition, then it still works. It's binary backwards compatible. It doesn't work for arrays but it does work for individual structs. This behaviour can be very useful in systems that need to persist and go through many iterations like a kernel.
To be precise - it's a networking subsystem optimization.
So, if my Linux kernel is 6.8, would the ping response time be shorter by 40% to my router? Or does it need my router use 6.8 too?
Does this matter with virbr between virtual servers?
I wonder if the A.I would try to reduce lag?
I am thinking: can't we implement this kind of optimization within the compiler itself? It's an algorithmic problem to reorder struct elements to reduce padding.
Should we say defragmentation of memory? Or maybe defragmentation of cache memory?
The question to ask is: does it break any usermode ABI?
Very interesting. It also made me think about the fact that there is a conflict between how us humans would logically group variables together for readability and how the data should be organised in memory for optimal cache utilisation, and the fact that there's no programming language that I'm aware of that explicitly deals with these 2 conflicting requirements.
Maybe we need a language where you can explicitly separate these 2 requirements, for example:
struct ColouredTriangle
{
unsigned int X1 : GroupA;
unsigned int Y1 : GroupA;
unsigned char Red1 : GroupB;
unsigned char Green1 : GroupB;
unsigned char Blue1 : GroupB;
unsigned int X2 : GroupA;
unsigned int Y2 : GroupA;
unsigned char Red2 : GroupB;
unsigned char Green2 : GroupB;
unsigned char Blue2 : GroupB;
unsigned int X3 : GroupA;
unsigned int Y3 : GroupA;
unsigned char Red3 : GroupB;
unsigned char Green3 : GroupB;
unsigned char Blue3 : GroupB;
}
Here we have the source laid out in a logical order for humans to read, but the "Group" suffix tells the compiler to actually group them in memory differently to how they are laid out in the source.
That was just a noddy example of one way to do it, but I'm sure there would be better ways to separate these 2 concerns in an actual language.
IW10 was a huge improvement and it took forever to catch on.
If zero packets are received you have zero cache misses; the compiler can't know the access pattern at compile time. Kudos to Google's people for identifying the variables to be grouped. I wonder if there's some runtime profiler for Linux to help with that.
My only question is: can it break networking drivers and such?
Also goes to show how optimised everything else already was, that this level of data-layout optimisation has that much of an effect.
Compilers can't reorder traditional C structs, as the struct is essentially a recipe for encoding and decoding a binary blob of data. This is relied on for things like accessing protocol headers, where the data order is externally defined. Additionally, the struct will be defined in one place (usually a header file), but may be used in many source files; the compiler doesn't know all of the places it is used when compiling individual source files.
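The protocol-header case mentioned above can be sketched in C (a simplified UDP header as a hypothetical example; real kernel code also handles byte order with `ntohs`, which is omitted here). The declared member order IS the wire format, so the compiler must not touch it:

```c
#include <stdint.h>
#include <string.h>

/* Order fixed by the protocol (RFC 768): the struct mirrors the
   externally defined byte layout, so reordering would break decoding. */
struct udp_header {
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t length;
    uint16_t checksum;
};

/* Decode a header straight out of a packet buffer: the raw bytes
   land in the fields purely by declaration order. */
static struct udp_header parse_udp(const uint8_t *buf) {
    struct udp_header h;
    memcpy(&h, buf, sizeof h);
    return h;
}
```

This is why struct layout in C is part of the language contract, unlike in languages whose compilers are free to reorder fields.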
Hussein, thank you for the in-depth analysis. Also h/t to Michael@Phoronix because he's on top of everything.
Is this good - yes in terms of results.
No in terms of layer violations.
So, in simple terms (for the readers) if we choose to improve the end result (a good thing) but to do so we must violate the layers of the protocol, the software, the system, and the CPU, our underlying assumptions that work well today become accepted and may not work further down the road.
Simply put: TCP only requires a usable IP underlayer. IP only requires a usable physical (PHY) layer. The physical layer only requires a usable OS and Drivers (Winderz, Linux, BSD, MacOS, etc.). That final part is the hardware, which in this case is unchanged but future versions may have future hardware that doesn't play as nicely with those same OSs and drivers.
When you violate the layering, you make assumptions about lower layers or upper layers that are likely true and useful TODAY for YOUR setup. They may not apply tomorrow or for everyone's setup. It's unreasonable to expect future users to benchmark whether the changes are good or bad for their setups... and so we avoid layer violations.
As an experimenter -- good job Google.
As a consultant -- good job Hussein and good job Michael Larabel @ Phoronix.
As someone who likes to leave a job done for good and never be called back... not so much.
Ehud
Tucson
Arizona
US
Is this patch specific to how the compiler behaves? Let's say the Linux kernel uses the GNU C compiler, which doesn't rearrange the variables in a struct, and memory for the fields of a struct is assigned sequentially by the compiler in the order they're declared.
Now let's say we had Clang as the compiler instead, and maybe Clang does some optimization causing the struct variables to not be in the user-specified order.
It's been really long since I programmed in C, so I have no clue tbh. Maybe some C expert can answer.
Sorry, I saw the last few seconds of the video and Hussein also has a hunch about it, but he isn't sure.
Rearranging the parts of a struct runs the risk that some code uses a pointer and an offset to fetch just one part of the struct; that would fail if you rearrange things and don't update the code doing strange things.
And one of the reasons C and C++ are fast is that developers CAN do strange things if they find they improve performance.
So a compiler would have to do massive amounts of analysis to determine whether a specific struct would be safe to rearrange.
The compiler can't on its own decide to change the order of struct members. Structs are often serialized as-is, so the order of bytes on the wire would change if the members were reordered.
The order of things actually matters!!
The better question is:
How did relocating variables in a structure to please the cache make Intel 3.31% slower?
As we all know, the Linux kernel was first released in 1991, by Linus Torvalds. Since that time, Linus and thousands of other sharp-minded software engineers have relentlessly continued to develop and refine it. It is pleasing to see that such significant improvements can still be made 32 years later; ... it is also a bit surprising, I must say.
Really got to hit that line if you want the speed
Would this type of optimisation also find place with other natively compiled languages such as Rust or Swift?
No. Neither guarantees struct layout, except for very basic data structures, the way C does.
@@PaulSpades Rust does if you define your struct with #[repr(C)], which was introduced to do what C does.
No Intel IPv4 results, or did I miss it?
Can this be back ported to 5.x please 😅
Interesting. The obvious followup question is: can it be done to everything else? And are they doing it already?
1. Yes. 2. I'd guess no. It's probably not worth the Google employee's time to touch any other kernel code outside the networking stack. Who is "they"?
It'll be nice to know how this affects older or current CPUs with less cache. Datacenters churn their infra every few years, so it definitely helps them for sure.
just use a different kernel
@@edwardmacnab354 Thats always an option if you have a problem with it. I was just curious on how it affects older stuff :)
not to be a party pooper, but the order of variables in smart contracts in Ethereum matters a lot, like a lot! This is so that we save gas. So seeing this tweaking of variable order in Linux makes me happy honestly.
Hey, seems like engineering concepts are the same no matter where you work in
The better question would be: how many megawatt-hours of energy did Google save with this patch?
When you add something to a struct, its usage is generally limited, as it's usually part of a new feature or mechanism. At that point you don't know how often that part will be used, and over the course of years its use may change. So whoever added the line can't possibly take cache misses into account on each structure change, especially in big code bases; the expectation that a developer account for that is unjustified. Every couple of years it's justified to measure cache hits as a separate optimization task. Those things usually come later in the game.
The compiler can't organize your struct for efficiency, since the access pattern is only known at runtime. You need to do it manually after profiling at least once. It's also not a good idea for the compiler to reorganize your struct, since some code relies on it being laid out that way; you need a specific reason to change it, so you do it manually afterwards in cases like this.
It's also about memory layout. In C a struct is contiguous memory, and you can, for example, cast one struct into another and read the same data a different way; the compiler can't know whether any user of the struct does anything like that, and therefore cannot change the order without risking breaking changes.
256 MB cache ... back in the day I had 56 MB RAM (yes, my dream was to add another 8 MB) under DOS :D
That's a really weird configuration. 3 sticks of ram, 32 + 16 + 8?
game dev calls this data-oriented design
Why reorganize variables in the kernel source code and not optimize placement at compile time in the compiler? Even though it has a significant improvement, it now creates dependence on the entire hardware system doing things in a certain way and not changing. Linux is best supported building with GCC but does this speedup hold true for Clang and other compilers? With different build-time flags? With different processor architectures? It's now making style a part of syntax in a way that is not immediately obvious just reading the code.
As already stated in many comments, structs are often serialized and deserialized as byte streams, and sometimes in different parts of the code, reordering would need to know of every such case.
And sure, in this case it might not be relevant, but the compiler does not know how this struct is different from another struct, so you would have to build specific optimizations for just the kernel, and by then optimizing the kernel code by hand is just easier and safer ;)
It's funny that no one thought of this before.
Is this same thing working for ARM based processors?
Yes, for any processor that has a cache, really. So pretty much every Arm CPU core except Cortex-M4 and below.
Reorder a struct
GETS 40 PERCENT IMPROVEMENT
Now that's a level of optimization I can only dream of achieving :)
I live at the mercy of these engineers, hope they stay fit and sharp 🥹
I'm curious as to why a packed struct couldn't help here.
Compilers can use profile guided optimization.
I am surprised the Linux kernel has this kind of unoptimized code.
The title is a bit confusing. "40% performance" implies that it is only 40% of the previous performance--i.e. lower.
Maybe add the word "improvement" to the end of the title.