120x Faster Algorithm By Nested Loops

  • added Jul 8, 2024
  • Recorded live on twitch, GET IN
    / theprimeagen
    Reviewed video: • Adding Nested Loops Ma...
    By: / @depth-buffer
    MY MAIN YT CHANNEL: Has well edited engineering videos
    / theprimeagen
    Discord
    / discord
    Have something for me to read or react to?: / theprimeagenreact
    Kinesis Advantage 360: bit.ly/Prime-Kinesis
    Hey I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free and if you want you can get a decent amount off (the free tier is the best (better than planetscale or any other))
    turso.tech/deeznuts
  • Science & Technology

Comments • 306

  • @MichaelAbramo
    @MichaelAbramo 8 months ago +954

    It takes a true legend to make a 15 minute demo on optimization that he probably spent a month making and then at the end say "but instead you can just use this library that's 8x faster"

    • @khhnator
      @khhnator 8 months ago +44

      But only if your problem is algebra... the thing is that you will always have cache performance issues.
      It's really a gold mine of free performance gains if you can optimize the hotspots of your code.

    • @chrism3790
      @chrism3790 8 months ago +41

      Well, those libraries were created by someone who went through the optimization process in a much more thorough way than presented here, and that is interesting in and of itself.

    • @lesscommonsense1804
      @lesscommonsense1804 8 months ago +7

      True legends give the sauces.

    • @pithlyx9576
      @pithlyx9576 7 months ago +3

      @@chrism3790 yea but optimizing for your specific use cases will be more performant than a catch all solution in most situations. additionally i think the video is also telling a story that can be applied to many sectors of optimization.

    • @chrism3790
      @chrism3790 7 months ago +3

      @@pithlyx9576
      I disagree with that. Most of the time, people overestimate their ability to hard core optimize things. It takes deep knowledge to truly extract every ounce of performance from your hardware. If you're working on well known problems, trying to do these things yourself is a waste of time that will produce suboptimal results, almost guaranteed.
      Ask yourself, would you be able to optimize matrix multiplication yourself, to the degree that it was done in this video? And remember, even the author admits there is more under the hood of the referenced implementation.

  • @christophervsilvas
    @christophervsilvas 8 months ago +471

    I’m doing full stack coming from a game development background. As a game developer I had to deal with this stuff all the time and so when I got my first full stack job I wrote everything like a high performance C++ algorithm even though it was just a typescript nodejs server. At first I thought I was always just a bad programmer and didn’t even know it but now I am realizing good code means different things for different contexts. In game dev it means minimal operations to achieve best result. In web it means minimal complexity and maximum maintainability by the broadest set of programmers. It’s almost like you have to program for the “worst” programmer in the “room”.

    • @boastfultoast
      @boastfultoast 8 months ago +8

      If you don't mind me asking - how did you swap from game dev to full stack? What steps did you take to learn/fill in gaps for full stack?

    • @christophervsilvas
      @christophervsilvas 8 months ago

      @@boastfultoast Hey, the full story is a little complicated (I didn't exactly "swap", it's more like I got a good job from someone who didn't care what I knew, they just knew I could make what they need happen). Anyway, to answer your question I would really just say have a problem that you are interested in and solve it! Build something!
      I didn't really have the luxury of learning fullstack in the abstract. I started by creating a Shopify app with a Typescript NestJS backend, and NextJS admin ui front end. Redis for caching. MongoDB for db.
      You have to be okay with writing everything badly for a while so you see the problems and go look for a solution. I can't tell you how many times I have suffered through ugly programming experiences only to find a feature I didn't know about that makes everything so much easier. After two years of development I am currently on our third major refactor. Jump in with a simple project and be okay with sucking and not knowing how much you suck.
      Side note: if you come from a typed language I would really recommend learning Typescript. It was a huge help for me to get started. (If you don't come from a typed language I would still recommend Typescript :D).
      I just recently "relearned" how important it can be to have a project to work on in order to learn. I saw the Primagen talking about Rust and decided to give it a try but couldn't really make heads or tail of it for a whole week... Luckily my boss told me to make a simple service for querying every zipcode in the USA against the AT&T and Verizon coverage maps. I started dissecting both AT&T and Verizon's APIs using Python as a quick mockup. Then I decided to try to create it in Rust. At first it was really really slow... but then it just started to fall into place and after 3 days I know Rust pretty well.
      That all may or may not be helpful. If you have any more specific questions, feel free to ask!

    • @doc8527
      @doc8527 8 months ago +17

      I think you still got some wrong ideas from the video (not everything, but how you view the web/game difference). The video is the opposite of what you are doing. For a given specific size of dataset or specific data structure, sometimes simply dumping the nested array might run faster (Node.js in your example: V8 is doing some cache magic behind the scenes), which is what a lot of web devs are doing even though they don't realize it. Many in-place optimizations can end up slow because requirements change constantly; your old in-place optimization only works for a specific size of data, and then the stupid slow nested solution outperforms yours.
      The web is quite dynamic, so premature optimization is pretty bad. If you are working on a local offline game, rather than a game server whose traffic changes drastically all the time, then your in-place optimization will be best, because you know exactly what you will get. For web server development it's often hard to predict incoming or outgoing traffic (as the DB grows), so the stupidest slow solution can have the maximum reliability (regardless of size and structure) and flexibility later if you need to make a change.
      The video's point is that optimization doesn't work uniformly: size-wise and structure-wise, some versions work best only for a given range or a given structure.
      I have the opposite experience from the other side: I actually want to try game development coming from full stack (traditional software development). I constantly see game devs over-optimizing without knowing why, producing unreadable and slow code, while being pretentious about being the smartest, like leetcode obsessives. Many lack computer science knowledge, but hey, game devs are really passionate (more than traditional software developers by a margin), and I really like them doing creative things. In the end it's case by case, person by person; I wouldn't generalize my experience to the entire field, whether game or web development, or who is smart or not. It depends on the specific area/problem you are working on.

    • @BeefIngot
      @BeefIngot 8 months ago

      I want to know this too. Not because I think it's impossible at all, just curious. I would imagine for a motivated individual it just takes building your own zero-user project/service to a reasonable enough quality to feel confident you know what is going on, front to back @@boastfultoast

    • @martinvuyk5326
      @martinvuyk5326 8 months ago +7

      Yeah, I also started on a more C-performance track and then fell into web dev. It's absolutely about maintainability. But there are corners here and there where I still get a kick out of optimization, especially now that I'm doing data engineering on DBs. I had to reimplement something with numpy the other day and it was not really as readable as before, but I'll take 22 seconds over an hour every day lol

  • @CattoFace
    @CattoFace 8 months ago +68

    16:10 "Transposing the walk" works without hurting performance only when it doesn't cause cache misses. Earlier in the video he had not accounted for cache size yet, so it caused misses;
    but now the block is in cache, so the transposition doesn't cause extra misses.
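
    To make the two walks concrete, a minimal C++ sketch (illustrative code, not the video's):

      #include <cstddef>

      // Naive ijk order: the inner loop walks B down a column, so each step
      // jumps n floats ahead in memory and touches a fresh cache line.
      void matmul_ijk(const float* A, const float* B, float* C, std::size_t n) {
          for (std::size_t i = 0; i < n; ++i)
              for (std::size_t j = 0; j < n; ++j) {
                  float sum = 0.0f;
                  for (std::size_t k = 0; k < n; ++k)
                      sum += A[i*n + k] * B[k*n + j];   // B stride = n
                  C[i*n + j] = sum;
              }
      }

      // Interchanged ikj order: the inner loop walks B and C along rows
      // (stride 1), so consecutive iterations reuse the cache lines just
      // loaded. C must be zero-initialized by the caller.
      void matmul_ikj(const float* A, const float* B, float* C, std::size_t n) {
          for (std::size_t i = 0; i < n; ++i)
              for (std::size_t k = 0; k < n; ++k) {
                  const float a = A[i*n + k];
                  for (std::size_t j = 0; j < n; ++j)
                      C[i*n + j] += a * B[k*n + j];     // B stride = 1
              }
      }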

  • @DexterMorgan
    @DexterMorgan 8 months ago +50

    I'm a prewatch viewer. I watch it once, write down notes, then watch it again and pretend I know the answers while crossing my arms and nodding my head with a smug look on my face. Classic.

    • @ThePrimeTimeagen
      @ThePrimeTimeagen 8 months ago +28

      That is the way to do it

    • @YDV669
      @YDV669 4 months ago +2

      A fellow after my heart. I do that with PVR. Once the thing watches the movie I told it to record, I feel I no longer have to and can get on with my day without too much waste of time.

  • @PythonPlusPlus
    @PythonPlusPlus 8 months ago +55

    21 GB/s equates to 5.25 GFLOPS because floats are 4 bytes, and every operation has to load a float from memory. So 21/4 = 5.25 (see the arithmetic worked out below this thread).

    • @markkalsbeek5883
      @markkalsbeek5883 27 days ago

      But it's multiplication, so you need to fetch 2 floats right?
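
      Working out the parent comment's arithmetic, under the assumption that one operand streams from RAM per multiply-add while the other stays resident in cache:

        \frac{21\ \text{GB/s}}{4\ \text{bytes/float}} = 5.25 \times 10^{9}\ \text{floats/s} \approx 5.25\ \text{GFLOPS}

      If both floats of each multiplication had to come from RAM, the same bus would sustain only about half that (roughly 2.6 GFLOPS); the 5.25 figure assumes the row you're multiplying against is already in cache.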

  • @notapplicable7292
    @notapplicable7292 8 months ago +241

    Cache optimization has been really eye opening for me over the last year.

    • @petrkiyashko4248
      @petrkiyashko4248 8 months ago +14

      Same here, and also branch prediction and cpu pipelining lately

    • @alexhiatt3374
      @alexhiatt3374 8 months ago +19

      same! I think people focus too much on "heap vs. stack" instead of "cached vs. uncached". the stack is way faster because it's basically always in cache, but heap usage doesn't necessarily need to be slow.

    • @astronemir
      @astronemir 8 months ago +1

      It's also all wasted effort whenever cache layout, cache size, or CPU instructions change.

    • @yegorzakharov8514
      @yegorzakharov8514 8 months ago

      Cache me daddy outside

    • @TheSnowMan2023
      @TheSnowMan2023 8 months ago

      @@Jim-gyswry i think it depends on the problem you are trying to solve and the framework/languages and packages/libraries you are using.

  • @martijn3151
    @martijn3151 8 months ago +114

    The single most important take here: always profile your code. Don't assume, measure. (A minimal timing sketch follows this thread.)

    • @unbreakablefootage
      @unbreakablefootage 8 months ago

      Ye and look at the instructions generated

    • @DFPercush
      @DFPercush 7 months ago +1

      -O3 or -Ofast, and -march=native, are good for squeezing the most performance out of your compiler on your test rig. If you control the server, or in source distros, you can leave it that way, but for public binary distributions it's good to have a well-defined set of features to target, like plain x86, -mavx, etc.

    • @martijn3151
      @martijn3151 7 months ago +1

      @DFPercush those compiler flags do come with a risk. I've seen games suddenly behaving differently when using -Ofast, for instance, which is definitely not something you want in the optimization phase. There is a reason optimization flags can be switched on and off: for some applications the shortcuts generated by the compiler are acceptable and the speed increase is worth it. For other applications, the speed increase isn't all that much, but the risks are. So, be careful with using them would be my advice.

    • @blindshellvideos
      @blindshellvideos 5 months ago

      what does the -march flag do anyway? @@DFPercush

    • @DFPercush
      @DFPercush 5 months ago

      @@blindshellvideos -m is used for a bunch of different options. I think it means "machine" or something like that. One of them is "arch" = "architecture". In other words what kind of CPU to compile for. This is how you can cross compile for other types of machines. But "native" means whatever processor the compiler is currently running on. This will allow it to take advantage of any advanced features you have on your machine like AVX, encryption acceleration hardware, etc. Otherwise, you can specify those features with flags like -mavx2.
      man gcc | grep -C 20 -- '-m' | less
      and look for x86 options.
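
      On "don't assume, measure": a minimal timing-harness sketch in C++ (the kernel is a placeholder for whatever code is being profiled):

        #include <algorithm>
        #include <chrono>
        #include <cstdio>

        // Wall-clock time of one call to f(), in milliseconds.
        template <typename F>
        double time_ms(F&& f) {
            const auto t0 = std::chrono::steady_clock::now();
            f();
            const auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        }

        int main() {
            auto kernel = [] { /* code under test */ };
            double best = 1e300;
            for (int run = 0; run < 10; ++run)           // repeat and keep the best run,
                best = std::min(best, time_ms(kernel));  // so cold caches don't skew it
            std::printf("best: %.3f ms\n", best);
        }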

  • @m4rt_
    @m4rt_ 8 months ago +84

    16:40 if you are doing game programming and know what size the matrices will be, for example 4x4, you can just hardcode the multiplication.
    Since the size is set you don't need General Matrix Multiplication (GEMM), which is the algorithm he is optimizing.
    With GEMM you don't optimize for the specific size, you make something that can handle all sizes, but then may not have the best performance.
    When you know that the size will always be 4x4 you can make it do it the most optimal way.

    • @heavymetalmixer91
      @heavymetalmixer91 8 months ago +4

      Asking from a novice's perspective: Can you put an example in C++ here of how you would hardcode a 4x4 optimized multiplication?

    • @theairaccumulator7144
      @theairaccumulator7144 8 months ago

      @@heavymetalmixer91 use a library like glm

    • @Ulchie
      @Ulchie 8 months ago +24

      @@heavymetalmixer91 My assumption is that you just manually unroll any loops and have every multiplication laid out, probably using SIMD intrinsics to really optimize it (see the sketch at the end of this thread).

    • @m4rt_
      @m4rt_ 8 months ago

      @@heavymetalmixer91 I don't have a C++ version handy, but I have a Jai version. It should be relatively readable. (btw --- means that it is uninitialized).
      multiply :: (m: Matrix4, n: Matrix4) -> Matrix4 {
          result: Matrix4 = ---;
          result._11 = m._11*n._11 + m._12*n._21 + m._13*n._31 + m._14*n._41;
          result._21 = m._21*n._11 + m._22*n._21 + m._23*n._31 + m._24*n._41;
          result._31 = m._31*n._11 + m._32*n._21 + m._33*n._31 + m._34*n._41;
          result._41 = m._41*n._11 + m._42*n._21 + m._43*n._31 + m._44*n._41;
          result._12 = m._11*n._12 + m._12*n._22 + m._13*n._32 + m._14*n._42;
          result._22 = m._21*n._12 + m._22*n._22 + m._23*n._32 + m._24*n._42;
          result._32 = m._31*n._12 + m._32*n._22 + m._33*n._32 + m._34*n._42;
          result._42 = m._41*n._12 + m._42*n._22 + m._43*n._32 + m._44*n._42;
          result._13 = m._11*n._13 + m._12*n._23 + m._13*n._33 + m._14*n._43;
          result._23 = m._21*n._13 + m._22*n._23 + m._23*n._33 + m._24*n._43;
          result._33 = m._31*n._13 + m._32*n._23 + m._33*n._33 + m._34*n._43;
          result._43 = m._41*n._13 + m._42*n._23 + m._43*n._33 + m._44*n._43;
          result._14 = m._11*n._14 + m._12*n._24 + m._13*n._34 + m._14*n._44;
          result._24 = m._21*n._14 + m._22*n._24 + m._23*n._34 + m._24*n._44;
          result._34 = m._31*n._14 + m._32*n._24 + m._33*n._34 + m._34*n._44;
          result._44 = m._41*n._14 + m._42*n._24 + m._43*n._34 + m._44*n._44;
          return result;
      }
      and here is a simplified version of the struct Matrix4
      Matrix4 :: struct {
          _11, _12, _13, _14 : float;
          _21, _22, _23, _24 : float;
          _31, _32, _33, _34 : float;
          _41, _42, _43, _44 : float;
      }

    • @oblivion_2852
      @oblivion_2852 8 months ago

      I have a strong suspicion that the SIMD registers being wider than the data, with the added latency, would actually make it slower than just raw assembly on the memory addresses.
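
      For the C++ version asked about earlier in this thread, a common SSE sketch of the fully unrolled approach (row-major storage; an illustration, not the video's code). Each row of the result is a linear combination of n's rows, so no transpose is needed at all:

        #include <xmmintrin.h>

        // result = m * n, all three row-major 4x4 float matrices.
        void mul4x4(const float* m, const float* n, float* result) {
            const __m128 n0 = _mm_loadu_ps(n + 0);   // row 0 of n
            const __m128 n1 = _mm_loadu_ps(n + 4);   // row 1
            const __m128 n2 = _mm_loadu_ps(n + 8);   // row 2
            const __m128 n3 = _mm_loadu_ps(n + 12);  // row 3
            for (int i = 0; i < 4; ++i) {
                // row i of result = sum over k of m[i][k] * (row k of n)
                __m128 r =        _mm_mul_ps(_mm_set1_ps(m[4*i + 0]), n0);
                r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[4*i + 1]), n1));
                r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[4*i + 2]), n2));
                r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(m[4*i + 3]), n3));
                _mm_storeu_ps(result + 4*i, r);
            }
        }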

  • @drooplug
    @drooplug 8 months ago +13

    Matt Parker wrote a Python script that he admitted wasn't very good, but it got the job done. It took something like 30 days to complete on his laptop. He shared it with his viewers and a number of people rewrote it to run faster. I think the first rewrite reduced the run time to a few hours in Python. Eventually, someone submitted code in another language that got it down to something like 2 ms.

    • @GabrielSoldani
      @GabrielSoldani 8 months ago +4

      That’s pretty different, though. Parker used an inefficient algorithm. The community first found more efficient algorithms and then optimized it for cache hits and parallel execution. This video uses the same algorithm from start to finish.

    • @Z3rgatul
      @Z3rgatul 4 months ago +1

      @@GabrielSoldani are you sure the library doesn't use Strassen's algorithm?

  • @sinom
    @sinom 8 months ago +57

    Relating GB/s to GFLOPS is something that's often done in high-performance multithreaded programming. If you look at a CPU that can handle 64 GFLOPS, every one of those floating-point operations uses two 8-byte numbers and outputs one 8-byte number. That means if every operation required its own new data to be loaded, the required throughput would be 128 GB/s, but your actual bandwidth is probably going to be more like 8 GB/s. So you somehow need to fully utilize all the FLOPS with much less data. In matrix algorithms this is often done by loading tiles one after another and then iterating over those. This introduces more nested loops, but it also significantly reduces the amount of data loading required per floating-point instruction, allowing much higher utilization of the CPU. This is called tiling (see the sketch after this thread), and by now a lot of compilers do a simplified version of it on their own, but they don't know your exact FLOPS and bandwidth specs, so if you do know them you can get even more performance than your compiler can. Basically the exact same thing is done when doing matrix math on the GPU, since there's also a huge bandwidth limit there most of the time; in the GPU case you usually just use a library that already implemented it for you, like Thrust or rocThrust.

    • @BeefIngot
      @BeefIngot 8 months ago +4

      God, what a name those libraries have.
      You really just thrust that upon us.

    • @gauravnegi6857
      @gauravnegi6857 8 months ago +1

      @@BeefIngot as a computer science noob I can't understand a lot, but that's really crazy amazing.
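
      A stripped-down C++ sketch of the tiling described above (tile size illustrative; real kernels derive it from cache and register budgets, and C must be zero-initialized):

        #include <algorithm>
        #include <cstddef>

        constexpr std::size_t T = 64;  // tile edge, tuned to cache size

        // Blocked matmul: each T x T block of B is reused T times while it
        // is still resident in cache, cutting loads per FLOP.
        void matmul_tiled(const float* A, const float* B, float* C, std::size_t n) {
            for (std::size_t ii = 0; ii < n; ii += T)
            for (std::size_t kk = 0; kk < n; kk += T)
            for (std::size_t jj = 0; jj < n; jj += T)
                for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                for (std::size_t k = kk; k < std::min(kk + T, n); ++k) {
                    const float a = A[i*n + k];
                    for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                        C[i*n + j] += a * B[k*n + j];
                }
        }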

  • @sinom
    @sinom 8 months ago +33

    No, simply walking differently would mean you'd also have to walk differently on the other matrix, so that would just shift the problem from one matrix to the other. The reason it works later on is because of that temporary copy being made.

  • @tordjarv3802
    @tordjarv3802 8 months ago +122

    I think this level of optimization is more common in scientific computing than in game development, since SC uses way more data (think peta- to exabytes) than game development.

    • @0dsteel
      @0dsteel 8 months ago +13

      Those probably use specialized hardware (for massively parallel computations), aka not the cpu. Cuda magic and kernels go brrrr

    • @tordjarv3802
      @tordjarv3802 8 months ago +21

      @@0dsteel I'm in that field, and yes accelerators are sometimes used such as GPU, but you will be surprised how much research code still runs on CPU only.

    • @0dsteel
      @0dsteel 8 months ago +1

      @@tordjarv3802 can imagine how 10^n vs 10^(n+2) ticks of simulation per second makes a difference; games are pretty shy on things happening per tick in comparison

    • @OREYG
      @OREYG 8 months ago +10

      Depends how you define scientific computing. In gamedev you usually have to crunch a pretty considerable amount of data in 16 ms, so such optimizations are mandatory. In scientific computing you are not bound by the constraint of having to output something in real time. Then there are some numerically intensive applications that cannot use the fastest available option (for example, they require a certain accuracy). Then there are libraries available to scientists that already employ some of these techniques. What I'm trying to say is that in gamedev it's done as a necessity, while in scientific computing it's an afterthought.

    • @0dsteel
      @0dsteel 8 months ago +4

      @@OREYG Depends how you define constraints. A game engine doesn't have to produce 60+ frames a second, but it helps with sales. Take an engine that runs simulations. 1 second vs 2 minutes to simulate the same model for 1 year delta time is huge.
      Little Tommy experimenting with python scripts for his PhD thesis is not really what comes to my mind at least.
      --- Edit ---
      ;)

  • @Impatient_Ape
    @Impatient_Ape 8 months ago +11

    The FORTRAN spaghetti code I used for research calculations back in the day used the LAPACK library, which, at the time, was the most optimized code for linear algebra in existence.

    • @abdullahahmad1642
      @abdullahahmad1642 8 months ago +2

      It still is in most cases, though there's a Rust library called Faer being developed that looks like it could give it a run for its money...

  • @emanuelfarauanu1760
    @emanuelfarauanu1760 8 months ago +11

    One of my tasks during an internship way back in 2018 was to do such optimisations in C++ on a Raspberry PI, I learnt so much from that. Really cool algs.

  • @viniciuspjardim
    @viniciuspjardim 8 months ago +17

    In my understanding he only stopped transposing the matrix at the end because of the buffers. Without the buffers, a transposed copy of the matrix leads to more cache hits. With the buffer we already have a copy, so he fused two steps into one, transposing it into the buffer; it's still a transposed copy when accessing the buffer memory.

    • @gregorymorse8423
      @gregorymorse8423 4 months ago

      Transpose load from cache vs transpose load from RAM. I thought the same.

  • @manofacertainrage856
    @manofacertainrage856 8 months ago +4

    Wow. I haven't seen this stuff for a while ... or felt the terror of implementing the first few optimizations for a grade. That's after two parallel programming classes in my master's that included OpenMP and many matrix multiplication algorithms. Took a CUDA course a few years afterward and some of the access/transpose problems are still a problem, but the CUDA libs are the answer there too. Awesome video!

  • @PhilfreezeCH
    @PhilfreezeCH 8 months ago +3

    Last year I helped optimize the math library for my university's RISC-V multi-core microprocessors (you can also buy commercial versions called GAP8 and GAP9).
    It was entirely this stuff, just with Fourier transforms and so on.
    We started with C code and by the end almost all inner loops were assembly, it was pretty fun honestly.

  • @woolfel
    @woolfel 8 months ago +1

    It's a good video. Lots of people have talked about how badly cache misses affect performance, but the video explained it in a simple way that more people can understand. In the Java world, lots of talks about mechanical sympathy have addressed this topic.

  • @Kotfluegel
    @Kotfluegel 8 months ago +4

    For small size matrices up to four dimensions, you usually employ specific types for each dimension, because then you can literally unroll all the loops. So, no need to use block matrices for graphic applications.

  • @ArkDEngal
    @ArkDEngal 8 months ago +3

    Loop ordering (locality, locality, locality) was so important that I had at least 3 classes that covered it in college.

  • @leschopinesns100
    @leschopinesns100 8 months ago +2

    I've literally implemented vanilla GEMM in parallelism course at uni, multiplying big ass matrices.

  • @m.sierra5258
    @m.sierra5258 6 months ago +1

    I optimized this very problem in a university competition, and won the competition. One thing they missed in this video: line stride. Not every memory address can be put into every cache line. In most cases, something like modulo 256 is applied to the address to decide which cache bucket it gets put into. Sadly, a matrix row of 1024 floats means that every item of a column gets put into the same cache bucket. So for us it increased throughput substantially to apply a padding of one cache line to every matrix row. This has the effect that columns can be cached efficiently.
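
    A C++ sketch of that padding trick (cache geometry assumed as described above; sizes illustrative):

      #include <cstddef>
      #include <vector>

      // With 1024-float rows (4 KiB), every element of a column maps to the
      // same cache set, so column walks keep evicting each other. Padding
      // each row by one cache line (16 floats) spreads a column across sets.
      constexpr std::size_t N      = 1024;
      constexpr std::size_t STRIDE = N + 16;   // padded leading dimension

      std::vector<float> mat(N * STRIDE);      // row i starts at i * STRIDE

      inline float& at(std::vector<float>& m, std::size_t i, std::size_t j) {
          return m[i * STRIDE + j];
      }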

  • @PiratePawsLive
    @PiratePawsLive 8 months ago +15

    This was awesome, I knew quite a bit about how memory and CPU's work. But the optimization was a bit of beauty even if the code looks like an aesthetic war crime xD. Love it.
    Now I'm interested in learning more about optimizing again :).

    • @devaereo
      @devaereo 8 months ago

      the code aesthetics are really not good. I'm downvoting it

  • @alfred.clement
    @alfred.clement 8 months ago +1

    I was mind-blown when I performed byte shuffling using x64 SIMD with the SSSE3 instruction set, specifically the pshufb instruction.
    I eventually optimized reordering/transformation of complex data with vector instructions, resulting in a significant impact on real performance.
    It was about 88x faster... It matters A LOT, especially when you're dealing with low-latency operations.

  • @suede__
    @suede__ 8 months ago +2

    That's a lot of computer science.

  • @rogierderuijter7009
    @rogierderuijter7009 8 months ago

    Never seen Prime this quiet. Amazing video.

  • @zkreso
    @zkreso 5 months ago +1

    Can't wait to implement this optimization in the ERP software I write at work

  • @josefpharma4714
    @josefpharma4714 6 months ago

    Transposing the memory layout of a matrix can improve cache hits a lot.

  • @alexcristache8830
    @alexcristache8830 6 months ago

    This is really cool. I had to do a very similar thing for one of my university courses called High Performance Computing. Basically simulate fluids using Lattice Boltzmann on a cluster of the university's supercomputer. Was done in C and had to pay attention to all the optimizations he talks about there. Memory access, cache, using OMP and so on. It was really cool, but extremely challenging. I guess this is used more in academia for different simulations like the one I mentioned.

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 8 months ago

    Watched this video a while back. It’s really good.

  • @filiformis
    @filiformis 8 months ago +1

    I encountered this stuff in my Computer Architecture class in college, but I haven't had the opportunity to use this knowledge yet.

  • @chaselewis6534
    @chaselewis6534 8 months ago +2

    On the question whether this scales for N=4 or just larger matrices: N=4 fits into 4 SIMD registers, so you can do the entire set of operations in register. In fact you generally load both into register, do a 4x4 transpose on one in register, then do the multiply, since transposing a 4x4 in register is only a couple of instructions and lets you do all the multiply-adds in parallel afterwards. So there is another level of optimization at that point, but that is just because the matrices fit entirely in register. Once you get outside fitting everything in registers, this is how GEMM works.
    There are more optimal algorithms, though, that get into fancier math and let the matrix calculation scale below N^3, more like N^2.4 or so.

  • @Blubb3rbub
    @Blubb3rbub 8 months ago +1

    Matrix Multiplication becomes even more interesting if you reach the "I can't reasonably fit my matrix into memory"-size of matrices and need to employ MPI and distributed memory.

  • @besknighter
    @besknighter 8 months ago +1

    In games, we rarely use huge matrices, if ever. However, depending on the game, the number of matrix multiplications itself is so so soooo big that it might be useful to optimize them this much.

  • @josephlabs
    @josephlabs 8 months ago +1

    That’s crazy 😂. I aspire to be doing this one day soon.

  • @BleachWizz
    @BleachWizz 5 months ago

    5:13 - I think the idea is really to transpose the matrix in memory: we want the columns to be close together, so when the first multiplication pulls a block into cache, as many of the numbers needed by the next multiplications as possible come with it. Otherwise the indices are spaced at intervals of the matrix dimension, and you read far fewer useful numbers at a time, up to re-reading memory on every multiplication.
    16:17 - the problem with walking differently is that you have to check whether you CAN do that. In this case localA and localB are both brought completely into cache, because now you're calculating different parts of the matrix at a time, so the result needs to fit completely in cache as well, or THIS PART will start causing problems.

  • @theoballet1519
    @theoballet1519 8 months ago

    Another optimization not mentioned here: using the superscalar nature of our processors by creating a temp vector that stores the products of elements from the two lines. The multiplications are performed in parallel, and the sums are performed once the prior multiplications are done. (This results in more code and more loops, but more perf too, especially for big lines, since you won't have too many branch-prediction misses.)
    This is an optimization used by BLAS. Another thing could be using an INT representation, since modern CPUs have more integer functional units than float ones.
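
    A minimal C++ sketch of that independent-accumulator idea (the generic instruction-level-parallelism trick, not BLAS's actual kernel):

      #include <cstddef>

      // Four accumulators with no dependencies between them, so a superscalar
      // CPU can overlap the multiply-adds instead of serializing on one sum.
      float dot(const float* a, const float* b, std::size_t len) {
          float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
          std::size_t i = 0;
          for (; i + 4 <= len; i += 4) {
              s0 += a[i + 0] * b[i + 0];
              s1 += a[i + 1] * b[i + 1];
              s2 += a[i + 2] * b[i + 2];
              s3 += a[i + 3] * b[i + 3];
          }
          for (; i < len; ++i) s0 += a[i] * b[i];  // leftover elements
          return (s0 + s1) + (s2 + s3);
      }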

  • @TQuantP
    @TQuantP 8 months ago

    I wanted to watch this video for a while now; it was just sitting there on a tab so old it was beginning to gather dust. Thankfully now I can! Thanks Prime 😂

  • @QayLikeKay
    @QayLikeKay 2 months ago

    Even small cache lines are 32 bytes long, and a 4x4 float matrix is 64 bytes, so each matrix spans just two such lines, and on a 64-byte-cache-line system a whole matrix *could* fit in a single line. People in chat saying you're wrong forget that pretty much everything in computing is a trade-off, and many optimizations come at the cost of some fixed setup time (or a setup time that scales slower than the normal case). As one of my professors once said, "Bubble sort beats small children" (English was not his first language). He meant that bubble sort beats algorithms like quicksort on small, child-sized data sets, since it can just go; it doesn't have to allocate extra buffers or do anything else. That said, "bubble sort beats small children" sticks in your brain way better.

  • @Peregringlk
    @Peregringlk 4 months ago

    7:24 If you don't care about the order of elements of the array, you can remove an element at any position by swapping it with the last value and then pop. That way you don't need to "move" things around.
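
    That swap-and-pop in a C++ sketch (order-destroying by design):

      #include <cstddef>
      #include <vector>

      // O(1) unordered erase: overwrite slot i with the last element, then
      // drop the last slot. Only valid when element order doesn't matter.
      void swap_remove(std::vector<int>& v, std::size_t i) {
          v[i] = v.back();   // also fine when i is the last index
          v.pop_back();
      }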

  • @ArkhKGB
    @ArkhKGB 8 months ago

    Blast from the past: I had to do the same kind of thing in engineering school. Try to optimize matrix multiplication, be happy with the results (same thing about finding cache sizes, but no SIMD at the time). Then you whip out BLAS and it outperforms your work by one or two orders of magnitude.
    And yeah, those big matrix multiplications are useful in most simulations, for things like weather forecasting or fluid simulation.

  • @zayniddindev
    @zayniddindev 8 months ago +14

    In the first minute, I figured out that this video is not for me (yet) xD

    • @russelllapua4904
      @russelllapua4904 8 months ago

      Didn't even get that far 😂

    • @llothar68
      @llothar68 8 months ago +5

      You are never too young to learn about cache locality.
      The matrix multiplication is just a use case, no need to understand that.

    • @shahbazkhan-ek7hp
      @shahbazkhan-ek7hp 8 months ago

      28 seconds 😁

  • @shapelessed
    @shapelessed 7 months ago

    3:20 - Technically, if you have a 32-bit float, which is 4 bytes, you can squeeze about "5 GFLOPS" through a 20 GB/s memory bus. I think that's the analogy here...

  • @zataritamods7499
    @zataritamods7499 6 months ago

    Transposing the matrix reminds me a lot of the concept of swizzling, and doing the blocks reminds me a lot of the tiling method used by the PS2. It feels like this is generalizing the concept for matrix multiplication.
    It also really helps me conceptualize why someone would do something like this.

  • @dominikmuller4477
    @dominikmuller4477 3 days ago

    there have been recent improvements to the matrix multiplication algorithm. Just like multiplication of scalars has some convoluted improvements, you can multiply some sums of sub-matrices and get away with vastly fewer multiplications.

  • @GRHmedia
    @GRHmedia 5 months ago

    This used to be the standard of what was taught in college when it came to programming. It is why they used to require you to learn ASM before you could take C and C++.
    A project I worked on 6 years ago was to create a large maze with many attached rooms, about 40,000. The first version would take days to finish. So I rewrote it and it was taking hours. I rewrote it again and got it down to 30 minutes. Another rewrite and I was down to 8 min, then 2.5 min, then 30 seconds. The final version: 0.9 seconds.
    That project was purely on a single thread on a 3 GHz A8-3850.
    I've seen much larger changes in performance stuff that would run weeks and months reduced to seconds. Most of those also ended up having a change in language used as well.

  • @tomaszrusnak5684
    @tomaszrusnak5684 8 months ago

    As a structural engineer I work with programs that do strength calculations on exactly these kinds of matrices, and the more data (the finer the structure is divided into finite elements), the more accurate the results we get. For over 10 years the computations haven't gotten any faster, and calculating matrices of 200,000 x 200,000 elements is already a challenge, not to mention that many solvers are written without making use of multiple processors. I have decided to take up learning programming seriously (not just as a hobby as until now) in order to start optimizing solvers in the construction industry. Great video.

  • @thapr0digy
    @thapr0digy 8 months ago

    8:43 Prime realizes that this guy's maths is blowing his mind.

  • @GRHmedia
    @GRHmedia 5 months ago

    The stated rate a CPU can do is based on the program running entirely inside cache. If the program has to reach out to memory for new variables, then it is limited by memory bandwidth.

  • @oliverer3
    @oliverer3 8 months ago +1

    This is the stuff that makes me question if I can even call myself a programmer. Any optimization I've ever had to do has been very basic stuff. I can't even begin to wrap my head around this.
    Then again I've never really done any compute-heavy programming.

  • @swannie1503
    @swannie1503 8 months ago +1

    I've had to do quite a bit of cache optimization. Rust is very good for it, actually. I enjoy the problems: similarity matrices with hundreds of thousands of members' features. It makes a huge difference.

  • @ivanjelenic5627
    @ivanjelenic5627 8 months ago +1

    Also, if you know something about the matrices you can decompose them and do stuff more efficiently for some operations.

  • @7th_CAV_Trooper
    @7th_CAV_Trooper 4 months ago

    two guys having an argument over which compiled language is faster and this guy walks up, transposes some matrices, lights the computer on fire and then tells them it could be faster if he just walked the array backwards and walks out of the room.

  • @MrHaggyy
    @MrHaggyy 7 months ago

    I'm doing automotive stuff. For small matrices, reorganizing doesn't work that well; you load or create the matrix in the right order. What we do is "slide" across the matrices by rotating pointers over the memory we are given. This allows you to fill your SIMD operation with the demand from multiple mathematical operations. Also, there are many hardware solvers for matrix multiplication that can run in parallel to the CPU.

  • @richardpowell1425
    @richardpowell1425 5 months ago

    The applications for large matrix multiplication include engineering and science. Important to get fast code.

  • @mubin-ansari
    @mubin-ansari 8 months ago +1

    Prime without hoodie? This is going to be a great article

  • @devaereo
    @devaereo 8 months ago +1

    This video made me feel dumb and I want to go back and study linear algebra again.

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 8 months ago

    8:35 This is the point where you see in Primagen’s eyes how his brains got fried by maths.

  • @ward7576
    @ward7576 8 months ago

    2:58 that's about the type of optimizations that a company, doing WordPress plugins, would ask you to do in your technical interview

  • @nexovec
    @nexovec 8 months ago +3

    It's probably true that you don't have to transpose the matrix if it's float32 4x4, because the entire thing is 64 bytes, which is one cache line, so in that extremely common case it can actually be made even faster, you just load it into the SIMD registers directly from the definition, done.

    • @vladlu6362
      @vladlu6362 8 months ago

      Except that, if you are running m x n matrices and not n x n matrices, it's not in a single cache line. It's only in the same cache line if you can guarantee that they are allocated all at once, and that's not always the case.

    • @nexovec
      @nexovec 8 months ago

      @@vladlu6362 I only meant 4x4 times 4x4, which is a pretty common case... What do you mean by allocated all at once. Never would I imagine anything else than having them contiguously in memory.

    • @kazedcat
      @kazedcat 8 months ago

      @@nexovec cache alignment problem. For example, you have non-vector data, like the determinant, that you need for your calculation over and over, so you want it stored next to your 4x4, but the determinant value will cause your 4x4 to straddle two cache lines.

    • @nexovec
      @nexovec 8 months ago

      @@kazedcat Sure, always #align 64 or whatever the thing was.

    • @DFPercush
      @DFPercush 7 months ago

      @@kazedcat That's the main advantage of a struct-of-arrays approach. The matrices can live in an alignas(64) array, and the det's or whatever can live in an alignas(4) array. You just have to write container.matrix[i] and container.det[i] instead of container[i].matrix and container[i].det .
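
      A sketch of that struct-of-arrays layout in C++ (hypothetical container; C++17 for the over-aligned vector elements):

        #include <vector>

        // AoS would interleave each 64-byte matrix with its 4-byte determinant,
        // pushing later matrices off cache-line boundaries. SoA keeps each
        // field packed and aligned on its own.
        struct Mat4 { alignas(64) float m[16]; };

        struct Matrices {
            std::vector<Mat4>  matrix;  // each element starts on a cache line
            std::vector<float> det;     // determinants packed separately
        };
        // Access: container.matrix[i] and container.det[i]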

  • @seancooper5007
    @seancooper5007 8 months ago +1

    That's some for loop nesting right there

  • @kitastro
    @kitastro 8 months ago

    9:14 I like that music that comes in

  • @user-ud8hw4gp6t
    @user-ud8hw4gp6t 3 months ago

    11:47 this is called a native array, since it is fixed in memory and you can't push

  • @omdevs
    @omdevs 8 months ago

    This is amazing!

  • @d_9696
    @d_9696 8 months ago +1

    This level of optimization is probably only needed for numerical applications and gaming

    • @bennymountain1
      @bennymountain1 8 months ago

      optimization and gaming in one sentence LMAO

  • @enviidakid
    @enviidakid 8 months ago

    lmfaooo this video got me in tears right now

  • @russelllapua4904
    @russelllapua4904 8 months ago +2

    This video has solidified that maths just isn't for me.

  • @Veptis
    @Veptis 8 months ago

    This seems like great engineering. Good to be able to simply import MKL.

  • @hannessteffenhagen61
    @hannessteffenhagen61 8 months ago

    For small matrices the main thing is that parallelising the work for multiplying two matrices makes no sense anymore, the overhead of distributing the work would far exceed the actual work that needs to be done. Some of the other things mentioned in the video may still make sense though.

  • @KamillaMirabelle
    @KamillaMirabelle 8 months ago

    We needed to do that in a homework assignment in a course 😅

  • @nocopyrightgameplaystockvi231

    Just going to use Matmul to escape this Armageddon

  • @markuszeller_official
    @markuszeller_official 8 months ago

    Uncle Bob is going to die if he sees these nested loops.

  • @Hr1s7i
    @Hr1s7i 6 months ago

    3:40 It's the same relation as the one between moments and velocity in mechanics. The CPU can perform that many billions of floating-point operations in a single second, while the RAM can "process", i.e. store/cycle, that many in a single second. RAM is a capacitive, hard-wired reactive circuit, which means you have to wait for the rather long transient processes to do their thing. Couple that with the fact that you expend part of the bandwidth on communication (the CPU's memory controller is the equivalent of a train station tossing trains at other stations, be it memory, the GPU, or in some cases SSDs) and you might end up with seemingly low performance under certain conditions. A good example of how this weakness can surface is when you operate on enormous active data sets. In the meantime, your PC can look like an absolute monster when playing a video game, where once you load the data necessary for the simulation to run, very little of it changes on the logic side of things. As a matter of fact, most game behaviours are de facto hard-coded specifically to prevent overhead. Meanwhile, driving a PID sim in Matlab can easily employ 4-5 cores when done in a non-cheaty way.

  • @deldrinov
    @deldrinov 8 months ago

    I'm a novice in programming at best, but I've encountered something similar once.
    I wanted to add attributes to each member of an array which themselves were in an array; pretty much extending a 2D array, if my lingo is correct.
    I did that in 2 nested for loops, and the initial code ran like crap. I swapped the for statements without any other changes and speed increased 40x.

    • @DFPercush
      @DFPercush 7 months ago

      There's a similar thing in image processing. It's usually better to loop for (y=...) on the outer loop and for (x=...) in the inner loop. That way you're accessing memory contiguously in a horizontal manner and not skipping several cache lines on every pixel.
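
      The image-processing version of the same fix, sketched in C++:

        // Row-major image: pixels within a row are contiguous, rows are
        // `width` apart. y-outer/x-inner touches memory sequentially;
        // swapping the loops strides by `width` on every pixel and skips
        // across cache lines exactly as described above.
        void brighten(unsigned char* img, int width, int height, unsigned char delta) {
            for (int y = 0; y < height; ++y)        // outer: rows
                for (int x = 0; x < width; ++x)     // inner: contiguous pixels
                    img[y * width + x] += delta;    // (wraps on overflow; just a sketch)
        }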

  • @forinda
    @forinda 8 months ago

    I have never seen @primeagen agree with almost everything like this. This guy is fluent in math.

  • @burloiumarian2665
    @burloiumarian2665 8 months ago

    Some examples of those libraries you mentioned for Node.js?
    Just curious :D

  • @Felix-ve9hs
    @Felix-ve9hs 8 months ago

    I barely know anything about algorithms, programming, math, or optimization, but it's pretty impressive how knowledge about hardware can improve performance :)

  • @gingeral253
    @gingeral253 8 months ago +1

    This cache optimization goes against the programming principles I had learned.

  • @Gamex996
    @Gamex996 2 months ago

    I once optimized Python code from 50 min to 5 min with some C-binding libraries and multithreading + multiprocessing on a 12-core/24-thread CPU.
    Felt great, the best feeling I ever had after that hot garbage worked; later I threw it at a colleague to finish it.

  • @guywithknife
    @guywithknife 8 months ago

    Regarding “is that what I said before, that we can walk the matrix on a different order instead of transposing” - no, the earlier algorithm required transposition because of memory prefetching, the reason he can walk it differently in the later version is that the block fits entirely into cache this time. Walking in a different order doesn’t affect cache when the entire thing fits in cache. So while you did say that you could do this, it doesn’t work for the initial version, only for the block version.

  • @mannycalavera121
    @mannycalavera121 8 months ago

    I'd just be happy to make the visualisations

  • @michaelt126
    @michaelt126 7 months ago

    I'm pretty sure modern compilers already use a combination of prefetching and blocking for large arrays without having to include additional libraries. Maybe this is not standard for all languages, but it's worth knowing before adding a bunch of redundant code.

  • @rotors_taker_0h
    @rotors_taker_0h 8 months ago

    Then you find yourself counting execution-port limits, micro-op cache, branchless logic, whatever.
    The very next level of optimization (for those curious enough) is to separate "logic" of transforming data (matrix multiplication or whatever) and "schedule" (correct blocking, SIMD, multithreading, etc) and let a compiler or special optimizer do it for you. And they can optimize _several_ operations in the best way. Google for stuff like Halide, Taichi, TVM, Tiramisu compiler, etc. This stuff may optimize for different hardware too (DSP, GPUs, etc). They make camera pipelines with that, neural networks, rendering, all kind of stuff. Computers are crazy fast and crazy difficult to program.

  • @spamviking8591
    @spamviking8591 8 months ago +2

    Some of the best devs I know suffer from premature optimization.

  • @sanderbos4243
    @sanderbos4243 8 months ago +2

    7:03 If you have a Set of 10 items, rather than transforming it into an array and doing O(n) removal by shifting everything over, you can just make it O(1) using the "swap-remove" `arr[i] = arr[--len];` that swaps the last element into the position of the removed element. Note this presumes it's a Set, so the order doesn't matter, but you said it was a Set. :)

    • @YumekuiNeru
      @YumekuiNeru 8 months ago +1

      Was the point not that O-notation does not take into account how the data structure is stored in memory, so an array will probably be faster even if it naively says O(n) or whatever?

    • @Marius426
      @Marius426 8 months ago

      ... unless you want to keep it sorted for binary search access, then you do have to move the elements. But again, for small enough sets it doesn't matter there either.

    • @sanderbos4243
      @sanderbos4243 8 months ago

      @@YumekuiNeru Sure, but my single-line suggested approach of removing items from the array is an improvement of the approach prime suggested where you're moving way more items. Why shift over say 5 items when you can just swap. It does depend on it being a Set, but that was prime's hypothetical.

    • @sanderbos4243
      @sanderbos4243 8 months ago

      @@Marius426 If your elements are integers and you want to find an element in it, you can do it in O(1) as well rather than binary search's O(log N). You just need to maintain a second array that is the same length, where you index it with the item you're looking for and it tells you what index it's at. No hashing required. This of course doesn't really matter for performance when dealing with 10 integers, but with more integers it can. I hear you say "but my elements often aren't integers", but you can let those integers be indices into an array of the structs or whatever you're looking for. Think of them as entity IDs. This costs you another level of indirection, but on the other hand swapping integers is faster than swapping large structs. But this is admittedly going quite far, just wanted to bring up the possibility.
      "We can solve any problem by introducing an extra level of indirection."
      "…except for the problem of too many levels of indirection,"

  • @johanngambolputty5351
    @johanngambolputty5351 8 months ago

    Did I see a Manjaro terminal, hell yeah! (oh and something like astronvim too, respect)

  • @hkravch
    @hkravch 8 months ago

    Bro is using that Star Trek "insert technobabble" thing

  • @diegocassinera
    @diegocassinera 8 months ago

    You need to move the data from memory to registers, do the operation in registers, and move the result back to memory.

  • @BrainySmurf77
    @BrainySmurf77 8 months ago

    I spent a large part of my PhD doing optimizations like this to my CFD codes so I could graduate faster.

  • @mbarrio
    @mbarrio 8 months ago

    How would one cache optimize using rust?

  • @bourne_
    @bourne_ 5 months ago

    6:50 that's like using a functional language vs an object-oriented language. Not really, but kind of 😅

  • @cblair1353
    @cblair1353 8 months ago

    Linear algebra is fucking magic.

  • @apollolux
    @apollolux 8 months ago

    Even just adding the technique of transposing the second matrix to ease up the indexing loop (regardless of how much matrix transposition costs in your implementation) is one of those things that you just have to see in action visually to better understand and appreciate. This video was fantastic if only for that.

  • @S-we2gp
    @S-we2gp 4 months ago

    The world would grind to a halt if every programmer had to deal with this, just solve this for me bro and give me a library and some rules.

  • @hydrobolix3365
    @hydrobolix3365 8 months ago

    And if you walk into Google with a nested O(n^2) you're fired

  • @plurallyplurally7948
    @plurallyplurally7948 8 months ago

    What material is required in order to learn about this?

  • @Stay_away_from_my_swamp_water

    Just imagine competing with this guy for a job

  • @mcafalchio
    @mcafalchio 8 months ago

    So much optimization, looks like he needs an ASIC to run it.

  • @marble_wraith
    @marble_wraith 8 months ago

    yo dawg i heard you like nested loops. So i got some nested loops for your nested loops.

  • @yuack2398
    @yuack2398 8 months ago

    As a researcher using HPC computers daily for physics, this kind of performance gain is really helpful. But the problem is, most users of HPC platforms know nothing about computer science itself.