
Optimizing Loops In Go | Prime Reacts

  • Added 18 Aug 2024

Comments • 184

  • @abdia-lb2yx
    @abdia-lb2yx 6 months ago +107

    I ain't gonna lie, having influencer-style videos in the coding world (just entertainment, not educational) helps me stay on track when I don't feel motivated to study. Keep it up bro.

  • @DjoumyDjoums
    @DjoumyDjoums 6 months ago +117

    My guess on why 3 is slower than 2: index bounds checks. Compilers can very often eliminate them when you loop on the actual length of the array, rather than on a pre-cached numerical value.
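
    A minimal sketch of the two loop shapes being compared (illustrative function names, not from the video); whether the per-iteration check survives can be inspected with the compiler's bounds-check diagnostic, e.g. `go build -gcflags=-d=ssa/check_bce` on recent toolchains:

```go
package main

// sumRange loops on len(a) directly; the compiler can usually prove
// i is always in bounds and drop the per-iteration check.
func sumRange(a []int) int {
	total := 0
	for i := 0; i < len(a); i++ {
		total += a[i]
	}
	return total
}

// sumCached loops on a separately stored length; the compiler cannot
// assume n <= len(a), so a bounds check may remain in the loop body.
func sumCached(a []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += a[i]
	}
	return total
}

func main() {
	a := []int{1, 2, 3, 4}
	_ = sumRange(a)
	_ = sumCached(a, len(a))
}
```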

    • @grifon97
      @grifon97 6 months ago +9

      Sounds like a very good guess to me! :D

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +19

      @@chigozie123 Not true.

    • @Pipe0481
      @Pipe0481 6 months ago +21

      @@chigozie123 That's just completely false lol

    • @chigozie123
      @chigozie123 6 months ago

      @@Pipe0481 I should clarify that I'm comparing it to what some C/C++ compilers do. If you look up the optimizations done by the Go compiler, you will find that the only optimization even worth mentioning is the conversion of "zeroing out a slice" into a simple memclr call. In C, we get that for free using `memset`.
      Maybe also function inlining, but that's nothing given that in C, the `inline` keyword is free to use. That's just the tip of the iceberg. Compare that list to what GCC does, and you will see why it's difficult to claim that the Go compiler is an optimizing compiler.
      The main reason I said that Go doesn't do optimizations is that in C, there is no reason why the second and third code snippets should not run at the same speed. The compiler should have optimized the len call out of the second, making it pretty much the same as the third.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +4

      @@chigozie123 The for loop with the range clause is clearly better optimized by the compiler than the standard for loop with a condition clause, as he has shown. I agree about the len call - the compiler should have optimized it away - but he measured with the time package rather than benchmarking, and did not show us standard deviations. In Go you can opt out of inlining. I simply disagree with the statement that the Go compiler doesn't do exhaustive optimizations, as long as they are worth doing and would not bloat compile times.

  • @ColinFox
    @ColinFox 6 months ago +16

    Go uses symbol lifetimes to determine whether to allocate on the stack or on the heap. If the object survives after the function returns, it's on the heap. Pretty straightforward.
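
    A minimal sketch of that rule (illustrative names, not from the video); the compiler's escape-analysis report can be printed with `go build -gcflags=-m`:

```go
package main

// sumLocal's buffer never outlives the call, so the compiler can
// typically keep the backing array on the stack.
func sumLocal() int {
	buf := make([]int, 8)
	total := 0
	for i := range buf {
		buf[i] = i
		total += buf[i]
	}
	return total
}

// makeBuf returns its buffer, so it survives the function and must
// escape to the heap.
func makeBuf() []int {
	return make([]int, 8)
}

func main() {
	_ = sumLocal()
	_ = makeBuf()
}
```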

  • @samsqu4red
    @samsqu4red 6 months ago +11

    Found the channel just a few weeks ago.
    Coming from a background in Webflow, with only a grasp of HTML, CSS, and a bit of JavaScript, I see myself as a puppy. Recently, I've decided to take coding more seriously because it's interesting, and I also have a web app idea that I'm genuinely determined to bring to life.
    Though 90% of what the man is saying still flies right over my head, I feel like my understanding is expanding. I've never attended school, worked in-house with a team, or studied with guidance. It seems like I found a semblance of a mentor who's bombarding me with jargon I can't possibly comprehend, but who is ultimately helping me see through the fog in a way.
    Feels like little by little I'm starting to grasp how everything is connected... weird.

  • @susiebaka3388
    @susiebaka3388 6 months ago +17

    "The more you write standard code, the more likely you're generally doing well." - ThePrimeagen

  • @andygarfield6529
    @andygarfield6529 6 months ago +22

    There’s all sorts of things he could have done here. Changing the order of the loops can give you wild improvements because it can change the proportion of cache hits. Also loop unrolling. Also using arrays in Go (which are a real thing) instead of slices.

    • @CjqNslXUcM
      @CjqNslXUcM 6 months ago

      the loop order is correct, it goes through the innermost slice first.

    • @SaHaRaSquad
      @SaHaRaSquad 6 months ago +7

      If you do something like manual loop unrolling, always benchmark it, because most likely it will actually make your code slower by blocking much better automatic optimizations.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +1

      In my experience, arrays give me worse performance than slices. Ever since I checked that, instead of declaring a := [5]something, I always do: a := make([]something, 5, 5)

    • @Zullfix
      @Zullfix 6 months ago +1

      Agreed. His single-array test could have been much faster if nearby coordinates had been stored contiguously.

    • @Yawhatnever
      @Yawhatnever 6 months ago +2

      Loop unrolling might be one of those things where you end up fighting the compiler (you'd need to measure), but loop tiling might have a real chance of increasing cache hits.

  • @eni4ever
    @eni4ever 6 months ago +138

    blue hair no?

  • @HairyPixels
    @HairyPixels 6 months ago +37

    The compiler writes better code than you but what if you're writing a compiler? BRAIN MELTING

    • @ninocraft1
      @ninocraft1 6 months ago +11

      if you write your own compiler, it's probably gonna be even slower xD

    • @Patterner
      @Patterner 6 months ago +10

      just compile the compiler with the compiler. repeat. ... profit

    • @OkAtheistExplainVim
      @OkAtheistExplainVim 6 months ago +8

      if the compiler compiles your compiler is it really your compiler?

    • @HairyPixels
      @HairyPixels 6 months ago +3

      @@Patterner In Soviet Russia compiler compiles you.

    • @joranmulderij
      @joranmulderij 6 months ago +1

      @@OkAtheistExplainVim If you wrote the compiler that wrote the compiler then yes

  • @jagc2206
    @jagc2206 6 months ago +7

    There is a big distinction between arrays and slices in Go. This video uses slices, which are pointers to arrays. A static array has a compile-time length and allows the compiler to do a lot more optimization.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      My experience with performance is the opposite, so I stick with slices.

  • @philipp04
    @philipp04 6 months ago +1

    A year ago my prof had us write Gaussian elimination, but the algorithm we had to implement was slightly different for each of us. I had to find the inverse of a matrix using Cholesky decomposition, which is designed to be fast for large symmetric matrices. I'll cut to the chase: the code I wrote was slow as hell and I couldn't figure out what was going on, though I could tell that I had tons and tons of cache misses. The professor, though, could tell what I did wrong immediately: the matrix was stored as a 1D array row by row, but I was traversing it column-wise. I transposed the matrix and swapped all the indices in my code, and the total runtime decreased by a factor of 20 (on 8000x8000 matrices). Moral of the story: go through your data sequentially, it makes a huge difference.
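
    A minimal sketch of that access-pattern difference in Go (illustrative names, not the commenter's actual code): the matrix is one flat row-major slice, and the only change between the two functions is the traversal order.

```go
package main

// sumRowMajor walks memory in storage order: fully sequential.
func sumRowMajor(m []float64, n int) float64 {
	var total float64
	for row := 0; row < n; row++ {
		for col := 0; col < n; col++ {
			total += m[row*n+col]
		}
	}
	return total
}

// sumColMajor touches the same elements but strides n*8 bytes per
// step, so almost every access can miss the cache for large n.
func sumColMajor(m []float64, n int) float64 {
	var total float64
	for col := 0; col < n; col++ {
		for row := 0; row < n; row++ {
			total += m[row*n+col]
		}
	}
	return total
}

func main() {
	const n = 1024
	m := make([]float64, n*n)
	_ = sumRowMajor(m, n)
	_ = sumColMajor(m, n)
}
```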

  • @rt1517
    @rt1517 6 months ago +3

    The video author should have disassembled the resulting binary to see if he would be able to figure out what the hell is going on.

  • @samhughes1747
    @samhughes1747 6 months ago +6

    EDIT: this is wrong. Only the comprehension-inlining bit made it into 3.12.
    Hey. Python doesn't have a loop problem anymore. There's still the "it's Python" problem, but Python 3.12 introduced loop unrolling.

    • @ruroruro
      @ruroruro 6 months ago

      source?

    • @samhughes1747
      @samhughes1747 6 months ago

      ​@@ruroruro, oof. Yeah, that didn't get approved for 3.12, and I got that mixed up with the comprehension-function inlining that did get released in 3.12.

  • @g.c955
    @g.c955 6 months ago +1

    He could do a follow-up by profiling the code to see where CPU and memory are spent, then analyzing the assembly output to understand why.

  • @swedishpsychopath8795
    @swedishpsychopath8795 6 months ago +4

    I'm planning to throw a SIMULA-67 party in 3 years to celebrate its 60 years as the first OOP language. Hate it or love it, the Norwegian SIMULA-67 invented Object Oriented Programming (OOP) with classes, inheritance, encapsulation, overloading, overriding, and all the other good stuff. It is crazy to think that this paradigm shift in programming is nearly 60 (SIXTY) years old already!! Press LIKE if you would like to get an early invitation!

  • @0dsteel
    @0dsteel 6 months ago +2

    Slices are just references to an array with set length and capacity. They are reallocated if the values wouldn't fit said capacity.
    On the other hand, arrays *may* be stack allocated, if the compiler is certain they don't escape the given scope and, well, fit in the stack frame.
    (this is not legal advice, please ask your friendly neighbourhood debugger)
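
    A small sketch of that reallocation behaviour (illustrative, not from the video): append returns a slice that may or may not share the old backing array, depending on capacity.

```go
package main

import "fmt"

func main() {
	a := make([]int, 2, 4) // len 2, cap 4
	b := append(a, 1)      // fits within cap: same backing array
	fmt.Println(&a[0] == &b[0]) // true: storage is shared

	c := append(b, 2, 3) // exceeds cap 4: new, larger backing array
	fmt.Println(&b[0] == &c[0]) // false: reallocated
}
```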

  • @miroslavhoudek7085
    @miroslavhoudek7085 5 months ago +1

    A guy watches a guy who doesn't know what's going on, while also not knowing what's going on, while people in chat suggest what might be going on, but they can't all be right. Good fun overall, would recommend, but don't know why.

  • @shapelessed
    @shapelessed 6 months ago +2

    Omg Go...
    I only had a week with the language when I tried it, but I love the errors-as-values concept. To the point that I started using it all over my JS.

    • @0dsteel
      @0dsteel 6 months ago

      haters gonna hate, but you mention their uncaught runtime exceptions and they'll be in shambles :)

    • @max3446
      @max3446 5 months ago

      I like errors as values but Go would be so much nicer if they had monadic operations on errors (or at the very least, some sort of shorthand for `if err != nil { return err }`).
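
      A hedged sketch of the shorthand being wished for: `try` below is a hypothetical generic helper (Go 1.18+ generics), not anything in Go's standard library, and a real design would propagate errors rather than panic.

```go
package main

import (
	"fmt"
	"strconv"
)

// try is a hypothetical helper: it unwraps a (value, error) pair,
// panicking on error instead of forcing an if-block at every call.
func try[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	// The idiomatic spelling today:
	n, err := strconv.Atoi("42")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(n)

	// The hypothetical shorthand:
	fmt.Println(try(strconv.Atoi("7")))
}
```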

  • @Nenad_bZmaj
    @Nenad_bZmaj 6 months ago +1

    A difference of ca. 250 ms for about one billion elements is to be expected if additional GC sweeps are triggered. He should have isolated the three scenarios and benchmarked them separately, with standard deviations, not consecutively in one test. It seems that an additional GC sweep kicked in sometime after the second test. Also, the "time" package, in my experience, may interfere somehow with either GC or the execution. There are better tools for benchmarking, although I do use "time" to get a rough sense.

  • @debemdeboas
    @debemdeboas 6 months ago +1

    reading contiguous memory is generally faster because the CPU may not have to deal with paging and address translation for every index

  • @Satook
    @Satook 6 months ago

    The 2nd form in part 1 is pretty easy for a compiler to optimise with loop-invariant code motion: x and y don't change, so the sub-expressions can be hoisted out of the innermost loop.

  • @yugioh8810
    @yugioh8810 6 months ago +1

    Using only the index with range is usually faster, especially if the array is long or holds heavy structs, since it eliminates the copy into the loop variable:
    for i := range x
    is faster than
    for _, val := range x
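
    A minimal benchmark sketch of that comparison (illustrative names; the element size is chosen to exaggerate the copy cost), runnable with `go test -bench=.`:

```go
package loops

import "testing"

type big struct{ payload [1024]byte }

var xs = make([]big, 1024)

// BenchmarkIndexOnly indexes into the slice: no element copy.
func BenchmarkIndexOnly(b *testing.B) {
	for n := 0; n < b.N; n++ {
		sum := 0
		for i := range xs {
			sum += int(xs[i].payload[0])
		}
		_ = sum
	}
}

// BenchmarkIndexValue may copy each 1 KiB element into v.
func BenchmarkIndexValue(b *testing.B) {
	for n := 0; n < b.N; n++ {
		sum := 0
		for _, v := range xs {
			sum += int(v.payload[0])
		}
		_ = sum
	}
}
```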

  • @defeqel6537
    @defeqel6537 6 months ago +1

    With optimization videos like this, I'd always want to know the size of the data, the theoretical memory bandwidth, etc., to have some actual context. There's no point trying to optimize things if you are already limited by the hardware (or the optimization would need to be radically different, at least).

  • @freedomextremist7921
    @freedomextremist7921 6 months ago +3

    It's a slice not a list. And Go does also have arrays which are contiguous regions in memory.

    • @Leeway4434
      @Leeway4434 6 months ago +2

      Yes, in Go arrays are stack-allocated and created with a fixed length: [5]int is an array of length 5 on the stack, while []int{1,2,3,4,5} is a slice of length 5 on the heap with a pointer to it on the stack. I believe there is a size limit for stack-allocated arrays; at some point the compiler puts them on the heap to avoid stack overflows.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@Leeway4434 All true.

    • @Leeway4434
      @Leeway4434 6 months ago +1

      @@Nenad_bZmaj I actually looked into it some more and it is not guaranteed for slices to be put on the heap. The compiler has heuristics for whether or not something should end up on the heap or the stack. Large values usually end up on the heap. Small values (including slices of sufficiently small constant length) can often be allocated on the stack.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@Leeway4434 Thanks. I didn't know that.

  • @Gusto20000
    @Gusto20000 6 months ago

    I did the same thing a week ago after 1.22 was released, benchmarking with testing.B, and measured exactly the same result in each case… So I decided a simple range is good enough. Bill Kennedy talks about "mechanical sympathy" in "Mastering Go" regarding this exact problem - it's all about low-level caching and prefetching by the CPU and compiler.

  • @miracleinnocent2649
    @miracleinnocent2649 6 months ago +1

    Go's allocations all depend on its escape analysis; sometimes even interfaces can be stack-allocated, and I think the maximum size for a stack allocation is 32 KB.

  • @MattBolt35
    @MattBolt35 6 months ago

    There are ways to pass flags to Go's compiler to show which values escape. Slices are typically heap-allocated. You can stack-allocate an array if you use a finite length (as he mentions).
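
    For reference, the usual way to get that report is the compiler's -m diagnostic, passed through go build:

```
go build -gcflags='-m' ./...
```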

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +2

      But only small arrays.

  • @AlexnderMisuno
    @AlexnderMisuno 6 months ago

    Once I implemented a Minesweeper game in Go using the "normal" approach to store data about the field, cell state, etc. Afterwards I did the same thing using bit-wise operations (just for fun). To my surprise, not only was the performance the same, but so was the memory consumption.

  • @georgehelyar
    @georgehelyar 6 months ago +2

    I'm not a go expert but it sounds like the reason 3 is slower than 2 and reverse is faster than forwards is because of bounds checking.

    • @defeqel6537
      @defeqel6537 6 months ago

      which I think is something go should probably be able to optimize away in cases like this

    • @georgehelyar
      @georgehelyar 6 months ago

      @@defeqel6537 Is it a jagged array, though? If len(a[0]) can be different from len(a[1]) then it could potentially go out of bounds, but again, I'm not a Go expert.
      C#, for example, has multi-dimensional arrays of contiguous memory, indexed with a[x,y,z], where the lengths of each dimension are guaranteed; or you can make a jagged array by making an array of arrays, where they aren't, and then you have to use a[x][y][z].
      Reverse being faster in several languages always seemed like something that could be optimised, though. Reverse is faster because the lower bound is always known to be 0 regardless of the array, so if it starts inside the bounds and never goes below 0, it doesn't have to check bounds on each iteration.
      And of course dynamic arrays can change size at runtime.
      Rust probably does bounds checking at compile time.
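
      A small sketch of one way Go programmers encourage bounds-check elimination (a known idiom, though whether it fires depends on the compiler version): one explicit up-front check that covers the whole index range.

```go
package main

// sumFirstN assumes 0 < n <= len(a); the one-line check below panics
// early if not, and lets the compiler drop the per-iteration check.
func sumFirstN(a []int, n int) int {
	_ = a[n-1]
	total := 0
	for i := 0; i < n; i++ {
		total += a[i]
	}
	return total
}

func main() {
	_ = sumFirstN([]int{1, 2, 3}, 3)
}
```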

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@defeqel6537 I believe that, too. The very typical difference of ca. 200 ms for arrays of this size tells me that an additional GC sweep was probably triggered.

  • @rumplstiltztinkerstein
    @rumplstiltztinkerstein 6 months ago

    The notion of "code optimization" is mostly popular because of JavaScript. We invented compilers so we don't have to worry about what the CPU is doing in every clock cycle.

  • @haraldfielker4635
    @haraldfielker4635 6 months ago +1

    I think the range works better with CPU caches (in that scenario).

  • @Nenad_bZmaj
    @Nenad_bZmaj 6 months ago

    Small arrays can be on stack, but slices are always allocated on the heap (in Go).

  • @michaelrobb9542
    @michaelrobb9542 6 months ago +1

    You can't put syscalls in for loops. Calls like time.Now are calls to the OS; this will produce inconsistent results on each run.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      True. There is a bigger for loop in this particular test, encompassing three tests.

  • @sotiriosmantziaris6097
    @sotiriosmantziaris6097 6 months ago

    The compiler decides based on escape analysis where to put things.

  • @joyboricua3721
    @joyboricua3721 5 months ago

    @6:22 somewhere a hacker says "Oh yes we could"

  • @awesomedavid2012
    @awesomedavid2012 6 months ago +1

    My assumption is that if you create a slice with a fixed compile time size and it doesn't outlive the function, Go will simply implement it as an array on the stack. The compiler knows its size and it knows how long it lives.
    Notice that even if you call append, it constructs a new slice. So the original could still be an array on the stack in implementation as long as you don't pass it outside of the function.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      Most arrays and ALL slices are allocated to the heap in Go.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +1

      Also, when you call append, the Go runtime doesn't necessarily create a copied array of twice the capacity. If there is enough capacity in the original array, appending to a slice that points to it doesn't create a new array. (As an aside, the reassignment a = append(a, el) does not declare a new slice.)

    • @what1heh3ck
      @what1heh3ck 6 months ago

      @@Nenad_bZmaj I thought append always creates a new slice (still pointing to the original array if it has enough capacity); am I missing something here?

    • @awesomedavid2012
      @awesomedavid2012 6 months ago +1

      @@Nenad_bZmaj I agree about the append thing. Though I still think the compiler could theoretically optimize a slice to be on the stack if you never call append, give it a constant size, and don't return it.
      When you say it's always on the heap in Go, do you mean that is how the main Go compiler implements it, or is that part of the Go standard?

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago +1

      @@awesomedavid2012 I remember that this was the standard behaviour, although not in the language specification, since it doesn't belong there, and it may change in the future without changing the language specs. Somebody just told me that he looked at this more closely and that the current compiler does keep smaller slices with the properties you describe on the stack.

  • @Jarikraider
    @Jarikraider 6 months ago +1

    Personally, I thought a 100ms improvement on 1 billion elements was pretty good.

  • @adambright5416
    @adambright5416 5 months ago

    The fact that he used time in one function instead of builtin benchmarks is just... Wow

  • @Nenad_bZmaj
    @Nenad_bZmaj 6 months ago

    Something I noticed: you'll get slightly worse performance if you write involved expressions instead of simple indexes. Hint: assign such an expression to a variable first, use that as the index, and let the compiler optimize it. Simple expressions are better placed directly in the index, e.g. a[i+3], but write ind := b[i] and then do something with a[ind], rather than doing something with a[b[i]]. Not a big deal, though, since the difference is marginal.

  • @blenderpanzi
    @blenderpanzi 6 months ago

    I'm at 1:53 - I think the 2nd one could be the fastest, because an intelligent compiler could inline len() and eliminate bound checks in that case. It can't eliminate bound checks in the last version. Let's see if my intuition is right.

  • @rosehogenson1398
    @rosehogenson1398 6 months ago

    Go has escape analysis, which can decide to place slices on the stack if it can guarantee they don't escape the current scope.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      True for some types of variables, not sure if this is true for slices. I think they are always on the heap.

  • @metropolis10
    @metropolis10 6 months ago

    I liked that he said the 100ms improvement wasn't worth it. How often do people admit the technically faster thing isn't worth the headaches?

  • @Nenad_bZmaj
    @Nenad_bZmaj 6 months ago

    Regarding some comments: I did try arrays instead of slices with length known at compile time, but performance was always slightly worse. In my experience, Go programmers can forget about arrays, since slices seem to be very well optimized. And I like using slices when I have to do something differently in the last iteration or in distinct subranges. So, instead of
    for i := 0; i < k; i++ {...}; for i := k; i < len(a); i++ {...},
    I can write:
    b := a[:k]; c := a[k:]; for _, el := range b {...}; for _, el := range c {...}.

    • @flga_
      @flga_ 6 months ago

      I'm not sure what you did, but when switching to arrays you need to be careful to avoid copies.
      There is no inherent reason for arrays to perform slower, other than that.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@flga_ I passed them by reference. Very small arrays, for instance small hash tables with short byte slices as elements, or arrays of ints. But passing pointers to them apparently always triggers a GC sweep. When I turn them into slices, the execution is faster overall by about the duration of one GC sweep, which on my computer, for several programs I wrote, apparently (inferred) takes about 200 ms on average.

    • @flga_
      @flga_ 6 months ago

      @@Nenad_bZmaj I don't think you were measuring things correctly then; I highly doubt GC is a factor.
      Passing a pointer to an array will require a nil check; that's the only difference, and it can be done by you up front instead of by the compiler per iteration.
      Put in other terms, using arrays is asymptotically equivalent to using a 1D slice, though for small sizes the distinction is unlikely to be meaningful, as the chance of it all being cached in either scenario is high.

  • @OnFireByte
    @OnFireByte 6 months ago +1

    really interesting that the flattened slice is a bit slower than the actual 3D slice

    • @CjqNslXUcM
      @CjqNslXUcM 6 months ago +2

      A slice is a pointer with a type and a length, so every time you cross between innermost slices you can jump to a cache miss. The flat slice is one fat blob of memory: no cache misses.

    • @OnFireByte
      @OnFireByte 6 months ago

      @@CjqNslXUcM that's why I'm surprised that the flattened one is slower

  • @RenThraysk
    @RenThraysk 6 months ago +1

    His problem is that the Go compiler doesn't perform loop unrolling; this is a deliberate decision by the Go devs. If he manually unrolled the loop he'd get an improvement.

  • @josefaguilar2955
    @josefaguilar2955 6 months ago

    These complications with loops are why I believe C/C++ will be around FOR a WHILE longer.

  • @MethodOverRide
    @MethodOverRide 6 months ago

    Write smooth easy code? Sounds like something that genius Tom would do with JDSL.

  • @bbrodriges
    @bbrodriges 6 months ago +1

    Slices can be stack-allocated. In fact, they are by default.

    • @metaltyphoon
      @metaltyphoon 6 months ago +3

      No they're not. The slice header, which is a pointer plus a length, will be stack-allocated, but the data won't be on the stack.

    • @antoniusupov
      @antoniusupov 6 months ago +1

      All pointed-to values in Go will go to the heap. Go check out the runtime package and see what is inside. Only objects of fixed length live on the stack; in Go even strings contain pointers in them and go to the heap.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      The so-called slice header will be on the stack (basically 3 numbers). It always points to the memory allocation on the heap.

  • @mannycalavera121
    @mannycalavera121 6 months ago

    We Gorupies now

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 6 months ago

    08:39 "a.length = 0" doesn't create a new object, so there's less stuff to GC

  • @k98killer
    @k98killer 6 months ago

    Go will store slices in stack memory if they are small enough (less than 2^16 bytes). Allegedly.

    • @masterchief1520
      @masterchief1520 1 month ago

      Thought it all depended on escape analysis?

    • @k98killer
      @k98killer 1 month ago

      @@masterchief1520 I think it is a mix, but it has been a while since I last looked into this, so idk for sure.

  • @FunkyELF
    @FunkyELF 6 months ago

    Python's range() is not like ranges in other languages. Python's range just gives back numbers; that's it. I think Go's range is more equivalent to general iteration in Python with the "in" keyword: `for thing in things: ...`

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 6 months ago

    09:19 "while" is usually a worse enemy

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 6 months ago

    08:55 "a[a.length]=1" is slower than "a.push(1)" in v8

  • @dejangegic
    @dejangegic 6 months ago +2

    I guess Go really is K.I.S.S. from the bottom-up

  • @mohamedaityoussef9965
    @mohamedaityoussef9965 6 months ago +1

    Blue hair gone nooooooo

  • @reactjs1900
    @reactjs1900 6 months ago

    I think slices in Go are like vectors and stored on the heap

  • @Arwahanoth
    @Arwahanoth 5 months ago

    I guess the range statement avoids bounds checking.

  • @davidzwitser
    @davidzwitser 5 months ago

    Meanwhile in APL languages you just write ‘matrix + number’ and it’s fully optimised

  • @justbrad_v3906
    @justbrad_v3906 6 months ago

    yeah, that sounds like buffer overflow hell lol

  • @arkie87
    @arkie87 6 months ago

    numba allows you to write fast Python loops by compiling them with LLVM
    also, am I the only one who pronounced numpy in their head as num-pee for the longest time?

  • @Zullfix
    @Zullfix 6 months ago

    Rather than trying to optimize the iteration, it sounds like his array needs to be split into partitions (or chunks, if you play Minecraft) to allow for multithreading his calculations.

    • @ErazerPT
      @ErazerPT 6 months ago

      Not everything can be parallelized. It all comes down to whether or not your next step depends on the results of the previous step. If not, you can; if so, you can't. Simple example: a loop that steps through a vector adding the previous element to the current one in place. You can't parallelize it because you need the previous step's result before you execute the next.

    • @Zullfix
      @Zullfix 6 months ago

      @@ErazerPT You don't parallelize by computing steps out of order, you parallelize by partitioning your space into chunks and computing each chunk on a separate thread, then combine the results. Obviously nothing is that simple, but at a surface level this is *the* technique used by professionals to create multi-threaded physics simulations and it very likely can be applied to whatever the guy in the video is trying to compute.

    • @ErazerPT
      @ErazerPT 6 months ago +1

      @@Zullfix Read the example I gave you. It CAN'T be parallelized. You CAN'T partition things if ANY of the subsequent operations depend on the results of their predecessors. As for physics simulations, be it rigid body dynamics or fluids, search around and you'll find a simple "some parts can, some parts can't" answer. Think of it like this: in a perfectly inelastic collision of billiard balls in a perfect line, only the cue ball and the last ball will move. But the last ball can't know it will move before the one before last knows it should try to move. And the one before last won't know... etc. Things like "grass simulation" can be done IF you take some "artistic license", e.g. assuming wind interacts with grass but grass blades don't interact with each other. THAT is when you can partition things, because the wind is a common entity and each blade of grass does its own thing.

    • @atijohn8135
      @atijohn8135 6 months ago

      @@ErazerPT For your example, you could still optimize it by adding up parts of the array in parallel, then for each part adding in the final element of the previous one. This would be faster for large enough arrays, since you can further parallelize the subsequent block adds no problem.
      Many small dependencies like that can be eliminated by experienced parallel programmers; it's really more common than you seem to think. You can't just declare an algorithm unparallelizable because "not everything can be parallelized".

    • @ErazerPT
      @ErazerPT 6 months ago

      @@atijohn8135 Jesus Mary, you people really don't give up... YOU CAN'T parallelize that example (and be more efficient). THINK. If one thread is adding a[0] to a[1], another CAN'T add a[1] to a[2] because it DOESN'T know what a[1]'s final value will be before the first thread finishes.
      But let's say we ignore that, and have a CPU with unlimited (real) threads. So, one does a[0]+a[1], while another does a[1]+a[2], etc., up to a[n-1]+a[n]. When they have all finished, you now have to run another batch, look at the differentials from the previous ops, and do it all over again for a[1] to a[n]. But... it has a "carry effect", so you have to do it again for a[2] to a[n], then for each element up to a[n-1]+a[n].
      Congratulations, you just parallelized something and ended up being less efficient than a single thread, because you still had to run n-1 steps, do more work per step, and waste more memory because you had to keep track of both old and new values to calculate the differentials. Great job.
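
      For what it's worth, the two-pass chunked scheme @atijohn8135 describes is the standard parallel prefix sum; a minimal Go sketch (illustrative, not from either commenter; uses the min builtin, so Go 1.21+):

```go
package main

import (
	"fmt"
	"sync"
)

// parallelPrefixSum turns a into its running totals in two passes:
// phase 1 scans each chunk independently in parallel; phase 2 computes
// one serial offset per chunk and adds the offsets back in parallel.
func parallelPrefixSum(a []int, chunks int) {
	n := len(a)
	size := (n + chunks - 1) / chunks
	var wg sync.WaitGroup

	// Phase 1: local scans, no cross-chunk dependency.
	for c := 0; c < chunks; c++ {
		lo, hi := c*size, min((c+1)*size, n)
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo + 1; i < hi; i++ {
				a[i] += a[i-1]
			}
		}(lo, hi)
	}
	wg.Wait()

	// Phase 2: each chunk's offset is the running total of the
	// chunk sums before it (one value per chunk, cheap to do serially).
	offsets := make([]int, chunks)
	for c := 1; c < chunks; c++ {
		offsets[c] = offsets[c-1] + a[min(c*size, n)-1]
	}
	for c := 1; c < chunks; c++ {
		lo, hi := c*size, min((c+1)*size, n)
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi, off int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				a[i] += off
			}
		}(lo, hi, offsets[c])
	}
	wg.Wait()
}

func main() {
	a := []int{1, 2, 3, 4, 5, 6, 7, 8}
	parallelPrefixSum(a, 3)
	fmt.Println(a) // [1 3 6 10 15 21 28 36]
}
```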

  • @gavipk
    @gavipk 5 months ago

    solution: c++

  • @someman7
    @someman7 6 months ago

    Whoa, his parentheses are different colors. It only took us what, 30 years to see "such" progress in syntax highlighting?

  • @AlexanderNecheff
    @AlexanderNecheff 6 months ago +1

    I used to believe the compiler knew best, and especially that any bug in the code was my fault, never the compiler's fault: then I started having to use the Intel compilers...
    And oh boy howdy - some of the worst compilers ever. Poor standards compliance, lots of weird bugs in both the front and back end, horrible diagnostics to boot.
    I still generally think the compiler usually knows best, but now I have a disclaimer attached to that rule of thumb.

  • @GermanClaus
    @GermanClaus 6 months ago +2

    Try out some Kotlin ;)

  • @martinpata2899
    @martinpata2899 6 months ago

    I like the idea, but he should be using Go's benchmarks for these tests

  • @yoloswaggins2161
    @yoloswaggins2161 6 months ago +2

    Can't you just disassemble the binary and look at what they're actually doing?

  • @lancemarchetti8673
    @lancemarchetti8673 6 months ago

    Nice

  • @alexpyattaev
    @alexpyattaev 6 months ago +1

    1 billion elements... The problem is likely memory-bound. Use a smaller data type and it will run faster.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      Yeah. Also, the ratio of used to total memory influences how frequently the GC runs (kinda; it is more complicated). A GC sweep for this size typically takes 200 ms (on my old comp with an old single-core Athlon CPU).

    • @alexpyattaev
      @alexpyattaev 6 months ago +1

      @@Nenad_bZmaj GC does not traverse an array of integers. As far as the GC is concerned, it is one "leaf" object in the object graph, so it does not get traversed, no matter how big it is. Now, if it contained other objects like strings, the GC would have to walk it.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@alexpyattaev I am not talking about the GC traversing an array, but about the GC mark-and-sweep cycle occurring at certain points during execution, at which times "the world is frozen". The way he structured this test, you don't know when the GC was active, so benchmarking three test cases consecutively in one outer loop is not going to be informative for comparisons...
      I mentioned the size dependency because the program runs longer on larger arrays and the GC runs more times. I see that in my programs. Although these programs involve quite complex data processing, the mere size of the input array dictates how many times the GC will run, or for how long in total it freezes the world. When my input slices point to an array of 50 MB, one GC sweep takes about 50 ms on average, and when the array is 250 MB, about 250 ms (on my very old comp).
      I assume that when each innermost slice is done, its array can be marked for sweeping, since it is not going to be used anymore, and so on upwards.

    • @alexpyattaev
      @alexpyattaev 6 months ago

      @@Nenad_bZmaj Oh I see, with more memory allocated in general it just runs GC more often. That is crazy: the more RAM you use, the slower your code runs (as it does more pauses)? Brrr....

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@alexpyattaev It is highly optimized, though, so it uses the fewest cycles possible for the amount of garbage it needs to collect. Plus, Go gives you options for the GC so you can tune it for your specific program during the build.
      If your program approaches the memory limit, the Go GC will take that into account when calculating how often it must sweep, and adjust its cycles so you don't run out of memory. You can opt out of this behavior. When you have enough memory, the GC doesn't take a large fraction of total time even if you have a lot of garbage. If you have a multithreaded CPU, you will almost never notice the GC.

  • @lukekurlandski7653
    @lukekurlandski7653 6 months ago

    These are dynamic arrays, right? I feel like you should be able to call a .contiguous() function or something that moves all of the elements of the tensor into a contiguous arrangement. Then again, I’m guessing there is some other data structure in Go that is closer to a traditional array or something…?

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      Yes, an array: a := [k]something, as opposed to a slice: a := make([]something, k). But in my experience, slices perform better.

  • @jony1710
    @jony1710 6 months ago

    ITT nobody knows what the stack is

  • @GaborGyebnar
    @GaborGyebnar 6 months ago

    Case 5 is fastest because the code is buggy: y is never used. It's just reading the same row multiple times; cache hits make it fast, but the result is incorrect.

  • @huge_letters
    @huge_letters 6 months ago

    Wouldn't it be faster to access the underlying slice at each loop layer? Instead of
    loop x
      loop y
        loop z
          t += arr[x][y][z]
    you do
    loop x
      xa = arr[x]
      loop y
        ya = xa[y]
        loop z
          t += ya[z]
    That way you do the x and y accesses only when needed and not on every iteration - if you're saying memory access is likely the issue.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      No difference. You still have to access the elements of the array via xa; xa and arr[x] are both just pointers, and the array is one and the same. The compiler will transform your code into the one in the video anyway, because xa and ya are just transient variables of slice (pointer) type.

    • @huge_letters
      @huge_letters 6 months ago

      @@Nenad_bZmaj I mean that in option 1 you do the x access x*y*z times, while in option 2 you do it only x times.
      Same for y: y*z times vs y times.

    • @Nenad_bZmaj
      @Nenad_bZmaj 6 months ago

      @@huge_letters Oh, I wish. But it's simply not possible. No matter how you write it, the array contains x*y*z elements, and you need to access them all.

  • @nonaligned293
    @nonaligned293 6 months ago

    react equals pain

  • @dickheadrecs
    @dickheadrecs 5 months ago

    uh excuse me i prefer my memory malloc carte

  • @bloomingbridges
    @bloomingbridges 6 months ago

    🐼

  • @dixztube
    @dixztube 6 months ago

    Man, the new React/Next stuff sucks! I just popped out an htmx and Go project. Silky smooth, fast and sexy. I've been bumming around this React nonsense for two months now after getting decent at the last version of React.

  • @nevokrien95
    @nevokrien95 6 months ago

    Pandas is slow, use numpy

  • @ofekshochat9920
    @ofekshochat9920 6 months ago

    arr[a][b][c] = arr[a * (second_dim * third_dim) + b * third_dim + c]. this shouldn't have hops...

    • @leeroyjenkins0
      @leeroyjenkins0 2 months ago

      They're slices, so you can append to them and thus they can move. They're not one contiguous block of memory. If you want contiguous memory you use fixed-size arrays.
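
      A minimal sketch of that flat indexing (illustrative type and names, not from the video): one contiguous slice with computed offsets instead of nested slices.

```go
package main

import "fmt"

// grid3 stores a dx*dy*dz grid in one contiguous slice.
type grid3 struct {
	dy, dz int
	data   []int
}

func newGrid3(dx, dy, dz int) *grid3 {
	return &grid3{dy: dy, dz: dz, data: make([]int, dx*dy*dz)}
}

// at returns a pointer to element (x, y, z) in the flat backing slice.
func (g *grid3) at(x, y, z int) *int {
	return &g.data[x*g.dy*g.dz+y*g.dz+z]
}

func main() {
	g := newGrid3(2, 3, 4)
	*g.at(1, 2, 3) = 42
	fmt.Println(*g.at(1, 2, 3)) // 42
}
```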

  • @ryankimeu9013
    @ryankimeu9013 6 months ago +1

    i am an arch user

  • @BlitzDjingga
    @BlitzDjingga 6 months ago

    This is a really shocking result, but I do believe the second one is the better approach most of the time, while the third one is so super-optimized that it is tiring to code and maintain.
    Too bad the video does not show memory performance for this. I believe what is happening is that the second and third ones operating on the HEAP makes their performance slower, while the first one is lucky enough not to encounter memory overflow. The thing is, when you do `range` it makes a copy of the object; albeit with a 3D array, maybe the host machine has RAM big enough to account for all of that iterating.
    It would be more educational if it showed the garbage collection and memory/space performance for each of the methods.

  • @pyaehtetaung
    @pyaehtetaung 6 months ago +1

    Black?

  • @deathzombee
    @deathzombee 6 months ago +1

    First

  • @spartan_j117
    @spartan_j117 6 months ago

    Hahahaha Prime! So you DO reupload your old videos! Now that's lame...

  • @LengCPP
    @LengCPP 6 months ago

    admit it you watch your own videos

  • @svetlinzarev3453
    @svetlinzarev3453 6 months ago

    Golang is such a crappy language

  • @adjbutler
    @adjbutler 6 months ago

    But Python is strongly typed and JS is weakly typed... why do you hate on Python so much? It is not as slow as it used to be....