Apple M1 Ultra & NUMA - Computerphile

  • Added 7 Jun 2024
  • Apple's latest M1 chip is two older chips bolted together; Dr. Steve Bagley explains how they made it work the same as a single chip.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments • 389

  • @50PullUps
    @50PullUps 2 years ago +717

    This entry is pure gold. Please make more vids where the latest tech is a jumping off point for the main topic.

    • @Stopinvadingmyhardware
      @Stopinvadingmyhardware 2 years ago

      Where did I apply for that?

    • @oskrm
      @oskrm 2 years ago +12

      That's the thing, this is not the latest tech.

    • @nezbrun872
      @nezbrun872 2 years ago +4

      NUMA's not new; it's been a facet of multi-socket Xeon systems for many years, for example, and of other architectures before that. The battle has always been to make the interconnect interfaces (QPI/UPI in Intel speak) as quick as possible to maximise performance. Software like RDBMSs is NUMA-aware to optimise workload across sockets (and hence memory domains).

    • @darkidz24
      @darkidz24 2 years ago +1

      It could really take this channel to the next level!! Explaining modern day tech

    • @SilentlyContinue
      @SilentlyContinue 1 year ago

      Yes! Helps with understanding real world application.

  • @TechTechPotato
    @TechTechPotato 2 years ago +314

    Intel's EMIB, similar to ultra fusion, in Sapphire Rapids adds additional latency of 5-8 nanoseconds. This makes the core-to-core latency go from 54 worst case to 70 worst case. Apple's situation is similar, with similar bandwidth per connection. We expect the latency to be an additional 5-8 nanoseconds also. Ultrafusion is using TSMC's InFO_LSI manufacturing.

    • @Joseph_Roffey
      @Joseph_Roffey 2 years ago +70

      But the difference is one is called “random string of letters” and the other is called “Ultra Fusion” 😍

    • @eddyecho
      @eddyecho 2 years ago +44

      @@Joseph_Roffey huh? More like one is a "stupid marketing name that really doesn't describe the underlying mechanism" and the other is called "embedded multi-die interconnect bridge"

    • @landspide
      @landspide 2 years ago +18

      @@Joseph_Roffey And begins with "We call this..." and is filled with "... only at Apple can we ..."

    • @shunyaatma
      @shunyaatma 2 years ago +3

      Any numbers for AMD (Zen 2 and 3) 2-socket systems with and without xGMI cables?

    • @egor1g
      @egor1g 2 years ago +14

      yeah, but it is ARM vs x86, 256 channel memory against 6 and also efficiency cores, also video memory... so not really the same!

  • @doctorpex6862
    @doctorpex6862 2 years ago +3

    Netflix gains most of its speed from "video is not available in your country"

  • @NinjaAdorable
    @NinjaAdorable 6 months ago +1

    This has been one of the most intuitive and elegant explanations for NUMA I have ever heard!! Kudos

  • @BenjyP.
    @BenjyP. 2 years ago +26

    I read ML instead of M1, so I thought this would be a video on how the neural cores work. I would love a video on how to use the Apple neural cores for machine learning, as they already take up 20% of the entire chip's area

  • @prla5400
    @prla5400 2 years ago +5

    Back to you, Steve

  • @markholm7050
    @markholm7050 2 years ago +66

    Can one still purchase green lined, perforated line printer paper or are you working off an old stock? That stuff was great for physics homework. Worked pretty well in line printers, too.

    • @sajukkhar
      @sajukkhar 2 years ago +5

      Dot matrix paper is still sold.

    • @rabidbigdog
      @rabidbigdog 2 years ago +22

      I'm convinced there is a warehouse in Nottingham that is full of nothing but that tractor paper, just for Computerphile.

    • @davidgillies620
      @davidgillies620 2 years ago +5

      You can buy a couple of thousand feet of the green ruled stuff for about forty quid from any wholesale stationery supply store.

    • @arpanmajumdar617
      @arpanmajumdar617 2 years ago +12

      I think they are still available at Dunder Mifflin.

    • @heisen9460
      @heisen9460 2 years ago +2

      @@arpanmajumdar617 lol

  • @paulledak291
    @paulledak291 2 years ago +195

    Nice explanation of how NUMA architecture is implemented. However, you stated that the reason for moving to this architecture is because as you add more and more cores, you increase the probability of memory collisions. But then you completely forgot to explain how having 2 memory banks reduces the probability of the memory collisions that you would still get as you add the more processors. It would seem to be the most essential element needed for this video which is completely missing. (Yes I understand that now there are 2 memory banks with twice the bus bandwidth but this is never explained. And there are different interleaved memory architectures which could increase the memory bandwidth without resorting to NUMA)

    • @bberakable
      @bberakable 2 years ago +2

      Agree 100%

    • @mytech6779
      @mytech6779 2 years ago +21

      It's not bandwidth that's at issue; simultaneous access is the issue. This allows the banks to be accessed in parallel. It's like using a network bridge to make two Ethernet subnets. Which I just realized is a really outdated reference, as nobody uses shared-media networks anymore.
      But basically all computers on a subnet could hear all packets on that subnet, as it was physically one solid wire, and as more nodes were added you would get more chance of collisions and congestion (a non-linear increase). So you chop it in two with a bridge (like a filter of sorts) so only about half of the total traffic can be seen, because only packets addressed to the other subnet are passed through the bridge.

    • @Sandeep-cz7ls
      @Sandeep-cz7ls 2 years ago +2

      @@mytech6779 Wait, I'm still confused. How does this allow the banks to be accessed in parallel? Is it due to the interconnect?

    • @valshaped
      @valshaped 2 years ago +9

      @@Sandeep-cz7ls Each bank can be accessed by one CPU at a time
      More banks -> more CPUs at a time

    • @MaulikParmar210
      @MaulikParmar210 2 years ago +8

      @@Sandeep-cz7ls To keep it simple: in modern CPUs, or let's say a CPU cluster, there's a memory controller inside each cluster that makes requests on behalf of the physical CPU die. But in NUMA there are multiple clusters acting on their own, so there are multiple access points through which different CPUs reach different (or the same) memory banks.
      When two controllers try to access the same bank and location at the same time, that parallel access can cause a lot of data inconsistencies on simultaneous reads and writes from different CPUs, unless it is handled at the software level so that the software is aware of such an architecture. The OS knows the memory space, and the kernel is generally responsible for making sure each CPU's request is translated in the proper order to the proper physical location, using translation tables or other hardware assists depending on what's available. In NUMA these are more complex, as each node has to communicate and coordinate exactly what it needs; that's where the connecting fabric comes in, which provides the crucial functions to get data in and out of foreign clusters.
      Keep in mind that when we talk about software here, it's mostly OS-level software and not consumer APIs, as consumer APIs abstract these traits away: your software never knows, or has to care, whether it's running on 1 core, 4 cores, or 12 cores across 2 CPU sockets. In the eyes of userspace, resources are unified, unless you want to optimise, in which case you can of course ask the system to allocate memory near a resource. That's the job of the OS: to maintain and abstract hardware and allow controlled access via syscalls or driver APIs.
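The bank-parallelism point in the replies above can be sketched with a toy model (purely illustrative, not real memory-controller arbitration): each bank serves one request per cycle, so splitting requests across two banks cuts the number of stalls.

```python
from collections import Counter

# Toy model: each CPU issues one memory request per cycle.
# A bank can serve only one request per cycle; the rest stall.
def stalled_requests(requests, num_banks):
    """requests: addresses issued in one cycle.
    The bank is chosen by address modulo num_banks."""
    per_bank = Counter(addr % num_banks for addr in requests)
    # Each bank serves exactly one request; extras stall this cycle.
    return sum(count - 1 for count in per_bank.values())

# Four CPUs hitting addresses spread across the space:
reqs = [0x1000, 0x2001, 0x3000, 0x4001]
print(stalled_requests(reqs, 1))  # one bank: 3 requests stall
print(stalled_requests(reqs, 2))  # two banks: only 2 stall
```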

  • @TheMrKeksLp
    @TheMrKeksLp 2 years ago +7

    Modern CPUs are Harvard architectures only in the most pedantic classification. Instructions are still kept in main memory; they just have separate level-1 instruction and data caches. Even levels 2 and 3 are shared...

  • @petrilaakso7927
    @petrilaakso7927 2 years ago +1

    Excellent explanation of NUMA, excellent work🙏🏼

  • @jaopredoramires
    @jaopredoramires 2 years ago

    The camera and lighting on this one look incredible

  • @grahmn886
    @grahmn886 2 years ago +1

    Lesson of the day. Thanks as always, Steve :)

  • @SaiPhaniRam
    @SaiPhaniRam 2 years ago

    Excellent presentation .. Simple and easy to understand 👏

  • @kuroexmachina
    @kuroexmachina 2 years ago

    this channel is gold. always has been

  • @as-qh1qq
    @as-qh1qq 2 years ago +105

    Why does making the interconnect (distributed shared memory) super-fast not bring back the original problem that we were trying to solve - increased memory access collision with increased CPUs? After all, if far away CPUs can access memory in nearly the same time as the nearby ones, how is it any different than just one memory with all near and far CPUs connected to it ?

    • @ssvis2
      @ssvis2 2 years ago +11

      It probably would reintroduce the problem. However, I would suspect there is some trickery under the hood of the OS working with the hardware to optimize data locality to keep the data on the "near" memory for any core. It's possible that part of it is memory mapping in the data interconnect so that memory on the "far" chunk could still be viewed as local to a core, and the super fast interconnect effectively negates the performance penalty that a traditional NUMA system would have.

    • @samuie2
      @samuie2 2 years ago +20

      I agree that it was not super clear in the video. I think you could still have that issue; however, it happens half as often since you have 2 banks of memory.

    • @davidgillies620
      @davidgillies620 2 years ago +6

      I would guess that it means you don't _have_ to tune data affinity (which makes development/deployment easier and therefore cheaper) but you _can_ if you want (which gives you the benefits of an optimised NUMA configuration).

    • @ssvis2
      @ssvis2 2 years ago +3

      @@davidgillies620 I'm thinking the same thing. By optimizing specific parts of the system, Apple has theoretically designed something that will perform really well in 99% of use cases. There's always more performance to squeeze out, but with severely diminishing returns.

    • @gajbooks
      @gajbooks 2 years ago +4

      UltraFusion is really just a memory... Fusion. Their memory gets twice as fast since they have twice as many banks, they just need a way to combine the M1 chips so that both of them can use the other's memory at high speeds. There was probably some tradeoff with the memory controller or packaging which made them need 2x64 rather than having external 128 GB. I imagine their real Mac Pro replacement will have external memory and GPU.

  • @user-cc8kb
    @user-cc8kb 2 years ago +1

    Great explanation. Thanks!

  • @ipurelike
    @ipurelike 2 years ago

    thanks for the technical explanation!

  • @aipsong
    @aipsong 2 years ago

    Excellent, instructive video - thanks!

  • @JJ-fq3dh
    @JJ-fq3dh 2 years ago

    Great video, brings back memories of coding on an SGI Origin 2000 and IRIX

  • @danielsilva158
    @danielsilva158 2 years ago +7

    Would’ve been good to touch on how this memory system interfaces with the gpu!!

  • @shaneclk9854
    @shaneclk9854 2 years ago

    Excellent video

  • @Derbauer
    @Derbauer 2 years ago

    Nicely explained!

  • @Sierra-Whisky
    @Sierra-Whisky 2 years ago +2

    What an excellent explanation! And what a coincidence too. I tried to explain NUMA and the potential performance hit on the exact same day this video was published, but obviously my explanation was nowhere near as clear as this one. 🤣
    Thanks! I'll share it with my colleagues.

  • @sholinwright6621
    @sholinwright6621 2 years ago +6

    Don't you still have to write code to distribute the memory hits across the two memory banks? Otherwise you just get the same multi-core stalling effect mentioned earlier. The speed-up was the ability to partition core memory fetches into two batches, preventing all of the cores stalling while trying to fetch from the same bank. Side note: I work on a radar with 11 CPU cards, with an 88000 on each and 2 MB of local RAM, with the collection tied to 2 global memory cards with 8 MB of RAM. GRAM memory fetches are really expensive.

  • @OscarBerenguerPV
    @OscarBerenguerPV 2 years ago

    This was a great video

  • @AL-vc9xc
    @AL-vc9xc 2 years ago

    Wow, very well and simply explained. I'm not in a math profession, but I did understand this quite well! Thank you!!

  • @vernonthomas6554
    @vernonthomas6554 2 years ago

    Love your channel.

  • @tomdchi12
    @tomdchi12 2 years ago +5

    Doesn't Apple provide the compilers (and IDE), so couldn't they be baking in the modifications to the code required to manage the non-uniformness of memory access times? (Regardless, early benchmarks indicate that performance is scaling only a little short of linearly with the number of cores, so we can infer that memory access across the two halves of the "fused" CPU isn't creating major delays.)

  • @KipIngram
    @KipIngram 2 months ago

    It's worth noting that the PCI ports are usually also split into these two domains, so you want to take that into account as well.

  • @RegitYouTuber
    @RegitYouTuber 2 years ago +1

    Favourite bit of this was the chaotic side-angle crash zoom - really complements the desperate addition of "well of course it's more complex than this, but" that seems necessary these days

  • @user-cx2bk6pm2f
    @user-cx2bk6pm2f 2 years ago

    Finally!! I understand NUMA.. thank you !

  • @circuitgamer7759
    @circuitgamer7759 2 years ago +1

    Video idea (because I don't know where to look for this) - some of the finer details of caching implementation. I understand the idea behind caching, and the structure behind it, but not how it's actually implemented. I want to learn the actual control logic for reading/writing cache lines, and when and how it gets updated to/from RAM or a higher level cache. Do the CPU cores control the caches directly, or is there some control logic for each cache that isn't a part of a specific core?
    I think it would be an interesting video, but if there's already one that exists that I missed, can someone reply with a link? I've only been able to find high-level explanations so far.

  • @qwertypnk9401
    @qwertypnk9401 1 year ago

    Nice, good job!

  • @dembro27
    @dembro27 2 years ago +4

    Cool stuff. But now I have "Numa Numa" in my head...

  • @henrikjensen3278
    @henrikjensen3278 2 years ago +2

    Good explanation, but I would like some explanation of reads and writes, i.e. two threads reading and writing the same memory location. This would be easy enough to handle between the two sides, but what about two CPUs on the same side with their own caches? It sounds like a lot of circuitry to handle that.
    Are there some smart solutions?

    • @ClarkCox
      @ClarkCox 2 years ago +2

      That is indeed a problem that must be contended with. Look up "cache coherence"
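A minimal sketch of the write-invalidate idea behind cache coherence (a hypothetical two-state toy, far simpler than real protocols like MESI): before one core's write completes, every other core's cached copy of that line is discarded, so no core can read a stale value.

```python
# Hypothetical write-invalidate protocol: before a core writes a
# line, all other cores' copies of that line are invalidated.
class Cache:
    def __init__(self):
        self.lines = {}  # address -> cached value

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def write(self, writer, addr, value):
        # Broadcast an invalidate so no core keeps a stale copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)
        writer.lines[addr] = value

c0, c1 = Cache(), Cache()
bus = Bus([c0, c1])
c1.lines[0x40] = "old"       # core 1 holds a cached copy
bus.write(c0, 0x40, "new")   # core 0 writes the same line
print(c1.lines.get(0x40))    # None: core 1's copy was invalidated
```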

  • @nameunknown007
    @nameunknown007 1 year ago

    Love you man!

  • @mysteriousm1
    @mysteriousm1 2 years ago +4

    Was there an earthquake during filming or why is it so shaky?

  • @kelvinluk9121
    @kelvinluk9121 2 years ago +1

    Is it possible to address the RAM access conflict issue between different CPUs by introducing more memory channels?

  • @SimonJentzschX7
    @SimonJentzschX7 2 years ago +4

    Great video, I learned something new! Just one question: could the operating system optimize my code when executing? So when I allocate memory, the OS should know which CPU this process is running on and allocate the memory in RAM that is faster to access. This way the code doesn't need to change, just the OS.

    • @mr_waffles_the_dog
      @mr_waffles_the_dog 2 years ago +4

      OSes already tend to do this :D
      The problem is what happens when you have multithreaded code (e.g. running on multiple cores/CPUs at once): there is no one ideal block of memory for the OS to allocate. The Apple claim is that their system is non-NUMA, or at least sufficiently fast to be indistinguishable, so developers don't have to rearchitect things to maximize performance.

  • @IceMetalPunk
    @IceMetalPunk 2 years ago +17

    Apple: "M1 ULTRA FUSION!"
    Reality: "It's a fast wire junction."

    • @G5rry
      @G5rry 2 years ago +10

      Reality: No, it's a bit more than that.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu 2 years ago

      If it were so easy, everyone would make a 10TB/s interconnect 😂
      It's a lot more complex than that.

    • @giornikitop5373
      @giornikitop5373 2 years ago

      @@RunForPeace-hk1cu It IS actually fairly straightforward to make a 10TB/s interconnect, but the cost is beyond crazy. Besides, you need a CPU of such power to take advantage of it, so the cost makes even less sense. So the reason is not that they cannot make it; the reason is they don't need to, at least not yet.

  • @torb-no
    @torb-no 2 years ago

    In the Fujitsu A64FX is the CMG (Core Memory Group) like one of these groups talked about in the video? So if you’re on one node in one of them, trying to get data from memory connected to another CMG will be slower?

  • @wile123456
    @wile123456 1 year ago +1

    Maybe you've done it before, but I would love a video explaining video games vs rendering/productivity workloads.
    Games get a big performance boost from more cache: the 5800X3D 8-core CPU increased performance a lot by more than doubling level-3 cache with 3D stacking. But why does it mostly only benefit games and not other workloads?

  • @bosco4533
    @bosco4533 1 year ago

    I love this channel. /message.

  • @salmiakki5638
    @salmiakki5638 2 years ago +13

    *It's only the first two generations of Threadripper CPUs that have 2 NUMA nodes.
    The last one and both generations of Threadripper Pro have unified memory access

    • @romevang
      @romevang 2 years ago

      The Threadripper 2990WX has 4 NUMA nodes. The 2950X, I think, has 2.

    • @salmiakki5638
      @salmiakki5638 2 years ago

      @@romevang Thanks, I thought I remembered it was the same throughout the range

  • @iammakimadog
    @iammakimadog 2 years ago

    Thank you!

  • @SproutyPottedPlant
    @SproutyPottedPlant 2 years ago +1

    That was great! When you showed the bus arbiter it reminded me of the Sega Mega Drive! It’s got one of those??

  • @Yoda2000ful
    @Yoda2000ful 2 years ago +1

    Amazing, I wish I had a teacher like you in the microprocessor classes of my degree ❤️

  • @gorunmain
    @gorunmain 2 years ago

    This is great!

  • @bentationfunkiloglio
    @bentationfunkiloglio 2 years ago

    Quite informative.

  • @JohnnyWednesday
    @JohnnyWednesday 2 years ago +19

    Thank you kindly Dr. Bagley for sharing your knowledge with us. I'm quite surprised that Intel and AMD have not yet pushed for on-die memory given the M1's impressive demonstration

    • @SimonVaIe
      @SimonVaIe 2 years ago +12

      It does have some negative consequences: more expensive to produce, not expandable, and if one thing breaks the whole thing is broken. I also don't know how much expertise would be required in RAM design/production (keep in mind that Apple is far bigger than Intel, which is far bigger than AMD), seeing as there is a very well established ecosystem of memory manufacturers (they do have quite extensive cache systems on their CPUs already; I don't know how well that translates). And not every task profits as much from faster RAM. No idea if those are major reasons for AMD and Intel, but like for everything else it's just a matter of finding what best fits the job.

    • @dotted1337
      @dotted1337 2 years ago +12

      On-die RAM is rather limiting, so it won't really work well for either AMD or Intel to make such a product: that kind of RAM is much too slow, in terms of both bandwidth and latency, for use as a cache, and if used as RAM you'd have the same problem this video is talking about. But Intel had the i7-5775C back in 2015 with 128MB of eDRAM for the onboard GPU, which was also used as an L4 cache, and Intel's upcoming Sapphire Rapids Xeon will have a version with 64GB of on-package HBM2E with a bandwidth of well over 1TB per second. And finally you have AMD with their V-Cache, supposedly having a bandwidth of about 2TB per second. tl;dr: Apple can do on-die memory because they know exactly who their customers are and can make almost tailor-made SoCs for them, whereas AMD and Intel have customers much too diverse to make on-die memory viable.

    • @JohnnyWednesday
      @JohnnyWednesday 2 years ago

      @@dotted1337 - Thank you for your detailed reply; I was unaware of the i7-5775C - that smells like it could have been designed for use in a console, given the perceived similarity to previous Xbox memory layouts. It is my understanding that a large part of the M1's 'boost' over other ARM designs is the lower-latency access to system memory?
      Perhaps naive, but if such performance can be gained for an ARM chip, then should not a similar ratio of performance be seen with a similarly designed x86 chip?
      With ultra-fast streaming devices and multi-channel paradigms like the PS5's SSD controller, could we not see a slowing of average memory capacity for users? Perhaps the time for a fixed 16GB of memory on a CPU is now? Especially given the console generations are locking game engine technology advancements for years at a time?

    • @harshpatel9020
      @harshpatel9020 2 years ago +3

      I think this is because they use DDR in their desktop models (and not laptops, because laptops come with both) and not LPDDR as used in Apple's M1 line-up.
      In mobile processors, where both DDR and LPDDR are used, the RAM is mounted on the PCB (soldered onto the motherboard, not onto the die itself as you said is the case with Apple).
      Note: many things I said may turn out to be wrong, so it would be better to cross-check before drawing any conclusions. I would be happy to learn where I am wrong and learn something new. Thank you

    • @mytech6779
      @mytech6779 2 years ago +1

      On-die memory is called L1 cache; L2 and L3 caches are often placed on die as well. In fact over 80% of late-generation CPU silicon area is taken up by on-die memory.
      (NB4: yes, the 386 had off-die L1, but it was 1986)

  • @itsMunchkin
    @itsMunchkin 2 years ago

    Wow! Powerfully explained.

  • @Xiaomi_Global
    @Xiaomi_Global 2 years ago

    How about the same architecture but a different fab interconnect process? Does it affect performance?

  • @Hooorse
    @Hooorse 1 year ago

    Thank you

  • @tomahzo
    @tomahzo 2 years ago

    Nice video! One question would be how much of this is done purely in hardware and how much is informed by the compilers, system frameworks and the OS as a whole. Apple has the advantage that they build the hardware and the full OS stack, whereas players like Intel and AMD cannot pick and choose which OSes they want to support, so whatever they do must be fully realized in hardware (although the OS vendors do need to support the hardware features on offer). So does this mean that Apple uses some system-software tricks to accelerate the interconnect and to reduce the latency, maybe through the way that the CPUs and their associated memory are partitioned and how threads are scheduled across the cores to minimize traffic through the interconnect?

  • @michaellatta
    @michaellatta 2 years ago

    I would guess RAM is attached to each die and cache is on that die, with the interconnect used for off-die access to the other die's cache/RAM.

  • @RAJATTHEPAGAL
    @RAJATTHEPAGAL 2 years ago

    Another hypothesis is Apple's Rosetta layer possibly working to translate instructions to accommodate the memory layout - perhaps tapping in between the OS kernel-level calls and the application layer to translate memory allocation and instruction placement to be co-located in the same memory. I mean, Rosetta emulation is fast; I wouldn't be surprised if they use it for this purpose. It won't be a silver bullet, but it's a bullet they may add for solving the memory placement issue.

    • @magicmark3309
      @magicmark3309 2 years ago +1

      I wouldn't think so. Rosetta only installs once you install software that can't natively run on M1. I think that'd be adding too much overhead to an already somewhat costly translation layer, although I've seen it really depends on the particular software.
      It also helps that Apple has a very large piggy bank for their R&D and that they plan everything so far out - hence why iPhones are just now getting high refresh rates. Hopefully this will give new life to competition in the market.

  • @johongo
    @johongo 2 years ago +1

    I want to learn more about this stuff, but it seems very distant, even as someone who programs for work. Any advice?

    • @MrPBJTIME12
      @MrPBJTIME12 2 years ago +1

      Computer Organization & Architecture - William Stallings

  • @Benny-tb3ci
    @Benny-tb3ci 1 year ago

    We, the people in chemistry and any other science that relies heavily on chemistry, have a very nice phrase for these kinds of things. It's called the "rate-limiting step" (in a chain of reactions).

  • @edmondhung6097
    @edmondhung6097 2 years ago

    But what is more important in this NUMA case: latency or bandwidth? And to push performance to the absolute limit, is it still better to use local memory instead of remote, even though Apple promoted the interconnect as having more bandwidth than its memory bandwidth?

  • @kriptofinans2864
    @kriptofinans2864 1 year ago

    Very clear thx :)

  • @bmitch3020
    @bmitch3020 2 years ago

    Is this at least part of the reason motherboard instructions specify which slots should be used for various numbers of RAM chips?

    • @R3BootYourMind
      @R3BootYourMind 2 years ago +1

      No, the slots are numbered because of how RAM is accessed in parallel. Dual-channel memory is usually the maximum consumer CPUs can handle, and the "dual" part is electrically wired to work best with certain slots. Using the "wrong" slots would make two RAM sticks work either in single-channel mode or in dual-channel but over longer traces. The slightly longer traces belong to the non-preferred slots that are used when 4 sticks are in use, and they can affect memory overclocking results.

  • @sevilnatas
    @sevilnatas 1 year ago

    Does Computerphile often use greenbar paper for their illustrations because they still use greenbar a lot, so it is handy? Or is it because they don't use it anymore, so they have a bunch of it sitting around unused and might as well use it for illustrations?

  • @caffedinator5584
    @caffedinator5584 2 years ago

    My naive understanding of CPU architecture leads me to believe that the core-to-core memory interconnect is the lesser of the problems vs the GPU core kernel/instruction execution.
    Do you have any insight into that?

  • @asmerhamidali9679
    @asmerhamidali9679 2 years ago

    Please make some videos on RISC-V. Lately it has been a hot topic.

  • @debojitmandal8670
    @debojitmandal8670 2 years ago

    Wait, but Apple isn't using a distributed shared memory like you mentioned.
    Rather, a CPU from one block can access the memory of the other CPU block directly, without even going through the distributed shared memory lane - at least that's what I understood from their presentation.
    There is no middleman like the shared distributed memory lane you mentioned.
    Please correct me if I am wrong

  • @1idd0kun
    @1idd0kun 2 years ago +1

    No matter how fast the interconnect is, it's never gonna behave like a UMA system. If a core in die 1 tries to access the memory pool attached to die 2, there will be a latency penalty. We won't know how big that latency penalty is and how much of an impact on performance it will have until the system is properly tested. I'm hoping AnandTech will test it, since they usually do memory latency tests.

    • @bobo-cc1xw
      @bobo-cc1xw 2 years ago

      Ian Cutress, formerly of AnandTech, said above that it's 5 to 7 ns for just the interconnect vs 54 ns total. So call it 15 percent more latency

  • @peterhindes56
    @peterhindes56 1 year ago

    Why have a memory interconnect at all, then? Unless this was not intended to solve the problem mentioned about memory access getting clogged up.

  • @tcornell05
    @tcornell05 1 year ago +3

    This might be the most informative video I've come across in years on YouTube. You have an amazing way of articulating topics like this to the ADHD and dyslexic programming community, like myself xD. Now I'm dying for a follow-up on how exactly they managed to make the distributed shared memory link so fast. Any resources you recommend?

  • @genhen
    @genhen 2 years ago +2

    I've always wondered: if we access more than one NUMA node's worth of memory, how does the memory get chunked up? Take half and half? Take most from one? Is it hardware dependent? Software/OS dependent?

    • @katbryce
      @katbryce 2 years ago +1

      On my Threadripper motherboard, there is the CPU, and either side of it, there are four memory slots for a total of 8. The 4 slots on one side are one NUMA node, and the 4 slots on the other side are the other NUMA node.

    • @PoseidonDiver
      @PoseidonDiver 2 years ago +1

      Also, there is no true virtual-to-physical CPU affinity, and the hypervisor generally allocates compute to the VM as needed; when running performance graphs you can see big spikes across the sharing CPUs when it's allocating compute from another node. (Hope that actually answers your question :p )
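One common answer to the chunking question above is page interleaving, where consecutive pages are assigned to nodes round-robin (this is what `numactl --interleave` requests on Linux; the default policy instead allocates on the touching CPU's node). A minimal sketch of the round-robin mapping:

```python
PAGE_SIZE = 4096  # typical 4 KiB pages

def interleave_node(addr, num_nodes):
    """Round-robin page interleaving: consecutive pages land on
    consecutive NUMA nodes, spreading an allocation evenly."""
    return (addr // PAGE_SIZE) % num_nodes

# A 4-page allocation spread over 2 nodes takes half from each:
nodes = [interleave_node(page * PAGE_SIZE, 2) for page in range(4)]
print(nodes)  # [0, 1, 0, 1]
```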

  • @kirtanmusica1999
    @kirtanmusica1999 2 years ago

    Namaskar, thanks for the education, thanks for the light

  • @roryskyee
    @roryskyee 2 years ago

    very technical

  • @jurabondarchook2494
    @jurabondarchook2494 2 years ago

    Hmmm.
    But if you make the distributed shared memory system super fast, you end up with the same problem as in the beginning.
    When the distributed shared memory system needs to access memory, the CPUs attached to that memory have to wait, don't they?
    So the probability of collision increases again.

  • @marklonergan3898
    @marklonergan3898 2 years ago +4

    Maybe I'm not understanding the problem correctly, but couldn't you just have a rudimentary controller sitting between the two that uses the most significant bit of the address to determine which RAM chip has the data? That way, with the controller between the chips as the central access point, all queries would take the same amount of time to fetch the data, and by having this logic at the hardware level you would add minimal latency.
    I know this would only work on chips that are the same size, but you could combine composites with singles (i.e. 2x 32s connected with a controller could be combined with an actual 64 with a controller)

    • @Addlibs
      @Addlibs 2 years ago +6

      This suffers the same slowdowns, which result from physically separate RAM locations: close to some groups of CPU cores but not as close to others. Even if the most significant bit picked the RAM module without any fancy chips in the way, fetching data from farther down the line is going to be generally slower, and it's easy to double or triple the tiny amount of time it takes to fetch data with computers this compact and fast. That is, 4 nanoseconds is twice as long as 2 nanoseconds - both are incredibly fast, though.

    • @katbryce
      @katbryce Před 2 lety +1

      @@Addlibs Remember that a 4GHz CPU runs 4 clock cycles every nanosecond, and in a nanosecond light travels about 30cm. Electricity is slower, so any round trip of more than about 3cm isn't going to happen within a clock cycle.
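
    The most-significant-bit decode proposed above can be sketched in a few lines. Everything here is illustrative (a hypothetical two-node, 16 GiB address space), not Apple's actual memory map:

    ```python
    # Hypothetical NUMA address decode: the top bit of a physical address
    # selects which of two memory nodes services the request.
    ADDR_BITS = 34            # 16 GiB physical address space (illustrative)
    NODE_BIT = ADDR_BITS - 1  # most significant bit picks the node

    def home_node(addr: int) -> int:
        """Return which of the two memory nodes owns this address."""
        return (addr >> NODE_BIT) & 1

    def local_offset(addr: int) -> int:
        """Offset within that node's 8 GiB of RAM."""
        return addr & ((1 << NODE_BIT) - 1)

    # Addresses in the lower half go to node 0, the upper half to node 1.
    assert home_node(0x0000_0000) == 0
    assert home_node(1 << NODE_BIT) == 1
    assert local_offset((1 << NODE_BIT) | 0x100) == 0x100
    ```

    As the replies note, this decode is cheap; the cost that makes the system NUMA is the physical distance to whichever node the decode selects.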

  • @newburypi
    @newburypi Před 2 lety +7

    Think I missed something here. Totally got the "was slow but Apple made it fast." However, I think there's a promise of "won't need to change the software." The NUMA method requires knowledge of which memory block has the desired data — hence, a change to software. So... did they also build a way to hide the fact that there are two memory blocks?

    • @elliott8175
      @elliott8175 Před 2 lety +11

      The reason NUMA systems usually require the software developers to be aware of the positioning of CPUs and memory is because of the slower speeds when fetching data from memory that is farther away. However, the new M1 chip claims to make fetching data fast enough for the worst-case RAM position to still not cause any slow-down.
      I assume this means that the difference in time to fetch memory that is close, compared to memory that is far away, is less than a clock cycle. So from the core's point-of-view they have the same latency.

    • @newburypi
      @newburypi Před 2 lety

      @@elliott8175 great. Thanks for the clarification. Thought I missed something.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu Před 2 lety +2

      @@elliott8175 the “trick” is literally the hardest part that no one could solve 😂
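
    The "less than a clock cycle" suggestion above can be put into rough numbers. These figures are illustrative assumptions (interconnect estimates of a few extra nanoseconds are discussed for similar chiplet links, e.g. in the pinned comment), not measured M1 Ultra latencies:

    ```python
    # Rough cost of a remote fetch across a hypothetical chiplet interconnect.
    clock_ghz = 3.2   # assumed core clock
    extra_ns = 7.0    # assumed extra interconnect latency for a remote access

    extra_cycles = extra_ns * clock_ghz
    # ~22 core cycles of extra latency per remote fetch -- far more than one
    # clock cycle, but small next to a full DRAM access of ~100 ns.
    assert round(extra_cycles) == 22
    ```

    So "no slow-down" is more plausibly about the penalty being small relative to a main-memory access than about it fitting inside a single cycle.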

  • @jaffarbh
    @jaffarbh Před 2 lety

    Things get more complicated as we use cloud-based virtual machines, where we often have no idea about the underlying hardware and architecture. If I recall correctly, VMware's hypervisor dynamically reallocates memory blocks to optimise for more localised access, so that software developers don't need to worry about it.

  • @qm3ster
    @qm3ster Před 2 lety

    No CPU gets data "before it needs it" :v
    Going to main memory is really, REALLY slow (compared to anything else CPUs spend time doing these days).
    So, are any cache layers shared between the chiplets?

  • @steve1978ger
    @steve1978ger Před rokem

    If I were to guess how they did this, I'd say they've made their memory bus expandable in the first place, like having an extra bit on the address bus etc.

  • @centerfield6339
    @centerfield6339 Před 2 lety

    I don't really understand this - if the NUMA architecture lets you access the other bank as fast as local memory, then doesn't the original contention problem become an issue again? I thought that's what the video would end with, given it was teed up like that.

  • @pierreabbat6157
    @pierreabbat6157 Před 2 lety

    How do the CPUs handle it when two separate CPU chips, each with a cache, try to *write* to the same location? This can happen if the location is a mutex.

    • @katbryce
      @katbryce Před 2 lety

      This shouldn't happen. It does though, very frequently, and is the cause of many race-condition bugs and security vulnerabilities.
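
    To the mutex question above: in hardware, the cache-coherence protocol (e.g. MESI-style) makes one cache the exclusive owner of the line before a write completes; in software, the same location is typically guarded with a lock so concurrent read-modify-write sequences don't lose updates. A minimal sketch of the software side:

    ```python
    import threading

    counter = 0
    lock = threading.Lock()  # serialises the read-modify-write below

    def bump(n: int) -> None:
        global counter
        for _ in range(n):
            with lock:       # take ownership before touching the shared word
                counter += 1

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert counter == 200_000  # with the lock, no updates are lost
    ```

    Without the lock, the two `counter += 1` sequences can interleave and drop increments, which is exactly the hazard the coherence protocol alone does not protect against at the software level.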

  • @5urg3x
    @5urg3x Před 2 lety

    I very clearly remember the days of dual socket (like multiple physical CPUs with their own memory) workstations. It looked cool on paper, but in the real world, it usually didn't work out very well. Many times, even with software optimizations, it was more efficient (and simpler logistically) to just use one physical processor, rather than to attempt to have them both working together on the same task or set of tasks, and having to swap data in and out of cache and memory, etc. For servers, it could work, but most workstation workloads just aren't going to benefit from that type of an architecture.

  • @BR-lx7py
    @BR-lx7py Před 2 lety +2

    Can the operating system take care of always allocating memory from the block that is closer to where the process that is requesting it is running? I know it's not perfect, but would work 90% of the time

    • @moritzhedtke8139
      @moritzhedtke8139 Před 2 lety +1

      Linux actually does as far as I know

    • @ssvis2
      @ssvis2 Před 2 lety +1

      To a certain extent, yes the OS can. However, in order to do that effectively, it needs some information about memory requirements and usage patterns for a process. Some can be gleaned from the raw byte code, especially if there are hints placed by the programmers, but a lot will come from actually running the process, then dynamically remapping and moving memory as needed. It'll work better for long-running processes, but is by no means optimal. That's why most super high performance programs, such as many video games, manually set CPU core affinity and utilize custom memory allocators to provide direct control over memory locality. They'll even go as far as detecting which are fast and slow cores and prioritize from there.

    • @shunyaatma
      @shunyaatma Před 2 lety +1

      Yes, the Linux kernel can take care of this. The default memory policy (MPOL_DEFAULT) makes the page allocator always try to allocate memory from the local node but if that's not possible, it uses a different node. Over time, even if pages get scattered across NUMA nodes, Automatic NUMA Balancing will either try to move the pages to the node from where they were accessed the most or try to move the program itself to run on a CPU that is close to the memory that it accesses the most.
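
    The manual affinity approach mentioned above (pinning a process so the scheduler can't migrate it away from its memory's node) can be sketched on Linux with the standard library. This is a minimal, Linux-only illustration, not a full NUMA policy:

    ```python
    import os

    # Pin the current process to CPU 0. Under Linux's default first-touch
    # policy, memory the process then allocates tends to come from the NUMA
    # node that CPU 0 belongs to, keeping compute and data together.
    os.sched_setaffinity(0, {0})

    assert os.sched_getaffinity(0) == {0}
    ```

    From the shell, `numactl --cpunodebind=0 --membind=0 ./app` goes further, binding both the CPUs and the memory allocations of a whole program to node 0.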

  • @X_Baron
    @X_Baron Před rokem

    Ultra Fusion is basically Blast Processing, but more extreme and rad.

  • @jfmezei
    @jfmezei Před 2 lety +2

    Great to find someone who remembers NUMA !!
    BTW, you forgot to deal with cache coherence: core 1 modifying contents at a memory location that is also in core 2's cache.
    In the 1990s, Digital tried to scale its Alpha computers to many cores with its Wildfire-class machines. They found that 4 CPUs was the max the memory controller could handle before performance increments stopped being interesting. So they created the Wildfires with 4-CPU "QBBs" that were boards, connected by what Digital called a switch. NUMA access between these QBBs was atrocious.
    This was dealt with at the operating system level, less so at the application level. You could pre-load shareable images onto a specific QBB and then launch processes that use them on that QBB, so they would use local memory for shareable images etc. But this was nowhere near enough.
    Digital then worked on the next-generation Alpha, the EV7, which was delayed as long as possible because Compaq/HP, who had bought Digital, didn't want the EV7 to beat the pants off the Intel Itanium heat generator.
    The EV7 introduced a totally new memory controller that remained state of the art beyond the death of Alpha. HP donated Alpha IP to Intel, which used it for its CSI interconnect (later called QuickPath), and it evolved from there. Ex-Alpha engineers went to AMD, who developed their own version, and many ex-Alpha engineers formed P.A. Semi, which was purchased by Apple to create its own ARM chips. The EV7 had coherent cache (and I believe only IBM's POWER had this until AMD matched it). Intel's QuickPath did not implement coherent cache initially (despite having all the IP from DEC).
    If you google for Alpha Wildfire NUMA, you will find a result "Optimizing for Performance on Alpha Systems - Semantic..." by Norm Lastovica. It provides some then-current memory access figures showing the differences between direct and NUMA accesses in the Wildfires. Page 26 also shows the EV7 memory architecture as a fabric. (21364 is the EV7 CPU; the first generation was the 21064.) Each CPU controlled a part of RAM, but because CPU 1 could request memory from CPU 2 at the same time as CPU 3 requested from CPU 4, CPU 5 from CPU 6, etc., it ended up having a huge performance advantage when scaling the number of cores.
    There was also the issue of CPU speed vs memory speed. Alpha came to surpass memory speed easily, hence the 4-CPU limit Digital found in the 1990s. But when you increase memory speed (and it has increased tremendously since then), it lets you increase the number of CPUs that have direct access (especially lately, when "Moore's Law" has been more about adding cores than making each core faster).
    Before their demise, Digital engineers would present at DECUS conferences and provide much information about Alpha advancements and how they improved things. It is a real shame that Apple hides all the real information and only provides marketing gobbledygook that is useless.

  • @dustinmorrison6315
    @dustinmorrison6315 Před 2 lety

    Hopefully my programs are not fetching instructions from RAM often enough for it to matter. Hopefully they're somewhere in the L1i, L2, or L3 caches.

  • @DalasYoo
    @DalasYoo Před 2 lety

    CCIX Hooray!

  • @autohmae
    @autohmae Před 2 lety +1

    I wonder if Linux scheduler already has a variable for the latency, so no new code is needed. My guess would be yes.

  • @larrystone654
    @larrystone654 Před 2 lety

    So if I understand this correctly, the Ultra architecture makes it so developers don’t *have to* split their code across cores, but I suppose they *could* in order to achieve even more performance?

    • @magicmark3309
      @magicmark3309 Před 2 lety +1

      Unless you really really really know what you’re doing you’d probably just cause more bottlenecks. No real point with these Macs, unless you were clustering them, which would be cool but probably not a great use of resources when it comes to all the enterprise solutions.
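
      To the question above: splitting work across cores is something developers already do on any multicore chip, NUMA or not. A minimal sketch of the structure (using threads purely to show the partitioning; for CPU-bound Python work you'd use processes, since CPython's GIL prevents threads from running Python bytecode in parallel):

      ```python
      from concurrent.futures import ThreadPoolExecutor

      N = 1_000_000
      NWORKERS = 4

      def partial_sum(rank: int) -> int:
          # Each worker sums its own strided slice of the data, so the
          # slices partition range(N) with no overlap.
          return sum(range(rank, N, NWORKERS))

      with ThreadPoolExecutor(max_workers=NWORKERS) as pool:
          total = sum(pool.map(partial_sum, range(NWORKERS)))

      assert total == N * (N - 1) // 2  # same answer as summing it all at once
      ```

      On a NUMA-aware system you could go further and pin each worker near the memory holding its slice, but as the reply says, that only pays off if you know the topology well.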

  • @vladomaimun
    @vladomaimun Před 2 lety +2

    Does application software need to be NUMA-aware, or does the OS kernel handle everything NUMA-related?

    • @JamesClarkUK
      @JamesClarkUK Před 2 lety +3

      The OS could do scheduling to keep your application on one NUMA node. You can use numactl on Linux to tell the kernel what you want to happen.

    • @RunForPeace-hk1cu
      @RunForPeace-hk1cu Před 2 lety

      The whole point is that it's a HW implementation and no software needs to be changed.

  • @andredejager3637
    @andredejager3637 Před 2 lety

    wow thanks 😊

  • @bumbixp
    @bumbixp Před 2 lety +12

    Doesn't the OS scheduler largely handle this? Even if you make a single threaded app, Windows will move it around on different cores but it stays within the same NUMA node.

    • @Pyroblaster1
      @Pyroblaster1 Před 2 lety +1

      Let's say you allocate a buffer and load data into memory in a single thread and then start many threads to process that data, which is a perfectly reasonable and usual way to do things with uniform memory access. Then, if you saturate the system with threads, half or more of the threads will run on NUMA nodes different from where the data buffer was allocated, incurring the longer access times. You have to explicitly handle the allocation and data loading so that the data is distributed in a way that the threads processing each part of the data are on the same NUMA node as that data.

    • @Piktogrammdd1234
      @Piktogrammdd1234 Před 2 lety +1

      Yes and no. There are mitigations at every level to compensate for the problems, but every solution falls short of an idealised system with endless memory, zero latency, and no collisions. OS schedulers try to co-locate data and the corresponding processes, but limits remain: any time processes on a node need more memory than is available locally, or processes are migrated to other nodes, things get problematic.

    • @ivanskyttejrgensen7464
      @ivanskyttejrgensen7464 Před 2 lety

      The OS tries to handle this, but it's not perfect. E.g. the last time I dealt with this, the OS tried to serve memory allocations from the nearest memory but wouldn't move it around afterwards. So we ended up using processor sets to direct processes to start on the "right" part of the CPUs, so the subsequent memory allocations could all be served from the local memory. That gave a 10-15% speedup compared to leaving it to the OS to figure things out.

  • @radutopor8389
    @radutopor8389 Před 2 lety +3

    I still don't get why splitting the RAM in two wouldn't cause the same collision problem with a high number of CPUs, given they effectively still share just one bus, albeit connected by some black box in the middle.

    • @YeOldeTraveller
      @YeOldeTraveller Před 2 lety +2

      Because the two NUMA regions are separate, an access in one region does not impact access in the other. Even without coding for it, you reduce the likelihood of collisions.

  • @JCBOOMog
    @JCBOOMog Před 2 lety +4

    Hi steve

  • @yashkumarsingh9713
    @yashkumarsingh9713 Před 2 lety

    10:39 Why does the CPU attached to one RAM bank need to go to the other?

  • @Ojisan642
    @Ojisan642 Před 2 lety +1

    Was this filmed on board a ship at sea?

    •  Před 2 lety

      😭😭

  • @heaslyben
    @heaslyben Před 2 lety

    Is this video about 3-CPU? Do you speak Bocce??

  • @TheOisannNetwork
    @TheOisannNetwork Před 2 lety

    Nice lights 😉

  • @MattyHild
    @MattyHild Před 2 lety

    Interesting video, but I'm afraid it doesn't fully touch on how NUMA solves the bus-contention issue of UMA-type systems, especially with the implication that you don't need to program appropriately for a NUMA system on the M1 Ultra. If you program agnostically to the non-uniform memory, you effectively have a UMA system again.
    I get that adding a second RAM bank boosts bandwidth, but why not use HBM-style memory instead, especially stacked-die HBM?

  • @marcomaida1731
    @marcomaida1731 Před 2 lety

    I don't understand how the collision topic fits with the explanation of NUMA and the M1. It looks like once we have this very fast DSM, we are back at the problem from the beginning, that is, we will have many collisions.

  • @l.matthewblancett8031
    @l.matthewblancett8031 Před 2 lety

    WHERE DID YOU FIND THAT 1972 printer paper??!?!! lol.

  • @fernandoblazin
    @fernandoblazin Před 2 lety

    When was the last time I saw that type of paper