CPUs Are Out of Order - Computerphile

  • Published 18 Jan 2018
  • Spectre and Meltdown exposed holes in the hardware implementation of CPUs, but what exactly are the exploits targeting? Dr Bagley dives into the detail.
    Cache video : • Why do CPUs Need Cache...
    EXTRA BITS: • EXTRA BITS - CPUs & Sp...
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments • 331

  • @flamencoprof
    @flamencoprof 6 years ago +142

    I'm 67 yo. I'm amazed that the training I was given in the 80's on early microprocessors, combined with the fun I had writing op-codes for my Commodore 64 enabled me to follow this. Thanks for the instruction!

  • @ShawSumma
    @ShawSumma 6 years ago +377

    He is a really slow C compiler. I'll stick to my usual command, GCC.

    • @ShawSumma
      @ShawSumma 6 years ago +100

      Very verbose also

    • @leberkassemmel
      @leberkassemmel 6 years ago +75

      And he only supports ARM! And not Open Source!

    • @klaxoncow
      @klaxoncow 6 years ago +89

      I don't know.
      He's telling us what he's doing and explaining it all, so doesn't that technically make him an open source compiler?

    • @leberkassemmel
      @leberkassemmel 6 years ago +57

      Or someone just set the -v flag.

    • @moshly64
      @moshly64 6 years ago +2

      JSR $DEADBEEF

  • @NikiHerl
    @NikiHerl 6 years ago +43

    I have a request / maybe constructive feedback: I think it would be neat if you could update / create new Computerphile playlists. There are tons of videos I'd like to rewatch, but it's a bit of a pain to look for them one by one. Specifically, I'd want to rewatch all the explanations of exploits/security breaches, for example.

  • @Xulfer
    @Xulfer 6 years ago +112

    "...that we talked about in the caching video, many years ago" *cuts to video clip of the same shirt*

    • @Zivudemo
      @Zivudemo 6 years ago +19

      Consistency is something that is sorely needed on YT.

    • @BeoandIsa
      @BeoandIsa 6 years ago +20

      not the same shirt, look closer...

    • @ryke_masters
      @ryke_masters 5 years ago +3

      Not actually the same shirt, but there is more than a passing resemblance...

    • @felipemartins6433
      @felipemartins6433 3 years ago +4

      _tom scott wants to know your location_

    • @chswin
      @chswin 2 years ago

      That’s how you know he is the real deal…

  • @tarcal87
    @tarcal87 6 years ago +84

    _"using the Computerphile paper in a _*_radically_*_ different orientation"_
    such a rebel :D

  • @Thompson8200
    @Thompson8200 6 years ago +59

    I'd love to see an explanation of 'side-channels' and how you turn a timing of a memory operation into a specific value from memory.

    • @talleddie81
      @talleddie81 6 years ago +15

      The timing is not turned into a value. The timing of the operation is used to determine whether the CPU read the value from the cache or main memory.

    • @mduckernz
      @mduckernz Před 6 lety +8

      talleddie81 And, to add further to this: if you know when sensitive (e.g. kernel) operations are being executed, you can figure out where they're actually stored, bypassing ASLR. This takes some time, as it's a pretty noisy side channel, but it can be pretty effective. It may take many such probing operations to gather data, but since billions of operations are executed per second, it doesn't take much real time to get some interesting data, and the longer you run it, the more precisely you can home in on your target address.

    • @Thompson8200
      @Thompson8200 6 years ago

      Since a computer might have 16+ GB of RAM how do you even start to get an idea of where in the memory you need to be looking if all you know is that it did have to hit the RAM due to the timing?

    • @talleddie81
      @talleddie81 6 years ago +4

      As Matthew Ducker said, it is possible to break the ASLR. What you then can figure out is where the user data and kernel data are stored in RAM. As far as figuring out what specific data is stored at each address, that is a very difficult and complicated topic. As far as your original question, the timing is only used to determine where the data came from. Knowing that the data came from the cache can be a clue to an attacker that the data was from a previous operation. In the case of an attack, this previous operation could be a memory read forced by the attacker that should not have occurred.

    • @Thompson8200
      @Thompson8200 6 years ago +1

      Thanks for the replies!
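
To make the thread's explanation concrete, here is a toy Python model of a cache-timing probe (purely illustrative: the cycle costs, the 256-entry probe range, and the "victim" access are all invented, and no real memory or exploit is involved):

```python
# Toy model of a cache-timing side channel (illustration only, not an
# exploit). Assumed costs: a cache "hit" takes 1 cycle, a "miss" 100,
# mirroring the cache-vs-main-memory gap described above.
HIT_CYCLES, MISS_CYCLES = 1, 100

class ToyCache:
    def __init__(self):
        self.lines = set()

    def read(self, addr):
        """Return the simulated cost of reading addr, then cache it."""
        cost = HIT_CYCLES if addr in self.lines else MISS_CYCLES
        self.lines.add(addr)
        return cost

def probe(cache, addrs, threshold=50):
    """The attacker times reads: a fast read means the line was cached."""
    return [addr for addr in addrs if cache.read(addr) < threshold]

cache = ToyCache()
secret = 0x2A                      # secret-dependent index
cache.read(secret)                 # the victim's access warms one line
leaked = probe(cache, range(256))  # the attacker times all 256 candidates
print(leaked)                      # only the victim's line reads fast
```

As talleddie81 says, the timing itself is never "the value"; it only says cached or not-cached per address, and the attacker infers the secret from *which* address reads fast.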

  • @martinkunev9911
    @martinkunev9911 6 years ago +6

    Assuming integers, some more time can be saved if the multiplication is done earlier (it can run in parallel with the load instructions).

  • @erikengheim1106
    @erikengheim1106 3 years ago +2

    Nice job! I had to click through a few explanations before I got to this one. Went straight to the point and kept me engaged, without getting buried in technical details.

  • @edmundkorley8892
    @edmundkorley8892 6 years ago +23

    Thank you for mitigating the screeching sound of the markers!

  • @tsmupdatertsm7633
    @tsmupdatertsm7633 6 years ago +4

    Thanks a lot for your work! I really like these videos with Dr. Bagley. He explains everything very well, and the deep level of how computers work is very interesting.

  • @sebastiankumlin9542
    @sebastiankumlin9542 4 years ago +1

    It's just amazing how much time goes into making these videos. Thank you!

  • @magnum333
    @magnum333 6 years ago +8

    What a great channel, thank you for this.

  • @scatterlogical
    @scatterlogical 6 years ago +8

    I think the unfortunate situation (like any security) is that this is not a pure computing problem, but a human one. Imagine how much more efficient computers and networks could be without the overhead of dealing with untrustworthy influences. :/

    • @carlosgarza31
      @carlosgarza31 5 years ago

      A hardware bug that allows user-level programs access to kernel space, or to other user-level processes' memory address space, defeats the purpose of having virtual memory security in the first place. We should all be outraged that speculative execution doesn't block cache memory writes for instructions that failed the branch prediction. From what I can tell, engineers were well aware of this problem but ignored it, because they assumed the seemingly random nature of cache page reading and writing would be difficult to exploit, and because of the extra cost of blanking out a cache page, or blocking the writing of that cache page, during a failed branch prediction. People wanted faster recovery from a failed branch prediction for marketing their CPUs. Now they've got more marketing by selling Spectre/Meltdown-proof CPUs.

  • @DavidHamby-ORF-48
    @DavidHamby-ORF-48 6 years ago

    Nicely presented. I thought of the CDC 7600 designed by Seymour Cray as you were using the Acorn RISC machine in your example. The 7600 was superscalar & pipelined with a multiply unit, divide unit, adder, load/store unit, all 60 bit floating point. Integer operations were 48 bits using the same units but exponent fixed at zero. The Fortran compiler did critical path scheduling of expression evaluation in code generation. An instruction word stack handled decode and issue. Tight loops fit in the IWS and executed without instruction fetch.

  • @debanikdawn7009
    @debanikdawn7009 6 years ago +26

    "I'm out of order?! You're out of order! The CPUs are out of order!"

    • @bwzes03
      @bwzes03 6 years ago +5

      Debanik Dawn If I was half the CPU I used to be, I'd take a pipeline to this place!
      Out of order ? Who do you think you are talking to? I've been around you know!

    • @SproutyPottedPlant
      @SproutyPottedPlant 6 years ago

      The Orona lift (elevator) is out of service! Out of order! Press the alarm button

  • @nullptr.
    @nullptr. 6 years ago

    Thanks for explaining how that works! great editing

  • @larryg2320
    @larryg2320 6 years ago +1

    Since Dr. B is right-handed I would like to recommend that the camera be located over his left shoulder instead of his right.
    Love the shows.

  • @appychd
    @appychd 6 years ago +9

    Very well explained

  • @KnightRiderDDR
    @KnightRiderDDR 6 years ago +33

    It's really funny how for the past 20 years no one mentioned this issue, but now when it is known the comment section of every video about Meltdown and Spectre is full of experts on the matter.

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago +8

      When the X86 architecture started out more than 40 years ago, the design was entirely open, and exploiting flaws was trivially easy. Security features have been added in layers over the last few decades, while maintaining backward compatibility of instruction sets and memory addressing modes. At the same time numerous enhancements have been added, all adding to overall complexity.
      This is not how you would design a secure CPU from the ground up, and it does not surprise me when vulnerabilities proliferate. Trading-off speed and convenience, versus security and robustness, is seldom a winning strategy.
      On a personal note, some of us old fogeys were around 20-30 years ago, writing low-level machine code and understanding how the CPU worked, and well aware of (some of) the vulnerabilities.

    • @0xCAFEF00D
      @0xCAFEF00D 6 years ago

      Well, I have a similar surprise. Not that people know about it, but that I've seen multiple new, popular, programmer-friendly sources on pipelining and how it works just this year, before Spectre and Meltdown. It's an odd coincidence and I wonder what the catalyst is. Maybe it's just me being human and seeing patterns where there are none. But CppCon had a talk covering it just now in 2017, and I can't recall any other talks that have. I've watched those a lot.
      I was introduced to this in 2013-2014, I think.

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago +1

      One popular issue that underlies this is the simple question: why should I upgrade to an expensive new CPU when, due to heat dissipation limits, the maximum clock speed is pretty much the same as last year's model?
      Moore's Law has not ended, but it continues to be implemented in ways that are not obvious to the layperson. With previous generations of processors the differences were large and quantifiable. Now it's all about cache size, incremental improvements, and reduced power consumption.
      IMO discussing these fundamental factors has forced the topic of speculative execution into the public consciousness, whereas it was previously known only to a limited number of geeks...

    • @KnightRiderDDR
      @KnightRiderDDR 6 years ago

      It is kind of strange how this issue was revealed just when, according to some, we have reached the limit of traditional CPUs (silicon chips). If it is not a mere coincidence, I can speculate that now that silicon chips can't get more powerful at the same rate as before, CPU makers will have to find another way to pitch us their new products: "Look at our new CPU. It is not more powerful than our previous ones, but it has a new architecture and is not vulnerable to Meltdown and Spectre, so you'd better buy it!" But this is ONLY speculation. I have my doubts that Intel would be willing to lose so much stock value over this.

    • @HenryLoenwind
      @HenryLoenwind 6 years ago +4

      This general issue has been known for a long time---cryptographic processors are hardened against it. Those things aren't used because they are faster (often they are not), but because they take extra measures against a variety of out-of-band timing attacks. This is just the first time someone looked for and found a way to exploit it on a general purpose CPU with usable results instead of just some academic "oh, interesting". (Also, add media hype.)
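
HenryLoenwind's point about hardened cryptographic hardware has a well-known software analogue: constant-time comparison. A minimal Python sketch (the function names here are mine; in real code you would reach for the standard library's hmac.compare_digest):

```python
def leaky_equals(a: bytes, b: bytes) -> bool:
    # Early exit: the runtime depends on where the first mismatch is,
    # which leaks information about the secret through timing.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equals(a: bytes, b: bytes) -> bool:
    # Touch every byte and accumulate differences with XOR/OR, so the
    # runtime does not depend on *where* (or whether) the inputs differ.
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0
```

Both functions return the same answers; only their timing behaviour differs, which is exactly the "extra measures" hardened processors take in hardware.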

  • @thejedijohn
    @thejedijohn 6 years ago

    Great Video!!!
    I still have some questions:
    What part of the CPU looks at the instructions and evaluates a better order to execute them in? How does that not take more time than just executing them in the order they were given? And do compilers like GCC rearrange the order first, or is it usually the CPU's job? If the C compiler does rearrange the order, can it inform the CPU that the code has already been optimized, so it doesn't waste time checking?

  • @leberkassemmel
    @leberkassemmel 6 years ago +16

    Anyone noticed the CD hanging out of the right iMac?

    • @billparsons3341
      @billparsons3341 6 years ago

      Anyone notice that he was wearing the same shirt in the cache flashback video from a few years ago?

    • @kigtod
      @kigtod 6 years ago +3

      Bill Parsons yes - Looks like Sean's continuity briefing paid off.

  • @47Mortuus
    @47Mortuus 1 year ago +1

    FYI - the way this fictional CPU executes the code also uses instruction-level parallelism. I don't think there is any useful CPU design that has one but not the other, which means they go hand in hand.

  • @colt4547
    @colt4547 6 years ago

    Excellent video. Thank you!

  • @SparxableTunes
    @SparxableTunes 6 years ago

    Dr. Bagley always delivers to the forefront of my curiosities. I hope to be an example of one of the individuals who may never see the footsteps of higher education, and yet prove that we can indeed continue to prove ourselves as veritable complements to the field of computer science.

  • @JoshuaHillerup
    @JoshuaHillerup 6 years ago +30

    I'm confused why the processor would ever do the optimizing, instead of a combination of the compiler/interpreter (for the particular bit of code) and the OS (for different processes and whatnot) doing all the optimizing, since those actually have all the information about what will be run.

    •  6 years ago +32

      Actually, they do not have all the information about what will be run (if they did, programs could be sped up a lot). You have to take into account the dynamic factors: values in cache, branch prediction, utilization of individual cores (hyperthreading), etc. all affect program execution severely, and they're very hard to predict during compilation (although compilers of course try their best, and you can help them with profile-guided optimization).

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago +17

      The processor is the only one that knows whether an item has been fetched from memory previously, and is in the cache, which provides a huge speedup. The compiler cannot possibly know the contents of the cache, although it should do some optimisation of its own.
      BTW, modern software can be rather inefficient, and it if weren't for fast CPUs, things would sometimes go very slowly...

    • @radarspace
      @radarspace 6 years ago +6

      That's exactly how Intel's Itanium CPUs work.

    • @zeikjt
      @zeikjt 6 years ago +5

      If the compiler or interpreter were to try to do it, it would be what's known as a premature optimization, because you'd be optimizing for an assumed/theoretical CPU instead of knowing what it's actually capable of. It could be that your optimizations work well for a select few, or even a great number, of CPUs on the market today, but tomorrow will come and new CPUs will be released, and your modified code could very well perform worse on those. You should just let the CPU itself do what it knows it can do.

    • @JoshuaHillerup
      @JoshuaHillerup 6 years ago

      ZeikJT if your compiler knows which CPU the code will run on (and given the size of actual executable machine code versus the size of storage, there's no reason not to target all existing CPUs), then it can optimize for all of them. If a new CPU is built, you can recompile your code to make it the most optimized.

  • @gideonmaxmerling204
    @gideonmaxmerling204 3 years ago

    With programs like these, many modern CPUs will send a few memory fetch requests one after the other.
    While the CPU is waiting for the memory, it usually does other tasks.
    When the memory arrives, it might arrive out of order (out of order as in: you get b, then a, then d, then c),
    so the CPU will do the calculations in the order of arrival.
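
The comment above can be sketched in a few lines of Python (the latencies are invented; think of some values being cached and others not):

```python
# Four fetches issued back to back; results are consumed in whatever
# order they arrive, not in the order they were requested.
requests = {"a": 7, "b": 2, "d": 5, "c": 9}   # name -> cycles to arrive

arrival_order = sorted(requests, key=requests.get)
print(arrival_order)   # b arrives first, then d, then a, then c
```

Any computation whose operands are all present can then run as soon as its data lands, regardless of program order.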

  • @alancurssow9030
    @alancurssow9030 6 years ago +1

    I like this guy, thank you very much for your time - very informative

  • @irenef8373
    @irenef8373 6 years ago

    Great explanation. Thanks.

  • @momokoko8811
    @momokoko8811 6 years ago +2

    If the assembly was originally written in the optimal order, will the CPU's useless attempt to reorder them cause an overhead?

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago +1

      Not likely. During design and testing the CPU will be optimised to avoid this kind of wastage. Modern processors actually have huge amounts of overhead, but this is all geared towards the fastest outcome. Low-power alternative processors that have less overhead, continue to be available for specialised applications.

  • @peterbustin2683
    @peterbustin2683 5 years ago

    Really very interesting! Thank you..

  • @Revan12345678
    @Revan12345678 6 years ago +1

    Another thing I noticed, which wasn't mentioned in the video, is that reordering the code also opens up register space for reuse.
    For example (6:55)
    load r0;
    load r1;
    add r0 = r0+r1;
    load r2;

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago

      Valid point, but that opens up a whole new layer of complexity...

    • @Revan12345678
      @Revan12345678 6 years ago

      Hey, if designing a CPU was easy, everyone would be doing it xD

    • @KohuGaly
      @KohuGaly 6 years ago +1

      yes, intel CPUs can actually do this. However, they typically do the exact opposite:
      Consider this code.
      ...
      add r0 = r0+r1;
      load r1;
      ...
      Notice that the load instruction needs to wait for the add instruction to finish, because they use the same register. An Intel CPU will simply use a different free register for the load instruction and adjust the rest of the code accordingly.
      ....
      add r0 = r0+r1;
      load r2;
      ....
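
The trick KohuGaly describes is register renaming. Here is a toy Python renamer (a sketch, not how any real CPU is implemented): instructions are (dest, op, srcs) tuples, and every write simply gets a fresh "physical" register so later writes never clash with earlier reads:

```python
def rename(program, num_arch_regs=4):
    # Map each architectural register to its current physical register.
    table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, op, srcs in program:
        phys_srcs = [table[s] for s in srcs]   # read the old mappings
        table[dest] = f"p{next_phys}"          # fresh register per write
        next_phys += 1
        renamed.append((table[dest], op, phys_srcs))
    return renamed

program = [
    ("r0", "add",  ["r0", "r1"]),   # reads the old r1
    ("r1", "load", []),             # would overwrite r1 too early
]
renamed = rename(program)
print(renamed)
# The load now writes p5 instead of p1, so it no longer has to wait
# behind the add: the write-after-read hazard is gone.
```

Real renamers also have to recycle physical registers and unwind the table on mispredicted branches, which is part of the complexity gordonrichardson2972 alludes to.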

  • @awirstam
    @awirstam 6 years ago

    Maybe an off-topic question about CPUs, for the time frame of around 1998-2006: was the PowerPC actually faster than the x86, as Apple always stated, even though the clock frequency was a lot lower?

  • @eldebo99
    @eldebo99 6 years ago +2

    The color palette at 5:53, the left side, with example line "01 LDR R0, a", is challenging to read by my color-deficient eyes. Please reconsider that particular font / background color combo.

  • @jaywye
    @jaywye 2 years ago

    How does an out-of-order CPU work? Is there a separate module that reorders instructions?

  • @Disthron
    @Disthron 6 years ago

    Super scaler? There was Sega arcade hardware called the Sega Super Scaler, though I think that was referring to its ability to scale sprites. Look at games like After Burner, OutRun and Thunder Blade, just to name a few.

  • @jonahansen
    @jonahansen 6 years ago

    Very well explained!

  • @PaulsPubAndBrew
    @PaulsPubAndBrew 6 years ago +1

    Why wouldn't it take more cycles to analyze and determine an optimal order than you'd save by using that new order? Does the compiler that originally compiled the code handle this? Or is this truly on the fly?

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago

      Modern CPUs have sufficiently complex hardware to analyse instructions several steps before they are actually executed. IMO the example chosen is simplistic, and not a good example of how pipelining works in practice.

  • @bananya6020
    @bananya6020 3 years ago +1

    tl;dr: optimization isn't all about using the fewest instructions, it's about using them in the right order and sometimes using a less "efficient" instruction to achieve parallelization so you can use as much of the CPU's power at once as possible.

  • @ifell3
    @ifell3 6 years ago

    It's mind-blowing to think how much stuff is written and executed just for something easy that we all take for granted!!

  • @simonnomis123321
    @simonnomis123321 6 years ago

    Shouldn't you run B, C, and D at the beginning so the multiply can run at the same time as W?

  • @linawhatevs8389
    @linawhatevs8389 6 years ago

    12:40 actually, instruction 8 (MUL) could happen earlier, during 6 and 7. It still wouldn't be faster than the reordered code, though.

  • @richardmiklos
    @richardmiklos 6 years ago +1

    Can the clones execute Order 66, while the CPU is executing these instructions? I mean they don't depend on each other or anything.

  • @dharma6662013
    @dharma6662013 6 years ago +14

    Wouldn't the time taken by the CPU to re-order the instructions wipe out any time gained by being able to perform those instructions in parallel? In other words, re-ordering the instructions makes it quicker to do them, but you waste time re-ordering before you can start.

    • @mduckernz
      @mduckernz Před 6 lety +19

      dharma6662013 No, as this is usually performed by the decoder; the ALU and L/S units aren't yet involved. At this stage the CPU will also check whether the decoded operation needs data that is not in cache - if so, it will be prefetched, so that it is in cache when it's needed later. This is also where branch prediction comes in - if a branch hasn't been executed yet, the CPU doesn't know which branch's data will be needed, so it gathers the data for the branch it predicts will be taken, based on previous behaviour. It may also perform speculative execution (this depends on the design of the specific CPU implementation).

    • @dharma6662013
      @dharma6662013 6 years ago

      Please forgive my ignorance, but that just seems to "kick the can down the road". Something, somewhere, has to spend time re-ordering things so that the CPU can run things faster. How do we know, and how do we measure, how the time used re-ordering compares to the time saved *by re-ordering*?

    • @vringar9792
      @vringar9792 6 years ago +1

      dharma6662013 I would assume that chip designers and their respective companies have done quite some testing on this.
      You might want to look up which generation of chips was the first one to implement such a thing and how much faster they got.

    • @vringar9792
      @vringar9792 6 years ago +1

      dharma6662013 tl;dr: thinking about how long something might take is faster than doing it.

    • @DFPercush
      @DFPercush 6 years ago +15

      CPUs have an instruction prefetch where the next instructions are loaded into cache before they are executed, usually in 16-byte segments. That gets into branch prediction, and what if you jump to a different address. But the main takeaway regarding instruction reordering, and pipelining in general, is that it can be done _combinatorially_ - meaning a logic circuit that does not use clock cycles, but acts as a direct function on its own. As soon as you feed in the input, given some gate delays, the output appears on the other side. For the purposes of this discussion, just think of it as being an instant process. It's a very long and complicated "if" statement that happens all at once in hardware.

  • @pontuz2
    @pontuz2 6 years ago +5

    Is there any overhead in the CPU by re-ordering the instructions during OOE?

    • @FrodorMov
      @FrodorMov 6 years ago +6

      Well, the CPU's execution of instructions is not what does the reordering. Within the CPU, obviously, some component is required to analyze instructions and their dependencies in order to re-order them. This costs some area on the chip, and energy, but in the end it should make execution faster.

    • @pontuz2
      @pontuz2 6 years ago +3

      Thanks for the reply. Now that I think about it, the overhead of a potential re-order (+ new execution time) obviously has to be smaller than the original execution time in order to actually enhance the performance.

    • @gordonrichardson2972
      @gordonrichardson2972 6 years ago +7

      The main benefit of out-of-order execution is not to re-order the instructions, but to ensure that the CPU doesn't sit idle while waiting for data to be fetched from memory. In almost all cases there is something else useful that can be done, rather than doing nothing!

    • @xponen
      @xponen 6 years ago

      What if we re-order the instruction ourselves? would the CPU still do the re-ordering part?

    • @BrianCairns
      @BrianCairns 6 years ago +13

      In short, yes. Out-of-order designs typically require *much* more die area compared to an in-order design, and they also tend to use more power. In-order designs need higher clocks to have the same performance as an out-of-order design, but they still tend to be more efficient for low-medium performance levels. For the highest performance, you just can't clock an in-order design any higher (or it becomes inefficient to do so), and an out-of-order design is better.
      There are a number of modern, medium-performance in-order designs for exactly this reason, most notably the ARM Cortex-A53, which is the primary core used in virtually every low-end and mid-range smartphone (because of cost). The Cortex-A53 is also paired with higher-performance cores in higher-end smartphones, which allows the higher-power out-of-order cores to shut off when the phone is idle or under light loads (ARM calls this big.LITTLE; there's also a new version called DynamIQ).
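
The trade-off this thread debates can be made concrete with a toy cycle count. The sketch below uses invented latencies (LOAD = 3 cycles, ADD = 1) and a deliberately crude made-up model: the in-order machine issues one instruction per cycle in program order and stalls behind anything waiting for operands, while the idealised out-of-order machine starts each instruction as soon as its operands are ready:

```python
LAT = {"LOAD": 3, "ADD": 1}   # hypothetical latencies in cycles

# A small sequence in the spirit of the video's example:
# three loads feeding two dependent adds.
program = [                   # (name, opcode, dependencies)
    ("ld_b", "LOAD", []),
    ("ld_c", "LOAD", []),
    ("add1", "ADD",  ["ld_b", "ld_c"]),
    ("ld_d", "LOAD", []),
    ("add2", "ADD",  ["add1", "ld_d"]),
]

def in_order_cycles(prog):
    # Strict program order: a stalled instruction blocks all later ones,
    # so ld_d cannot even start until add1 has issued.
    finish, t = {}, 0
    for name, op, deps in prog:
        start = max([t] + [finish[d] for d in deps])
        finish[name] = start + LAT[op]
        t = start + 1          # next instruction issues a cycle later
    return max(finish.values())

def out_of_order_cycles(prog):
    # Idealised OOO (unbounded execution units): only the dataflow
    # critical path limits the total time.
    finish = {}
    for name, op, deps in prog:
        start = max([0] + [finish[d] for d in deps])
        finish[name] = start + LAT[op]
    return max(finish.values())

print(in_order_cycles(program), out_of_order_cycles(program))  # 9 5
```

The gap (9 vs 5 cycles here) is what the reordering hardware buys; whether that is worth the extra die area and power is exactly the question pontuz2 and BrianCairns are weighing.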

  • @DanielMarrable
    @DanielMarrable 6 years ago +1

    I would like to see him explain hyper-threading

  • @RPG_ash
    @RPG_ash 6 years ago

    Very interesting, thanks.

  • @rcookie5128
    @rcookie5128 6 years ago

    Super informative!!

  • @ms-ex8em
    @ms-ex8em 3 years ago

    Did Lander have sound too?? Thanks.

  • @ms-ex8em
    @ms-ex8em 3 years ago

    Hello did Lander ever have sound at all?? Thanks.

  • @TheDuckofDoom.
    @TheDuckofDoom. 6 years ago

    And now we move to multi core cache management and prefetching?

  • @skyler114
    @skyler114 4 years ago

    Literally programming a queue problem for an assignment as I'm listening to this

  • @retop56
    @retop56 6 years ago

    Great video.

  • @dichebach
    @dichebach 6 years ago

    Interesting stuff!

  • @HerrLavett
    @HerrLavett 6 years ago

    Nice! Thank you!

  • @WanderAway
    @WanderAway 6 years ago

    While we're here, may I suggest another video on how adders/multipliers are built in the CPU itself? Maybe explain the difference between ripple carry adders and carry lookaheads and that kind of stuff :D

  • @Treviath
    @Treviath 6 years ago

    Needs a follow-up video on how the bugs themselves work

  • @luckyluckydog123
    @luckyluckydog123 6 years ago +1

    BTW I think the Pentium Pro from 1995 was the first Intel CPU with out-of-order (as well as speculative) execution. The original Pentium (1993) didn't support those features, AFAIK.

    • @jasondoe2596
      @jasondoe2596 6 years ago

      I think the Pro was indeed the first Intel with speculative execution, not sure about out-of-order.
      *edit:* apparently both

    • @snkline
      @snkline 6 years ago

      The original Pentium was superscalar but didn't support OOE; that is correct. In the P5's case it had two execution units that could execute instructions in parallel, but it didn't make any decision more complicated than "Can I execute the next instruction in the second pipeline or not?". The Pentium tried to pair off instructions: pairs could enter both pipelines, while unpaired instructions could only enter the primary pipeline.

  • @flyball1788
    @flyball1788 10 months ago

    Spent my life on the H/W side of the fence as a developer, and have NEVER understood why problems like this are always moved into H/W. They could be addressed by architecture-specific compilers, written once and used once to generate optimised code; instead we create massive complexity in H/W (and hence bugs that turn up months later and can't be retro-fixed) and burn power on every single execution cycle, on every single machine, every single time it runs that bit of code.
    I agree that, usually, generalisation = slow and optimisation = complex, but surely it's only logical to put the complexity into the part of the system that can be easily changed when problems arise (as they always do with complexity), and which only costs effort/energy/time once, at the start of the process. For H/W, the KISS mantra reigns supreme and complexity should be reserved for those things that can't be done up-front.

  • @KipIngram
    @KipIngram 2 months ago

    I think we took a misstep in processor design decades ago. Modern processors have become so complex that no one person can understand all of them (I mean really, REALLY understand - down to the gate level of what's going on in all cases). As a result, we wind up with things like Spectre/Meltdown and so on, which happen because the left hand doesn't know what the right hand is doing. What we chose to do decades ago was to add complex logic to our cores, in an effort to get them to execute code faster. We've gotten to the point where all that stuff represents more of the logic on the chips than the actual compute logic does.
    What we should have done instead was to embrace the multi-core idea much, MUCH sooner. We should have kept our cores dirt simple, and just piled more and more and more of them onto the chip. Use ALL of the logic for the business of computing. Of course, this would have required us to face multi-thread programming much sooner than we otherwise did, but we've wound up having to face it anyway. If we'd just swallowed that pill sooner then we would NOT have processors that no one can understand and I wager that we would have much more secure, reliable systems that didn't plague us with all of the difficulties that our current processors do.
    You can't really say "That wouldn't have worked as well," because we DON'T KNOW. Software would have evolved in a different way, and we don't have the software we'd have gotten from that other path, so we don't really know where we'd be on overall performance at this point.
    We let the tail wag the dog at every turn, though, and now we are where we are. I don't know if there will ever be a way out. Generally speaking, though, I oppose letting whatever body of legacy software we happen to have "at the moment" dictate how we design future hardware. The hardware design should lead, and the software design should follow.

  • @y__h
    @y__h 6 years ago +2

    On a serious note though: rather than a superscalar architecture, isn't it more effective to put two pipelines in the CPU, with both of them sharing the same execution units?

    • @postvideo97
      @postvideo97 6 years ago +1

      Yoppy Halilintar This is what SMT does I believe.

    • @galier2
      @galier2 6 years ago +1

      Short answer. No.

    • @mduckernz
      @mduckernz Před 6 lety +2

      postvideo97 In a sense, yes, except that there is only one pipeline. While one thread has a particular execution unit tied up - say, waiting for data to arrive from main memory, which can take hundreds of operations in CPU time - you can instead execute operations from a different thread that doesn't need that data or that execution unit. (Note that such a wait would occur due to a failure in branch prediction: normally the CPU would have noticed ahead of time that the data would be required and requested it in advance, so that it would already be in cache or even a register - unless it predicted, wrongly, that it wouldn't be required.)

    • @jasondoe2596
      @jasondoe2596 Před 6 lety

      Yoppy Halilintar, two pipelines sharing the same execution units is pretty much _the opposite_ of what you want, because delays during the execution and complex dependencies would "stall" _both_ of them.
      *edit:* Matthew is right, that's not what SMT (aka hyperthreading for Intel) does.

    • @jasondoe2596
      @jasondoe2596 Před 6 lety

      Guy Maor, how does multicore "share the same execution units" ?!

  • @Brutaltronics
    @Brutaltronics Před 6 lety +8

    the whole freaking system is out of order!

    • @RoboBoddicker
      @RoboBoddicker Před 6 lety

      Cause when you stick your hand into a pile of goo that was your BEST FRIEND'S FACE, you don't know what to do!!

  • @avrohomhousman5958
    @avrohomhousman5958 Před 4 lety

    is this the same as pipelining? It sounds very similar.

  • @wherestheshroomsyo
    @wherestheshroomsyo Před 6 lety

    4:20 that "c" is moving! What? Did that happen in editing?

  • @MichaelQuantum
    @MichaelQuantum Před 6 lety

    If people would compile their own software, you could do all this optimization with the compiler and CPUs could be a lot simpler with much less power draw while still being just as fast in the final execution.

  • @qwmf05gcpt42
    @qwmf05gcpt42 Před 6 lety

    How will they make future CPUs?

  • @rafaelrui7457
    @rafaelrui7457 Před 6 lety

    Do you have a PATREON page to collaborate?

  • @JoQeZzZ
    @JoQeZzZ Před 6 lety +5

    Wouldn't it be more beneficial to do the multiplying first? Because surely a MULT takes more time than an ADD?

    • @mduckernz
      @mduckernz Před 6 lety +6

      Joris Not necessarily. It depends on the particular values. Some multiplications can be done in a single cycle. Notably, power-of-two multiplications (for integers, anyway) will just be converted to bit-shifts (a single cycle operation), but there are still others that may also take only a single cycle.
      Divisions are worse (again, except powers of two, which are just bit-shifts for integers), particularly modular division. These can take many cycles.
      The implementations of ALUs have many complex tricks to allow for very fast execution - I recommend reading more about them! :)
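      A quick way to see the power-of-two trick mentioned above (a sketch in Python; the values are arbitrary):

      ```python
      x = 13

      # Multiplying by a power of two gives the same number as a left shift,
      # which is why hardware can treat it as a one-cycle operation.
      assert x * 8 == (x << 3)

      # Integer division by a power of two (for non-negative values)
      # is likewise just a right shift.
      assert x // 4 == (x >> 2)

      print("power-of-two multiply/divide match the shifts")
      ```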

    • @nikoerforderlich7108
      @nikoerforderlich7108 Před 6 lety +2

      In this particular case it would! If you fetch d and e first, you can do the multiplication while a, b and c are being fetched.

    • @JoQeZzZ
      @JoQeZzZ Před 6 lety

      Guy Maor yeah, so he showed how the processor would use OOE to speed things up. If it were done right it would choose to do the multiplication first in most cases (since a multiplication consists of bit shifts and adds instead of just adding 2 numbers). This would mean that at the end of the line it would have to wait on an ADD instead of a MULT, which would speed the whole process up slightly

  • @HappyBeezerStudios
    @HappyBeezerStudios Před 6 lety

    For the code shown here, a second load/store unit would speed up the execution immensely.

  • @mrblue728
    @mrblue728 Před 6 lety

    This is such relaxing stuff for my high-level-language-oriented brain.

  • @ITR
    @ITR Před 6 lety

    So you're saying they're not CPU aligned? Do we have to talk about parallel universes?

  • @joshhayes3433
    @joshhayes3433 Před 6 lety

    Having a link to the caching video would be pretty cool.

  • @jamma246
    @jamma246 Před 6 lety

    My knowledge of how a physical processor actually works is low, but I am a mathematician by trade and find this optimisation procedure quite interesting. So I don't know if what I'm about to say actually makes sense. But:
    The two set of instructions in this video only differed in the order of operations. The only data that seems to be needed to run the code in the theoretically most efficient way possible is what dependencies there are between the instructions; whether they can be run concurrently; and the timings that the processes take. I guess the rub is that the latter isn't really deterministic (or perhaps they are up to a reasonable margin of error?).
    Still: is a simple on-the-fly optimisation (that is actually implemented at the moment) essentially one which chooses processes that allow other concurrent ones? If module A of the processor is awaiting a new instruction, then first it looks at those available, then prioritises those which allow, say, for a computation on a currently unused module B (with slower components of the processor perhaps given priority)... and so on in a similar fashion? I guess the mathematical structure I have in my mind is a kind of dependency tree which forms part of the data of the instructions, perhaps with some other weights so as to incentivise some processes (those which take place on slower components of the processor).
    Lots of gaps here, but I find this optimisation problem theoretically quite interesting and would like to know the current state of the art. It reminds me a lot of FP, where because of lazy evaluation you can ensure that functions are performed in an order so as to not have superfluous operations. Sounds like similar ideas could be useful here.
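    The dependency-tree idea above can be sketched as a toy list scheduler (Python; the instruction names and latencies are made up for illustration, and there is no limit on execution units — real hardware adds those constraints):

    ```python
    # Issue an instruction as soon as all of its dependencies have finished,
    # regardless of program order. Returns the cycle each instruction finishes.
    def schedule(instrs):
        """instrs: dict name -> (list of dependency names, latency in cycles)."""
        done = {}
        pending = dict(instrs)
        cycle = 0
        while pending:
            # everything whose dependencies have finished by this cycle can start
            ready = [n for n, (deps, _) in pending.items()
                     if all(done.get(d, float('inf')) <= cycle for d in deps)]
            for n in ready:
                done[n] = cycle + pending[n][1]
                del pending[n]
            cycle += 1
        return done

    # a + b + c + d * e: the multiply depends on none of the adds,
    # so it starts immediately and overlaps with them.
    prog = {
        "add1": ([], 1),            # t1 = a + b
        "add2": (["add1"], 1),      # t2 = t1 + c
        "mul":  ([], 3),            # t3 = d * e  (independent of the adds)
        "add3": (["add2", "mul"], 1),
    }
    print(schedule(prog))  # -> {'add1': 1, 'mul': 3, 'add2': 2, 'add3': 4}
    ```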

  • @velvetsniper
    @velvetsniper Před 6 lety

    you guys really should do a video together with level1techs

  • @nO_d3N1AL
    @nO_d3N1AL Před 6 lety

    I thought it took less than 100 nanoseconds to get data from main memory, not 200. How can we calculate this? Basing it on 4200 MHz RAM.

    • @overwrite_oversweet
      @overwrite_oversweet Před 6 lety

      For DDR4 4200 RAM with a CAS latency of 19 cycles, the time required to fetch the first word, assuming the appropriate row is already activated, is 9.5 ns. However, each _sequential_ word after that would only need 0.24 ns to fetch, meaning 4 contiguous words would only require about 10.25 ns and 8 would require only about 11.25 ns.
      Of course, if the next required word is in another column, you would have to wait the 9.5 ns again, and if it's in another *row*, then you'll need to wait even longer, as your RAM will need to be issued the Precharge command, and then the Active command on the correct row before the next Read command can be issued.
      The ALU, OTOH, would usually only need one CPU clock cycle to complete whatever it's doing, especially for a simple operation like addition or multiplication, which is on the order of 0.24 ns. Some ALUs can even do multiple such operations in a single cycle, and if you were using floating point instead of integers, it is relatively common to do multiply and add in one operation.
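      A back-of-the-envelope version of the arithmetic above (Python; this assumes DDR4-4200's double data rate gives a 2100 MHz I/O clock, which lands slightly under the comment's rounded 9.5 ns figure for CL19):

      ```python
      # DDR4-4200: 4200 MT/s on a 2100 MHz I/O clock (two transfers per clock).
      transfers_per_ns = 4.2
      io_clock_ghz = 2.1

      word_time = 1 / transfers_per_ns   # ~0.24 ns per sequential word
      cas_time = 19 / io_clock_ghz       # CL19 -> ~9.05 ns to the first word

      four_words = cas_time + 3 * word_time    # ~9.8 ns for 4 contiguous words
      eight_words = cas_time + 7 * word_time   # ~10.7 ns for 8

      print(round(word_time, 3), round(cas_time, 2))
      ```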

  • @GarethHall
    @GarethHall Před 6 lety

    Hmm interesting, I was unaware of the actual implementation of orders.

  • @MatkatMusic
    @MatkatMusic Před 6 lety

    man, talk about a fantastic breakdown of the topic!

  • @BEP0
    @BEP0 Před 6 lety +1

    Nice.

  • @Johanniscool
    @Johanniscool Před 6 lety

    The best part is when he uses the computerphile paper in a radically different orientation

  • @isaak.studio
    @isaak.studio Před 6 lety +1

    Is it (a+b+c+d)*e or a+b+c+(d*e)?

  • @abcdefghilihgfedcba
    @abcdefghilihgfedcba Před 6 lety

    Interesting!

  • @antoineroquentin2297
    @antoineroquentin2297 Před 6 lety +3

    jokes on me, i'm still using an in-order CPU (D2700)

  • @umaldo7
    @umaldo7 Před 6 lety

    Cooooooool!!!!

  • @ThorkilKowalski
    @ThorkilKowalski Před 6 lety

    I think the 386 was the first commercial superscalar processor.

  • @RowenStipe
    @RowenStipe Před 6 lety

    7:35 We've gone from landscape to portrait !

  • @lucianodebenedictis6014
    @lucianodebenedictis6014 Před 6 lety +1

    Take a shot every time he says "load store unit"

  • @VivekYadav-ds8oz
    @VivekYadav-ds8oz Před 3 lety +2

    Does CPU decide all this in real time? How does it do all this?! Isn't it just supposed to be an electromechanical part? If no software intervention occurs here, this might as well be black magic to me.

    • @gogokowai
      @gogokowai Před 2 lety

      I have the same question. I'm having trouble imagining how it could possibly be faster to make a bunch of checks on multiple instructions and cache states than it would be to just perform the add/multiply.

  • @Schnack21
    @Schnack21 Před 6 lety

    Shouldn't we perform the multiplication first in this equation anyway?

  • @-42-47
    @-42-47 Před 6 lety

    Interesting, though it sounded like CPUs were out of order rather than (still) being out of order.

  • @terrahertz5284
    @terrahertz5284 Před 6 lety

    I didn't see any initial Clear Carry.

  • @sicksock435446
    @sicksock435446 Před 6 lety

    This video taught me how to play the game Silicon Zeros...

    • @Roxor128
      @Roxor128 Před 6 lety

      Thanks for reminding me I need to put in some more time on that. I'm still in the early piece-of-cake puzzles. Well, they certainly are compared to where I got up to in TIS-100 and Shenzhen I/O.

  • @halistinejenkins5289
    @halistinejenkins5289 Před 6 lety +1

    a man's man

  • @kevincozens6837
    @kevincozens6837 Před 6 lety

    Nice explanation of "out of order" execution. I knew you were going to make one minor mistake, not that it matters for the point discussed in this video. You threw in multiply as the operation before that last variable. You didn't take into account the typical order of operations. The multiply would be executed first.

  • @KX36
    @KX36 Před 6 lety

    You're out of order! You're out of order! The whole CPU is out of order! They're out of order!

  • @StanislavPozdnyakov
    @StanislavPozdnyakov Před 2 lety

    Did he say that the ARM architecture is implied?

  • @raykent3211
    @raykent3211 Před 6 lety

    Never had the problem with an Atmel s1200,

  • @policyprogrammer
    @policyprogrammer Před 6 lety

    At the end of this video he says something that I think is correct, but the entire tech media has gotten wrong about Spectre / Meltdown, perhaps because the people who wrote the Spectre and Meltdown papers got it wrong themselves. Spectre is a class of attacks that takes advantage of speculative execution. The attack concept does NOT rely on out-of-order execution.
    It could very well be that OOO machines make it easier, or that only the OOO processors run far enough ahead into the speculative path to pull this attack off, but conceptually, Spectre is a speculation issue, not an OOO issue.

    • @gordonrichardson2972
      @gordonrichardson2972 Před 6 lety

      Probably true, but AFAIK all CPUs that run speculative execution, also run out-of-order execution. The reality is likely to be messy...

    • @policyprogrammer
      @policyprogrammer Před 6 lety +1

      Well, in PC-land, it all went OOO with the Pentium Pro, but the Pentium Classic and its variations had a branch predictor. But it also only had a 5 stage pipeline and dual issue, only one of which could handle a load.
      You know that to "surface" data, the meltdown code example requires the ability to get "far enough ahead" to do a speculative load followed by a second speculative load whose address depends on the value loaded in the first.
      I don't think that's possible in a short pipeline without many execution units, so older processors probably are not subject to this exploit.
      OTOH, there may be modern in-order processors that have deeper pipelines and are superscalar with an LS unit and two ALUs that could be exploited. Some of the more modern ARM processors might qualify. ARM11 implementations are 8 and 9 stages deep. I think most (all?) of the modern ARM "A" cores are OOO, but I would not be surprised to see that some architectural licensees have built their own cores that are deep, SS, but not OOO. In MIPS-land, it may be similar.

  • @EmilBozhilov
    @EmilBozhilov Před 6 lety

    are in order processors affected by spectre and meltdown then??

    • @FrodorMov
      @FrodorMov Před 6 lety

      Not necessarily. Spectre is because of speculative execution (branch prediction), not OOO execution.

    • @Tehom1
      @Tehom1 Před 6 lety

      No, he asked if *in order* processors are affected. I expect not, since speculative execution is necessarily an out-of-order behavior.

    • @FrodorMov
      @FrodorMov Před 6 lety

      I'm not sure I agree, but this is a matter of definition. I had the same thoughts, but argued myself that speculative ex isn't the same as ooo execution. Maybe it is. I mean, in spec. ex. you're not executing instructions in a different order... Just sometimes you're 'backtracking' a bit, on a wrong prediction :p

    • @Tehom1
      @Tehom1 Před 6 lety

      OK, fair enough.

  • @mikeklaene4359
    @mikeklaene4359 Před 6 lety

    Speculative execution is NOT the problem. The fact that another process can access the results of the execution IS the problem.
    The WALL between separate processes is not being enforced.

  • @asailijhijr
    @asailijhijr Před 6 lety

    Don't most languages evaluate expressions from right to left?

  • @DanRoxtar
    @DanRoxtar Před 6 lety +3

    Damn that shirt is fly

    • @MrKinir
      @MrKinir Před 6 lety +1

      His shirts are magnificent.
      He's the embodiment of British style.
      Like, weird funky shirts.
      Reminds me of James May.

    • @SproutyPottedPlant
      @SproutyPottedPlant Před 6 lety

      He is very fly!

    • @luppa79
      @luppa79 Před 6 lety +1

      If you like British guys wearing funky shirts, you should also watch Curious Droid videos.