are stack based vms really slower?

  • added 31 May 2024
  • exploring stack & register vm performance in more technical depth:
    how dependencies between instructions affect performance.
    previous video: • why my scripting langu...
    lecture series: • Design of Digital Circ... (you can probably start with episode "11: Microarchitecture")
    music:
    Rush Hour Shift - Speedy The Spider
    more stuff & links:
    i did promise that i'd release the scripting language's source code with my next video, so here it is: github.com/leddoo/kibi but you probably won't want to look at it yet. it's still quite messy & i've basically not worked on it since the previous video. now that this video is done, i'll take a quick break and get back to work on the language.
    oh, and here are the motion canvas components used in this video:
    stack: gist.github.com/leddoo/3d1002...
    cpu: gist.github.com/leddoo/68c719... (this one is not pretty :D)
    and the rust vms: gist.github.com/leddoo/1134b0...
    i really wasn't prepared for my last video to take off like that. for this video, i put a lot of pressure on myself to hit that same level of quality. which both cost me a lot of time and a bit of my sanity :D
    but i'm very glad that i pushed through, and i hope that you enjoyed the video!
    chapters:
    00:00 - intro
    00:25 - stack vms are slower, he said
    01:35 - code kinda bad tho
    03:07 - more = better!
    04:02 - "lunch break"
    04:48 - modern cpus go brrr (if u let em)
    06:58 - dependencies kinda bad tho
    10:27 - "conclusion"

Comments • 65

  • @devtadeo
    @devtadeo 1 year ago +86

    Me wondering why he uses a knife to point around

    • @pp_up
      @pp_up 1 day ago

      It's because... it's _pointy_

  • @joseph199627
    @joseph199627 1 year ago +67

    Very interesting, I imagine it took a lot of head scratching to really understand what was going on with those experiments. Great video

    • @leddoo
      @leddoo  1 year ago +15

      thanks!
      yeah, took me about a week to figure out what was going on :D
      initially, i thought it was just about the number of memory accesses, but things were just not adding up.
      the experiments from the video are the ones that convinced me it was about dependencies. but yeah, there were more :D

  • @isotoxal
    @isotoxal 1 year ago +19

    You are now in the list of my favourite youtubers

    • @leddoo
      @leddoo  1 year ago +5

      damn, big words!
      thanks man 😎

  • @JoeHinkle11
    @JoeHinkle11 1 year ago +19

    I haven't seen many channels focus on programming language development like yours. Amazing job! Also, I love your animations. Really makes it clear what's happening when you explain concepts

  • @SaHaRaSquad
    @SaHaRaSquad 1 year ago +12

    I once wrote a small stack-based vm but had two bits left over in the encoded instructions, so I decided to just have two independent stacks and the last two bits determined which stack would be read for arguments and which would receive the result.
    I didn't do a lot with it so I'm not sure about how this affected performance, but I think it could allow quite some optimizations.
    Also I saw a talk about a very small chip running Forth code, and it used a stack as well as two registers, so a bit of extra space for temporary values really pays off I guess.
    Oh, and because I saw your lisp-based syntax: It is surprisingly easy to write a simple Lisp-to-Lua transpiler. Lua lets you use most things as expressions and also supports tail-call optimization, so you can basically map Lisp expressions 1:1 to equivalent Lua code.
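A two-stack design like the one described is easy to prototype; here's a hypothetical minimal sketch (not the commenter's actual VM), where the low two bits of each opcode select the argument stack and the result stack:

```python
# Hypothetical sketch of a two-stack VM: the low two bits of each
# instruction select which stack supplies arguments and which one
# receives the result.
def run(program, data):
    stacks = ([], [])  # two independent value stacks
    for op in program:
        src = stacks[op & 1]         # bit 0: argument stack
        dst = stacks[(op >> 1) & 1]  # bit 1: result stack
        kind = op >> 2               # remaining bits: the operation
        if kind == 0:                # PUSH: take the next literal from `data`
            dst.append(data.pop(0))
        elif kind == 1:              # ADD: pop two from src, push sum to dst
            b, a = src.pop(), src.pop()
            dst.append(a + b)
    return stacks

# push 2 and 3 onto stack 0, add them, send the result to stack 1
PUSH, ADD = 0, 1
prog = [PUSH << 2 | 0b00, PUSH << 2 | 0b00, ADD << 2 | 0b10]
print(run(prog, [2, 3]))  # ([], [5])
```

Because the two stacks are independent, a sequence can keep a temporary on one stack while working on the other, which is roughly the optimization opportunity the comment alludes to.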

  • @JoStro_
    @JoStro_ 1 year ago +4

    fascinating, i've never considered this aspect of performance before.

  • @OMFGxIamxAxNinja
    @OMFGxIamxAxNinja 1 year ago +8

    Wow, this just popped up in my feed, but I really enjoyed this video! You made it really interesting!

  • @randomstuffgamess
    @randomstuffgamess 1 year ago +16

    Awesome video! I really like the 'modern cpus go brr (if you let them)' section. And the animations really helped with clarity. How did you make them? And your explanations are really on point.

    • @leddoo
      @leddoo  1 year ago +4

      thanks! the stack & cpu were animated using motion canvas (i think i left links to the components in the description, but they're not updated to the latest version of motion canvas)

  • @j-r-hill
    @j-r-hill 11 months ago +4

    Strongly recommend looking into work done by Anton Ertl if you're interested in stack machine compiler optimizations

  • @shilangyu
    @shilangyu 1 year ago +5

    Subscribed. High quality content and touching exactly the topics I love.

  • @AK-vx4dy
    @AK-vx4dy 1 year ago +4

    Nicely explained, good work !! Keep explaining :)

  • @JH-pe3ro
      @JH-pe3ro 29 days ago

    There's a strong path-dependency to how we've ended up in "registers are faster", I think (and I was aware of that before the video, though it's a great explanation of why); register architectures are intuitive when computers are viewed in terms of arithmetic processing, and they map well to local variables. Therefore we have hardware that engages with that, provides a lot of enumerated registers, and asks for languages to utilize all of them, and that has a consequence on VM design, since the VM's goal is to remap program description to efficient hardware utilization: if it looks stack-y, then the JIT rewrites it to register-y code.
    On the other hand, the CPU architecture can avoid this and do something like Minimal 64x4, a non-pipelined hobby computer using discrete logic: it achieves very good performance at a relatively low clock rate by reducing the register set and optimizing the instruction set around zero page memory (a construct which exists on the 6502 as well, but is used as a supplement to registers), thus algorithm descriptions are smaller and the total sequential work is smaller than on other 8-bits, despite having a lower "MIPS" rating. I believe the distinction between VMs would be relatively less important on Min 64x4 because the stack VM's instruction set would also benefit from the ability to code random access efficiently.
    I've started to think that local variables are worth questioning, though. They serve a certain "black-boxing" purpose of presenting short routines as a matter of loading up the registers in an initial state, doing the computation, and then writing that back using the stack as a protocol between that routine and the rest of the program. But the usual alternative to a stack protocol is a buffer, and buffers have certain benefits in terms of coordination of resource use. If the buffer were implemented like a ring buffer and the computer optimized around that, it presents ideas like the Mill computer's "belt", an interesting concept, albeit one that sadly still hasn't shipped anything.

  • @kevinrobben1999
    @kevinrobben1999 1 year ago +2

    Learned so much from your videos! Thank you, subbed

  • @Diamonddrake
    @Diamonddrake 11 months ago +1

    Very well explained!

  • @evanfreethy2574
    @evanfreethy2574 1 year ago

    This is my favorite type of video 🎉

  • @dimitar.bogdanov
    @dimitar.bogdanov 1 year ago +2

    Really interesting!

  • @luna010
    @luna010 1 year ago +2

    great video. I'd be curious to see how your scripting language compares to VFX Forth (which is the fastest Forth compiler/interpreter afaik).

  • @cemgecgel4284
    @cemgecgel4284 1 year ago

    Great video!

  • @thomasmewily4012
    @thomasmewily4012 1 year ago +4

    This is exactly the channel I needed: clean, concise, and a really good comparison and explanation of Stack VMs vs Register VMs, thank you!
    Just one question: when you have a stack VM, it is like having an infinite number of registers. But when you are using a register VM, you usually get a limited number of registers.
    What happens if you use more variables than the number of available registers in a register VM?
    And how does the VM handle that, or even your computer if you compile it instead of interpreting it?

    • @leddoo
      @leddoo  1 year ago +4

      the register vm in my scripting language is currently limited to 255 registers (one byte per operand).
      so the compiler would raise an error (currently just panics, lol).
      i may add instructions to use more than 255 registers later, not sure.
      python has a special instruction for when there are more than 255 local variables (docs.python.org/3/library/dis.html#opcode-EXTENDED_ARG )
      in general, compilers tend to have all kinds of limits that you never notice during regular use (like max number of nested blocks, etc).
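Python's EXTENDED_ARG escape hatch mentioned above is easy to observe with the `dis` module; this sketch (the name `many_locals` is made up) generates a function with 300 locals, so some operand indices no longer fit in one byte:

```python
import dis

# Build a function with 300 locals; local indices past 255 force
# CPython to emit EXTENDED_ARG prefixes in the bytecode.
src = "def many_locals():\n"
src += "".join(f"    v{i} = {i}\n" for i in range(300))
src += "    return v299\n"
ns = {}
exec(src, ns)
fn = ns["many_locals"]

print(fn.__code__.co_nlocals)  # 300
# any instruction operand > 255 needed an EXTENDED_ARG prefix
print(max(i.arg for i in dis.get_instructions(fn) if i.arg is not None))
```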

    • @thomasmewily4012
      @thomasmewily4012 1 year ago +1

      ​@@leddoo Yeah I was thinking about tricky/edge cases
      Thanks for the link btw ^^

    • @oblivion_2852
      @oblivion_2852 1 year ago +2

      Funny enough I came across the max number of arguments in a java function call because I was transpiling from a language that had arbitrary-length argument lists. It was a really dumb concat function that had 700 arguments, but that day I learnt that Java only supports 254 arguments (since the class itself is always passed as an argument)

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459
      Wow, I wonder in which context you managed to have a concat function with more than 700 arguments? (Maybe the function call was written by another program)
      Looks like you reached some cool hidden technical details too. From memory, in C it is at least 127 arguments.

    • @thomasmewily4012
      @thomasmewily4012 1 year ago

      @oblivion8459 The translated function can also take a single list or something iterable that contains the variadic arguments, but I guess it's just better to not handle the case and see where the weird code is

  • @flareflo362
    @flareflo362 1 year ago +3

    What are the various profiling tools used here? They look really neat!

    • @leddoo
      @leddoo  1 year ago +2

      intel's vtune (on windows, i'm using the msvc integration)
      and apple's instruments (for the macos stuff)

  • @Richard_Knusprig
    @Richard_Knusprig 1 year ago +3

    I really like your editing style and how you ... aaahm ..., anyway.

    • @leddoo
      @leddoo  1 year ago +1

      asufutimaehaehfutbscueme, anyway

  • @aredrih6723
    @aredrih6723 1 year ago +2

    I'm more familiar with the JVM, and in these stack machines the locals (which include arguments) and the stack are kept separate.
    The stack starts out empty and you have to load values from the locals.
    A bit as if there was a shelf alongside the stack, with instructions to push the value of a local and instructions to save the current top value to a local.
    Basically, these stack machines _have_ registers, they just have 2 operations allowed on them.
    I'm not sure how python does it differently or if I'm misinterpreting the visualisation.
    So does python really load all the arguments at the bottom of the stack?

    • @leddoo
      @leddoo  1 year ago +1

      yes, python puts the locals at the bottom of the stack frame and loads them to the top for operations. (i can recommend messing around with python's `dis` module to see how python implements language constructs in bytecode!)
      from what i've read on wikipedia, the stack in the JVM is more of an abstraction though. the jit turns it into regular native register code. similar to how WASM is stack based, but wasm runtimes convert the code to native register code before execution.
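Following the `dis` suggestion, a quick way to watch locals being loaded to the top of the stack (exact opcode names vary by CPython version, e.g. BINARY_ADD before 3.11 vs BINARY_OP after):

```python
import dis

# Each local must be loaded onto the top of the value stack before
# the interpreter can add it; the add pops two values and pushes one.
def add3(a, b, c):
    return a + b + c

ops = [ins.opname for ins in dis.get_instructions(add3)]
print(ops)
```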

    • @aredrih6723
      @aredrih6723 1 year ago

      @@leddoo Hum, I did a quick test by making an identity function:
      def ident(x): return x
      It compiled to "load_fast" and "return_value".
      I used the command `python3 -m compileall` to get it compiled, removed the "load_fast" from the compiled function (couldn't find how to load arbitrary byte code) and added a nop (byte code 9; for op and arg) after the return to keep alignment (no clue how the format works, just searched for byte code matching the disassembly and replaced it).
      Then I tried to load the module. It worked.
      Then I tried to call the function. It SIGSEGVs.
      I'll need to dig deeper to find out if my "unconventional" edits are the cause, or if the function needs the "load_fast" to put a value on the stack.
      Or I might be completely misunderstanding what you mean by "stack frame". Maybe what you call "stack frame" is what java calls "slot" and wasm "local".

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 yeah, load_fast loads the value to the top, return_value returns that value.
      maybe it crashes, cause python does not expect the parameters to be popped?
      by stack frame i mean the area of the stack, reserved for the function invocation. (en.wikipedia.org/wiki/Call_stack#Structure )
      you know, actually i'm not sure whether parameters are on the stack in python (that's what i do in my VM). so maybe it crashes, cause return_value tries to pop from an empty stack?

    • @aredrih6723
      @aredrih6723 1 year ago +1

      @@leddoo So, I tried digging a bit, and dis.code_info seems to confirm the existence of locals (which would be registers only supporting writes to and reads from the stack).
      I also see python encodes the (I assume maximum) size of the stack and the number of locals.
      (also, "def f(a, b) : return 0" is marked as having 2 locals and a stack of 1)
      Now, storing the arguments at the bottom of the stack is a possible implementation, but you would either need to copy elements from an absolute index, or every load would depend on the current size of the stack, so you'd have to track the stack size either way.
      Still, it seems to allow for some neatly optimized code so far, so I'm curious how the vm will mature.

    • @leddoo
      @leddoo  1 year ago

      @@aredrih6723 in my vm, the call instruction currently takes a variable number of register operands & copies those into the newly created stack frame (so params are the bottom `n` registers upon entry, but the callee is free to modify them).
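That calling convention can be sketched roughly in Python (a hypothetical model, not leddoo's actual Rust implementation; the `call` helper and the frame layout are made up): the call names the caller registers holding the arguments, and they get copied to the bottom of the new frame.

```python
# Hypothetical register-VM call: the caller names the registers holding
# the arguments, and the call copies them into the bottom of a fresh frame.
def call(frames, arg_regs, callee_num_regs):
    caller = frames[-1]
    frame = [0] * callee_num_regs          # new frame, zero-initialized
    for dst, src in enumerate(arg_regs):   # params land in registers 0..n-1
        frame[dst] = caller[src]
    frames.append(frame)
    return frame

frames = [[10, 20, 30]]        # caller frame: r0=10, r1=20, r2=30
callee = call(frames, [2, 0], 4)
print(callee)  # [30, 10, 0, 0] — params at the bottom, free to modify
```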

  • @otesunki
    @otesunki 1 year ago +1

    this makes me curious
    is it faster to push a dummy value like 0, then add the top two elements and copy the result over the dummy value?

    • @leddoo
      @leddoo  1 year ago +2

      hmm, not sure what you mean 🤔
      my takeaway from that video was: if two "instructions" access the same memory location, they can interfere. (a "read after read" is fine, but "read after write" causes delays)
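The "read after write causes delays" point can be illustrated with a toy in-order timing model (purely illustrative; it ignores issue width and everything else a real out-of-order core does): an instruction can only start once the registers it reads have been written.

```python
# Toy dependency-timing model: each instruction is (dst, srcs) and takes
# LATENCY cycles; it can only start once all its source registers are ready.
LATENCY = 3

def finish_time(program):
    ready = {}  # register -> cycle at which its value becomes available
    t = 0
    for dst, srcs in program:
        start = max([ready.get(s, 0) for s in srcs], default=0)
        ready[dst] = start + LATENCY
        t = max(t, ready[dst])
    return t

# dependent chain: each add reads the previous add's result (read after write)
chain = [("r1", ["r0"]), ("r1", ["r1"]), ("r1", ["r1"])]
# independent: three adds into different registers, all reading r0
indep = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"])]
print(finish_time(chain), finish_time(indep))  # 9 3
```

The dependent chain serializes (3 + 3 + 3 cycles), while the independent adds all overlap, which is the effect the video measures in the stack vs register VMs.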

  • @nahblue
    @nahblue 11 months ago

    Nice! Does anyone have a link for the pipeline diagram tool?

    • @leddoo
      @leddoo  11 months ago

      link in the description :P
      framework is motion canvas.
      script is not ported to latest version (of mc) however.

  • @blinking_dodo
    @blinking_dodo 1 year ago +2

    This is interesting...

  • @hfspace
    @hfspace 1 year ago

    on the other hand, these smart instructions take up less space in the cpu even if they are dependent, since they only occupy one "execution" slot. So in principle you have more capacity to execute other, completely independent instructions. This seems to be a trade-off here, or am i missing something?

    • @hfspace
      @hfspace 1 year ago

      i guess you need the product of occupied execution slots and the time needed to execute to compare different instructions in absolute terms.

  • @treelibrarian7618
    @treelibrarian7618 11 months ago

    pretty sure the reason the "smart" version is slower is the rotation - since that would have to read and write every entry in the stack memory to shuffle them down by one. better would probably be:
    Load { src: a },
    Load { src: b },
    Store { dst: a },
    Add,

  • @u9vata
    @u9vata 1 year ago

    I understood this before the vid, being a hiperf coding guy... But I am still not totally sure how well an optimizer can optimize stack-based code to avoid these. I used to like Factor, for example, and although I never used it for hiperf code like I did with C++, they claim to have a good stack-based optimizer...
    I mean, for example, a stack-based language optimizer could spot that your load + dup pattern could be exchanged with two separate loads - I see no reason why the optimizer could not have this rule, and this is just one example.
    Generally informative video though. I did not know about the tool that shows that "green / red" thing for pipelining. What is the name of it? Ah never mind... it's vtune. I prefer perf and such tools generally.

  • @bernatrosello4375
    @bernatrosello4375 1 year ago

    I don't think the "parallel" vs sequential argument is necessarily right. Even if you're running both versions with more/fewer variables, you have register usage hazards, which are going to slow you down significantly on a pipelined-datapath CPU.

    • @leddoo
      @leddoo  1 year ago +1

      hmm, which part of the video are you referring to? and which registers - cpu regs or vm regs?

  • @elarajade7893
    @elarajade7893 1 year ago +1

    First ✌

  • @OhadLutzky
    @OhadLutzky 11 months ago +1

    That is super cool, and seems highly relevant to Java as well, which people like to at least pretend can be performant. (JVM is stack-based, Dalvik is register-based)

    • @leddoo
      @leddoo  11 months ago +2

      well kinda :D
      wasm is also stack based, but close to native in perf.
      the question is whether the *interpreter* is stack based.
      java is typically jitted, so it runs natively using the physical registers.

  • @nexovec
    @nexovec 1 year ago

    I'm pretty sure not padding every stack frame with 200 bytes helped in making your interpreter faster than python 😂

  • @themcchuck8400
    @themcchuck8400 10 months ago

    You really need to work on your audio levels. They're all over the place. Keep the microphone a steady distance from your mouth for the entire session, and keep your mouth pointed at the microphone.

  • @HXMCPP
    @HXMCPP 8 months ago

    what do you use to benchmark it? 6:44

    • @leddoo
      @leddoo  7 months ago

      i use vtune on windows (the msvc extension) and instruments for macos.

    • @HXMCPP
      @HXMCPP 7 months ago

      does it work with amd? @@leddoo

    • @leddoo
      @leddoo  7 months ago

      @@HXMCPP i don't think so. amd has "amd uprof", but i'm not familiar with it.

  • @kfsone
    @kfsone 11 months ago

    For some reason people look at dis output and always miss the elephant in the room.
    2 0 LOAD_FAST 0 (a)
    2 LOAD_FAST 1 (b)
    4 BINARY_ADD
    6 LOAD_FAST 2 (c)
    8 BINARY_ADD
    10 LOAD_FAST 3 (d)
    12 LOAD_FAST 4 (e)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (r)
    You're looking at the center column and not the far right.
    2 0 LOAD_FAST 0 (this_couldnt_possibly_be_slower)
    2 LOAD_FAST 1 (could_it)
    4 BINARY_ADD
    6 LOAD_FAST 2 (just_changing_the_variable_names)
    8 BINARY_ADD
    10 LOAD_FAST 3 (couldnt_make_it_slower)
    12 LOAD_FAST 4 (right)
    14 BINARY_ADD
    16 BINARY_ADD
    18 STORE_FAST 5 (i_mean_theyre_just_hash_lookups_arent_they)
    def f(a, b, c, d, e): r = (a + b) + c + (d + e)
    %timeit f(7, 14, 21, 27, 351)
    from os import *
    from sys import *
    from collections import *
    from itertools import *
    from functools import *
    def well_what_about_this(this_couldnt_possibly_be_slower, could_it, just_changing_the_variable_names, couldnt_make_it_slower, right, i_mean_theyre_just_hash_lookups_arent_they):
        i_mean_theyre_just_hash_lookups_arent_they = (this_couldnt_possibly_be_slower + could_it) + just_changing_the_variable_names + (couldnt_make_it_slower + right)
    %timeit well_what_about_this(7, 14, 21, 27, 351)
    (bonus points if you figure out why the imports)

    • @kfsone
      @kfsone 11 months ago

      In my testing, "42 + 27 + 351" takes 25ns. Calling "adder(42, 27, 351)" takes 75ns. Changing adder to "adder(first, second, third)" and the body to just "result = (first + second) + third" adds 1.1-1.6ns to each call.
      Well, 1.2ns on top of 76? No. 50ns of that is the overhead of PyCall; the actual instructions are only 25 + 1.2ns, so those few extra letters in the names accounted for 5% of the cost of that one line of 2 additions.
      And *that* is just for top-level local parameters.

    • @kfsone
      @kfsone 11 months ago

      czcams.com/video/vobGqYSvHCU/video.html

    • @leddoo
      @leddoo  11 months ago +1

      i actually have no idea what you’re talking about 🤔
      the length of the variable names doesn’t matter at runtime. they’re turned into indices. the last column is just for the programmer.
      you seem to make some other point about call overhead, which is something that can matter, but in this video, the function call overhead was insignificant. and there wasn’t any python in this vid.
      i’m just very confused.

    • @0LoneTech
      @0LoneTech 11 months ago

      Those indices are the second rightmost column (outside the parenthesis) and the reason the accesses are labeled "fast". The imports certainly increase the odds of hash collisions, but it only affects the function lookup. Long names take (usually insignificantly) more time to parse, hash and intern (all during compile time), but once interned can be matched by id.
      You might want to rerun those timeit tests a handful of times; in my tests the differences are entirely noise and swing either way.