DIY 256-Core RISC-V super computer
- Added 18 May 2024
- Free Assembly for 1-6 Layer PCBs at JLCPCB, 3D Printing from $0.3, Sign up to Get $60 Coupons here: jlcpcb.com/?from=bitluni (Sponsor)
This new cluster build escalated quickly, especially with the bugs I built in. Here are some specs:
256x RISC-V 48MHz
17x RISC-V 144MHz
640x GPIO
256x ADC
17x 8-Bit bus
The combined single-core clock rate would be 14.7 GHz: not that impressive, but also not too shabby.
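The 14.7 GHz figure can be reproduced by summing core count times clock rate over both chip types. This is only a sanity check of the arithmetic using the numbers listed above; `combined_clock_mhz` is a hypothetical helper name.

```python
# Core counts and clock rates exactly as listed in the specs above.
clusters = [
    (256, 48),   # 256x RISC-V at 48 MHz
    (17, 144),   # 17x RISC-V at 144 MHz
]


def combined_clock_mhz(groups):
    """Sum of (core count * clock in MHz) over all groups."""
    return sum(count * mhz for count, mhz in groups)


total_mhz = combined_clock_mhz(clusters)
print(f"{total_mhz} MHz = {total_mhz / 1000:.1f} GHz")
```

The sum is a simple aggregate of clock rates and says nothing about real throughput once bus overhead is included.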
0:00 Supercluster recap
0:41 Intro
1:41 PCB Design and BU
2:36 JLCPCB
3:30 Assembly
5:14 First tests
5:48 BUS protocol fix
8:28 BUS tests
9:53 Conclusion
Tools and parts (affiliate links):
Preheating Station: aliexpress.bitluni.net/heatin...
Flux: aliexpress.bitluni.net/flux
Syringe Pusher: aliexpress.bitluni.net/pusher
Low Temp Solder Paste: aliexpress.bitluni.net/lowTem...
Tweezers: aliexpress.bitluni.net/tweezers
Edge Connectors: aliexpress.bitluni.net/edgeConn
CH32V003: aliexpress.bitluni.net/ch32v003
CH32V203: aliexpress.bitluni.net/ch32v203
Scope Siglent SDS1104-E: amazon.bitluni.net/siglent4
Digital Probe Siglent SLA1016: amazon.bitluni.net/siglent16
Github Sponsors: github.com/sponsors/bitluni
Patreon: / bitluni
Channel membership: / @bitlunislab
Paypal: paypal.me/bitluni
bitluni live: / @bitlunilive
Twitch: / bitluni
Mastodon: chaos.social/@bitluni
Twitter: @bitluni
Discord: link.bitluni.net/discord - Science & Technology
Dude - use a foot-operated vacuum pen - much quicker & easier than tweezers!
I would like one! Any recommendations?
@johboh I guess the Pixel Pump might be a good candidate. I have not used one myself, but the fact that it's an open project is a good thing. Of course there are cheaper and less capable options, but if you do regular board assemblies, buying a decent and slightly more expensive tool once will save you a lot of time and money over time.
I think your decision to not put everything on one big shared bus was the smart approach. Each input pin on the bus has a small amount of parasitic capacitance, which increases bus loading and requires additional drive current from the output pin driving the bus. That increases dI/dt which means more radiative EMI and crosstalk, and distorts the edges. This is less of a problem with an open drain setup, but still causes slower edge transitions and ringing. The long traces will have a lot of inductance which, left undamped, also tends to cause a lot of ringing. Longer traces also mean you're getting to the point where you're having to model them as transmission lines, since the Nyquist frequency of the design is set by the rise/fall time (not the clock!) and that's very fast on modern ICs - it's pretty common to see frequency components in the 300-800MHz range during transitions, so if you're running traces further than about 9cm you can no longer treat them as lumped lines. Once you get to this sort of scale you typically want to be using bus redrivers to break the bus up into smaller segments to avoid SI/EMI problems.
If you start finding that you have SI issues once you add all the boards, two things you can do are reducing the pullup resistor value and adding a small resistor in series with each IO line. Right now with 5.1kΩ pullups you've got that classic sharkfin shaped clock, where the pullup resistor takes a while to overcome all the parasitic capacitance on the board. You can speed that rising edge up by reducing that pullup resistance - bodging a second 5.1kΩ resistor on top will do that. The falling edge is very fast because the IO pins are actively pulling the bus to ground. This causes big dI/dt spikes at the falling edge, while all that charge stored in the parasitic capacitances rushes through the low impedance path created by the active low-side FET. You can moderate that dI/dt with a small value resistor (e.g. 22Ω) in series with each of the IOs, so the bus is still strongly pulled down but the current isn't controlled only by the Rds(on) of the low-side FET in the IO. Since you've already spun the boards this might be kinda tricky to add - maybe something for a rev2/3? :)
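To put rough numbers on the pullup suggestion: the 10%-90% rise time of a first-order RC edge is ln(9)·R·C, so halving R halves the rise time. The sketch below assumes a total bus capacitance of 100 pF, which is an illustrative guess and not a measurement from the video; `parallel` and `rise_time_10_90` are hypothetical helper names.

```python
import math

BUS_CAPACITANCE_F = 100e-12  # ASSUMPTION: illustrative value, not measured


def parallel(*resistors_ohm):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def rise_time_10_90(r_ohm, c_farad):
    """10%-90% rise time of a first-order RC edge: ln(9) * R * C."""
    return math.log(9) * r_ohm * c_farad


single = rise_time_10_90(5100, BUS_CAPACITANCE_F)
doubled = rise_time_10_90(parallel(5100, 5100), BUS_CAPACITANCE_F)
print(f"one 5.1k pullup: {single * 1e9:.0f} ns")
print(f"two in parallel: {doubled * 1e9:.0f} ns")
```

The absolute values depend entirely on the assumed capacitance; only the 2:1 ratio follows from the resistor change itself.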
It also doesn't hurt that he left the "repetition" and modularity to the board copies. It kind of made me think of repeating code where a loop should be implemented. It is easier to maintain and debug, and it leaves the mind-numbing repetition to the manufacturing. Not to mention he can expand the cluster as needed.
It was fun meeting you in person during CCC last year. Strange to see you pop up in a comment section though.
@modernsolutions6631 I've never been to CCC! EMF Camp 2020, maybe?
I love the random clock variations on the blink sketch. Fun source of entropy.
It's one of the nightmares of electrical designers. It's very hard to synchronize different components at high speed.
It's also sensitive to temperature, so if you have a thermal gradient across the ICs you'll find that some drift faster than others.
@gsuberland heating up half of the boards sounds like a cool idea
@king_james_official hot* idea
@siz1700 ha ha ha!!! (with long pauses in between)
So awesome. IMO Fiasco would be a cool code name for a project or chip.
fiasco 256, that way there can also be a fiasco 10000
The L4Re Microkernel is named Fiasco.
Makes me wish I'd done electrical engineering at university. This level of dev is beyond my capability of simple analog electronics, I'm like a monkey with a spanner. Not enough time in the day now to reskill but your work is inspiring and why I'm subscribed.
Want an easy start? Watch Ben Eater videos! Start with the breadboard series, then the 6502!
@thek3743 Ben Eater is the GOAT. 100% great series. His 12(?) part networking series is also great.
Same here. I'm a software developer, so I don't have much time, but I've always been interested in electrical.
Not sure if you've done this already, but it might make sense for you to have a separate "subnet" for each blade and then only send transmitted data on the inter-blade bus if the destination is outside of that subnet.
Dude, design GPU already
oh my god, you're reinventing the ethernet
@monad_tcp that sort of sub-networked interconnect is common in CPU design as well
@tophyr mfw everything is just ethernet
Can it run Doom?
sounds like a reasonable end goal
Watching your pick and place makes me want to both go into electronics and stay the heck away from it.
"your"??
At 10 pins free per 48MHz CPU, you could connect 23,040 LEDs (or 6 million if they are combined). Enough to make a small terminal screen...or play Bad Apple. With each chip handling 90 LEDs at 48MHz, this thing would push pixels like a monster.
Just need the timing to be perfect.....
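The figure of 90 LEDs from 10 pins matches charlieplexing, where n tri-state pins can address n·(n−1) LEDs. That reading is an assumption, since the comment does not name the technique. Under it, the counts work out as follows (`charlieplex_leds` is a hypothetical helper name):

```python
def charlieplex_leds(pins):
    """Maximum LEDs addressable with `pins` tri-state pins: n * (n - 1)."""
    return pins * (pins - 1)


CHIPS = 256       # 48 MHz chips listed in the video description
FREE_PINS = 10    # free pins per chip, as stated in the comment

per_chip = charlieplex_leds(FREE_PINS)            # each chip drives its own matrix
separate = CHIPS * per_chip                       # all chips, independent matrices
combined = charlieplex_leds(CHIPS * FREE_PINS)    # one giant shared matrix
print(per_chip, separate, combined)
```

The combined case is purely theoretical: a single charlieplexed matrix lights only a few LEDs at a time, so the duty cycle per LED would be tiny.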
first time i've seen tape and tray of parts being used, kudos. i did inkdot for a year because i loved the simplicity and focus it required. they moved me to pin refurbishing when they found out i could do it easily
Ever heard of the "transputer", a 1980s commercial computer made of a collection of thousands of tiny weak processors working in parallel for advanced scientific tasks?
Your cluster reminds me of it.
Retrobytes channel made a video on it several months ago.
we did basic programming on them in the 90's. used for fft audio processing
that sounds pretty much like a gpu with its shader units
Yep, I was reading an article on the Chips and Cheese blog the other day about a Qualcomm mobile GPU & that's what I was thinking @destiny_02
Reminds me of TIS-100
that's how a modern video card works!!! they have thousands of units (they're called differently among gpu manufacturers) that run in parallel executing small programs called shaders, which (oversimplifying now) all determine the color of EVERY pixel on your screen tens of times a second
I am a software person and I built cards with my electronic partner 15 years ago that each card has three microchip processors that communicate with each other on the card in fast serial communication on pullup lines. These cards communicated with other similar cards for ranges of 10 km on a pair of cords that also transferred the energy for the needs of agriculture in the field.
That is a lot of CPU power for some random blinking LEDs :)
Amazing work, what a project! 😮👍
Wow just discovered. Awesome. Can't wait for the next!!
Many kudos for attempting such a "mega-project". No pain no gain...
Your message collision scheme is remarkably similar to how CAN works. It seems you've independently discovered an excellent system. Very impressive.
Kind of, but CAN has a priority system and allows the message of the highest priority transmitter to go through. This is especially important in automotive applications.
It's CSMA-CD. Used most commonly in 802.3 (commonly ethernet) communications.
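The difference between the two can be seen in a small simulation of CAN-style bitwise arbitration. On a wired-AND (open-drain) bus, a 0 from any node overrides a 1 from the others, so a node that sends 1 but reads back 0 knows it has lost and stops, while the winner carries on undisturbed. This is a sketch of CAN's scheme for comparison, not of the protocol used in the video; `arbitrate` and the 8-bit ID width are assumptions.

```python
def arbitrate(ids, width=8):
    """Return the ID that wins bitwise arbitration on a wired-AND bus.

    Assumes at least one contender and unique IDs that fit in `width` bits.
    """
    contenders = set(ids)
    for bit in range(width - 1, -1, -1):              # MSB is sent first
        sent = {node: (node >> bit) & 1 for node in contenders}
        bus = min(sent.values())                      # any 0 pulls the line low
        # Nodes that sent a recessive 1 but read a dominant 0 drop out.
        contenders = {node for node in contenders if sent[node] == bus}
    assert len(contenders) == 1
    return contenders.pop()


print(hex(arbitrate([0x2A, 0x15, 0x33])))
```

The lowest ID always wins, which is the priority mechanism mentioned above. Plain CSMA/CD has no such winner: every colliding sender aborts and retries after a backoff.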
This is the coolest thing I've seen in a while!
Cool project, thanks for sharing.
As always, an amazing project. The funky music for hand SMD assembly *almost* made it look enjoyable 😂
this is nuts! , i love it!
This is so awesome!
Awesome work 😮
that's going to be fun to program
Amazing video! you are teaching a lot of stuff with this.
I would use an active pullup (constant current source) on the bus with so many devices on the bus. It could be a current mirror with two P MOSFET transistors (e.g. BSS84). With 5 mA current, it would probably speed up the communications a lot.
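As a rough comparison of the two pullup styles: a resistor charges the bus exponentially, while an ideal constant current source ramps it linearly, t = C·ΔV/I. The sketch assumes 100 pF of bus capacitance and a 3.3 V supply, both illustrative guesses rather than values from the video, and it ignores the fact that a real current mirror runs out of headroom near the rail. The function names are hypothetical.

```python
import math

C_BUS = 100e-12   # ASSUMPTION: illustrative bus capacitance
V_SUPPLY = 3.3    # ASSUMPTION: supply voltage


def resistor_time_to_fraction(r_ohm, c_farad, fraction):
    """Time for an RC pullup to charge the bus to `fraction` of the supply."""
    return -r_ohm * c_farad * math.log(1.0 - fraction)


def current_source_time_to_fraction(i_amp, c_farad, v_supply, fraction):
    """Time for an ideal constant current to ramp the bus to `fraction` of the supply."""
    return c_farad * v_supply * fraction / i_amp


t_resistor = resistor_time_to_fraction(5100, C_BUS, 0.9)
t_source = current_source_time_to_fraction(5e-3, C_BUS, V_SUPPLY, 0.9)
print(f"5.1k resistor to 90%: {t_resistor * 1e9:.0f} ns")
print(f"5 mA source to 90%:   {t_source * 1e9:.0f} ns")
```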
You're going to run Game of Life on that thing, aren't you?
That or Bad Apple.
That or we get rickrolled.
I don’t know what it is, but I love it! More!
A true work of art!
My hats off to you! 🍻
I love it when blink goes out of sync... it looks like one of Big Clive's "supercomputers" except it really is a supercomputer!
It reminds me of the Lost in Space equipment in the early 1960s!
Great danger.
CSMA/CD reinvented :)
Thats HUGE!!
Great stuff
Heck of a great project! And custom CSMA!
Wow that's insane
Waiting every day for a year for something new was worth it 🥹🥹
after 2 minutes you already deserve a like!
This project reminded me about both the game of life automaton, KISS principle and CD part of CSMA/CD.
If you send a considerable amount of broadcast it makes sense to have a bit after the source address which is only set if it's a broadcast. So you can skip the target address completely.
This only makes sense if you send a lot of broadcast messages, as every unicast message is then 1 bit longer
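The break-even point can be worked out directly. With an A-bit target address and a fraction p of broadcast messages, the flag costs 1 bit on every message and saves A bits on each broadcast, so it pays off when p > 1/A. The 8-bit address width below is an assumption for illustration, and `avg_target_bits` is a hypothetical helper.

```python
def avg_target_bits(p_broadcast, addr_bits, use_flag):
    """Average bits spent on addressing the target per message.

    Without the flag every message carries a full target address.
    With the flag every message carries 1 extra bit, and only
    unicast messages carry the target address.
    """
    if not use_flag:
        return float(addr_bits)
    return 1 + (1 - p_broadcast) * addr_bits


ADDR_BITS = 8  # ASSUMPTION: illustrative address width
for p in (0.05, 0.125, 0.5):
    with_flag = avg_target_bits(p, ADDR_BITS, use_flag=True)
    without = avg_target_bits(p, ADDR_BITS, use_flag=False)
    print(f"p={p}: {with_flag:.2f} bits with flag, {without:.2f} without")
```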
sounds insane in performance, but actually would be roughly equal to 4 cores running at just over 3GHz due to the low clock speed.
that said it does show it is possible, and if this works it will also work with much faster RISC-V chips.
actually in some ARM architectures the cores were designed to be kind of used like this so you could just keep scaling them, there was actually some 1000-core ARM CPU somewhere around 2013 or such, sadly never took off since back then multithreading didn't really practically exist yet, as in basically no software used it, and things like handling large amounts of data at once weren't a thing yet.
that said, RISC-V is open source, so it means it should be possible to actually make a RISC-V CPU which directly combines tons of cores.
if you plan to make something like that I do have a better way for you to try out than using a single bus (or a few buses), since using buses like that can work but can have problems. I roughly designed a new experimental way of doing such multichip communication for the Raspberry Pi Foundation some years ago, actually to try and get them to make a board with way more cores. essentially it is a method giving quite some bandwidth but also large buffering, with chips being able to get the data when they are ready instead of needing to accept it directly. that said, in some cases direct buses might be more useful, luckily in a full CPU design you can make many more buses, both have advantages and weaknesses depending on the loads.
crazy! in a good way!
i just discovered your channel and i immediately subscribed. this project mesmerizes me. keep on!
Ok, but can it run Crysis?
I know people who would go over the edge for your random parts placement :) "All values of similar resistors have to face the same directions"... LOL. Nice one. Wish I had more time to join the livestreams again ...
how is your comment 2h old ? the video was uploaded 5min ago 🤔
@valet_noir Patrons get early access, even if this means only 2 hours in bitluni terms. Other YouTubers are a bit more generous here ;)
All values of all components must face the same direction!!! ;)
"...actually 273 but okay" is the best subtitle for a video in the history of the platform.
You are very close to the original Ethernet CSMA/CD protocol. The XOR checksum has the problem that two colliders can cancel each other: two single-bit errors could result in a correct checksum, making a packet "appear" good. As such, Ethernet uses a CRC. Further, if you detect a collision you "jam" the whole packet with alternating ones and zeros to really mess it up and then do your randomized backoff. What you will find, and you are not the first, is that as you scale, the collisions will increase and the bandwidth will be insufficient. The cores will be data starved. This was the case with the Intel MICs (Knights Corner). They used PCI-E but the issue is the same: multidrop and star topologies oversubscribe easily. You will note datacenters (home of enormous clusters) use leaf-spine (and other) interconnects to mitigate this. But fun nonetheless. So you have a huge number of cores: what will you do with it? What would others in the comments run?
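The checksum weakness is easy to demonstrate. The sketch below flips the same bit position in two different bytes: the XOR checksum is unchanged, while a CRC-8 catches it. The polynomial 0x07 and the sample payload are arbitrary choices for illustration, not what the video or Ethernet uses (Ethernet uses a 32-bit CRC).

```python
from functools import reduce


def xor_checksum(data: bytes) -> int:
    """XOR of all bytes: each bit position is only a parity bit."""
    return reduce(lambda a, b: a ^ b, data, 0)


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


original = bytes([0x12, 0x34, 0x56, 0x78])
# Flip bit 0 in the first two bytes: two single-bit errors in one column.
corrupted = bytes([0x12 ^ 0x01, 0x34 ^ 0x01, 0x56, 0x78])

print("XOR equal:", xor_checksum(original) == xor_checksum(corrupted))
print("CRC equal:", crc8(original) == crc8(corrupted))
```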
You could use the now free command pin to sync all the clocks together
Nice!
Pretty project, thanks for sharing
You may want to decrease the resistance on your clock line. A slow rise time can cause one of the processors to miss a clock and become out of sync with the host.
GAME OF LIFE on this would be insane
Really nice project. What are you using for the top view shots?
4:39
The blink looked like game of life 😂
You may be nuts, but that's much of the fun of watching. This project is a delightful sprawl, full of potential and hurdles. What do you want it to become, beyond the LED art? I mean, is there a target functionality or is the journey the goal?
Well, I guess we'll find out.
just discovered your channel and this is super cool! what was your career path that got you into electronics? thanks!
Very interesting and good job, but what's the next step?
Would you be able to upload those first streams in which you made the cluster and the protocol? They're not on Twitch or YouTube...
Could you also add a small FPGA to run or manage the cluster?
Hi, Is there a video on the tool chain for this uC ? Cheers !
let's game on it! :D
I've got two questions:
1) what could this be used for?
2) for the waiting time after a collision, couldn't you use the ID itself as a delay? Or maybe force them to report in order, maybe using a master or calling the next one in line
if the collision detection is waiting a random time using the ID as the seed so they're always different, why not just use the ID as the amount of wait time directly?
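A small sketch of the trade-off behind this question: waiting exactly `ID` slots guarantees that nodes with unique IDs never tie again, but the wait grows with the ID and low IDs always go first. A random wait in a small window is shorter on average but can tie. The function names, the slot model and the window size are assumptions for illustration, not the video's actual implementation.

```python
import random


def id_backoff(colliders):
    """Each node waits as many slots as its ID; unique IDs never tie."""
    return {node: node for node in colliders}


def random_backoff(colliders, window, rng):
    """Each node waits a random number of slots in [0, window)."""
    return {node: rng.randrange(window) for node in colliders}


def first_clear_sender(delays):
    """Node with the shortest delay, or None if that delay is shared."""
    shortest = min(delays.values())
    winners = [node for node, delay in delays.items() if delay == shortest]
    return winners[0] if len(winners) == 1 else None


colliders = [200, 250]
fixed = id_backoff(colliders)
print("ID backoff:", first_clear_sender(fixed), "sends after", min(fixed.values()), "slots")

rng = random.Random(1)  # seeded only to make the demo repeatable
rand = random_backoff(colliders, window=16, rng=rng)
print("random backoff delays:", rand, "winner:", first_clear_sender(rand))
```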
I would mine so many moneroj with this
he's gone mad!
8:36 Holy rise time Batman!
The Signal Integrity engineer just started breaking out in a cold sweat
I have been planning to do this with sg2002s.
Despite missing half of the explanation because I have no idea what the terms used mean, I nevertheless found everything fascinating. For me, it's like our modern-day version of an art painting. Can you tell me which kind of university degree/knowledge/skills are necessary for such a project? And good job! 👍
Could u check the flux link? it also refers to the syringe page! thx!!
Your collision detection is very similar if not the same as CAN Bus collision detection (I need to check but I believe it’s at least close)
very gud 👍
😅 This is a kind of projects I really like watching, but I have a question, what can It really do beside some basic stuffs, anything like computing with a lot of cores ( that may be too hard 😮 ). In my opinion, this is an interesting project I love. Thank you for making the video, hope you have a great day 🎉🎉🎉
Exactly. I think that to blink a LED, some FPGA will beat x10 RISC-V by number of I/Os, speed, and by a price. It's interesting to have some idea what is it for, how much for one flop, what are alternatives in terms of a price, performance, so on.. It's like to build a cluster with Raspberry PIs, when you can take an i7 and save money and have much better performance.
I think this is an amazing achievement. I would love to see you demonstrate its speed with some SHA-1 cracking, or comparison testing against a Raspberry Pi 5 and a mid-range PC over a long duration (24 hours minimum) to see how far 17GHz can go in a day
And then add some what are currently at the moment quite inexpensive RAM and Storage for BIOS?
this is really incredible. i'd love to see a collab between you and @beneater !!! Really great work.
What might be some of the use cases for the Megacluster?
Why fiasco? I think there is hard work and there are clever workarounds in this project. Keep your head up!
how is software development with the ch32 ic? i have been thinking of trying them out, but the "sdk" (or examples) looked really scary...
What an epic project! Subscribed
Probably dumb question, but could the collision detection be replaced by a queueing system where an mcu can request the bus then get serviced fifo? Maybe that would be slower.
Would be cool to see a map-reduce algorithm running on this beast.
That becomes a latency/bandwidth tradeoff... if you can request large chunks of dedicated time, you can shift bytes out at full speed, while both collision detection and turnaround (the system settling after currents potentially change direction on the backplane, also peak EMI) inherently slow down the timing required. Many systems use a fast clock with added guard intervals, clock cycles where nobody drives the bus.
super..., so what is it for? what you can implement on it?
epic gaming
What’s the music playing during assembly at 4:13?
I'm curious, is there a benefit to this type of bus communication versus an established protocol like Ethernet, I2C, etc.? Is it just simplicity or speed? I'm a student and haven't studied this area yet.
This is more or less an "established protocol"... strobed 8-bit data is exactly what the old DB25 parallel port carried to printers, before EPP or ECP. Components speaking it can be found all over vintage and modern electronics, from I2C I/O expanders (with the "clock" pulsed automatically when the port is updated), to the Intel 8255 (designed for implementing this type of parallel port from either end), to the standard components in digital logic simulators like Logisim, Falstad, or CircuitMaker.
How are the timers and clock speed? Can it run 60Hz HDMI or VGA?
5:18 > We can hear you laughing. I like your enthusiasm
I feel like to avoid collisions it might be more consistent to simply have a synchronized counter across all the processors based on the clock, then index each processor uniquely, modulo the time counter by the number of processors, and then when the processor's number comes up from that modulo operator, you are allowed to send data. That way there is literally no way to ever have a collision (ideally). I'm sure it's a bit more complex than that. Of course, the amount of time it takes before any given processor can send data is anywhere between 0 and n, where n is the number of processors... Could be bad if there are a ton of processors... I suppose the random time approach is *potentially* faster, although any time I hear "random" in this sort of application I am a bit suspicious hahah.
The technique you're describing is called "Time-division multiplexing", or TDM for short. In its simplest form every node would get a fixed number of bytes to send during their turn, but a more complicated scheme could dynamically change the size of time windows depending on how much data a node reports it has available... moving bytes is the first step to building anything like that though.
Ah! Interesting! TIL hahah.
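The fixed-slot form of TDM described above fits in a few lines. Every node derives the current slot owner from a shared tick counter, so no two nodes ever transmit at once, at the cost of waiting up to n − 1 slots for a turn. The names and the 256-node count are assumptions for illustration, and the sketch takes a perfectly synchronized counter for granted.

```python
def slot_owner(tick, num_nodes):
    """ID of the node allowed to transmit during this tick."""
    return tick % num_nodes


def ticks_until_turn(node_id, tick, num_nodes):
    """How many ticks node `node_id` must wait from `tick` until its slot."""
    return (node_id - tick) % num_nodes


NUM_NODES = 256  # ASSUMPTION: one slot per 48 MHz core

for tick in (0, 1, 255, 256, 273):
    print(f"tick {tick}: node {slot_owner(tick, NUM_NODES)} may send")

# Worst case: a node that just missed its slot waits almost a full round.
print(ticks_until_turn(node_id=5, tick=6, num_nodes=NUM_NODES))
```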
Did you just create a pretty good random number generator with those blinking leds? Looks much cooler than those lava lamps
nice project, but i have one question. i want to try this chip CH32V003 but can i use other swd debugger or it need to be e-link debugger?
SWD is a 2 wire ARM variant of JTAG (normally 4 or 5 wires), unlikely to appear on non-ARM chips. CH32V003 uses a different 1 wire debug interface they call SWD or SDI.
3:30 LumenPnP when? My hand and eyes hurt just watching all that placement! ( I probably just have a low tolerance though lol)
If someone ports some RPC to this thing, it will be mind-blowing
How IO bound are your processors? How much is left for compute in a real app? A benchmark would be interesting, something simple, like generate a random number for each proc, use the bus to determine each proc has a unique random number, then sort the numbers by proc Id.
based on clock speed * cores, it would be roughly equal to a 4-core CPU at just over 3GHz in total compute power, excluding the overhead due to the bus and such.
that said this is mostly because the specific CPUs used are around 2 times as fast as an Arduino Uno, so not really fast per core: 0.048GHz aka 48MHz.
in a design like this it is however possible to use the same approach with much more powerful CPUs.
also it essentially is an open source CPU/computer with such performance, so quite great, since a few years ago this would have competed with the high end desktop CPUs, even though there are modern day RISC-V SBCs which are more powerful.
@ted_van_loon Your assumptions are classically erroneous. Because of things like Amdahl's Law and cache coherence this architecture would be quickly swamped with bus messaging. In addition, there is no way to partition a uniprocessor app to such an architecture unless it is designed to be highly multithreaded and there is an OS that supports it. The only reason modern chips function so well is ultra high speed on-chip switching fabric between cores and advanced OSes that support a high degree of parallelism. Still, they will never beat a uniprocessor with the same aggregate clock speed and memory.
@Sven_Dongle I know of those issues, I was referring to potential compute capabilities, which indeed is heavily multithreaded, so either using special software/compilers to make it more multithreaded, or using very well written software like Blender which actually can do multithreading very well.
I know how such a bus will easily get swamped, and it is a problem indeed, even if you get many buses. I actually once even designed a custom method for letting many chips communicate safely and fast without those issues (and without relying too much on proprietary communication technologies).
it was actually designed to let high speed SoCs communicate with each other, think about the chips you find on a Raspberry Pi, but then imagine such an SBC with many such chips on it.
while there are even better ways, the way I came up with was also designed to not require the SoCs to be customized for it to work, so that it could easily be made and used. it was actually meant for the Raspberry Pi Foundation for more compute heavy chips for industry and home servers, sadly they still don't use something like that yet (which is understandable since that actually still requires going through a stage of experimental software and drivers and such).
luckily they did however take the IO chip suggestion in order to scale down the processor nodesize.
0:50 imagine the reveal of an ai being sentient by the blinking lights getting faster and faster and then it stops, and just starts showing text?
how many GBit/s is your outside communication? Do you use QSFP??
I would love to see a super cluster with the cluster and a 2040 I/O chip?
so i only understood like 20% of this. what is this for? whats the computing power? how does it compare to a "modern" equivalent?
So basically a parallel version of I2C?
What is MounRiver Studio?
now can this thing actually process data like a cluster?
Nice e-waste ! Congrats !
❤❤❤
I remember you had a pick'n'place machine, what happened to it?