Open Assistant Inference Backend Development (Hands-On Coding)
- Added 23 Feb 2023
- #ai #huggingface #coding
Join me as I build streaming inference into the Hugging Face text generation server, going through CUDA, Python, Rust, gRPC, WebSockets, server-sent events, and more...
Original repo is here: github.com/huggingface/text-g...
OpenAssistant repo is here: github.com/LAION-AI/Open-Assi... (see inference/)
Check out www.wandb.courses/ for free MLOps courses!
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Incredibly useful stuff! Sadly, it probably won't generate crazy views, but for people who want to learn how to build stuff, and not just train models all day, this provides so much condensed knowledge. Thank you, man!
I want more coding videos; the more depth, the less I feel completely lost and left out. Frequent updates are very nice whether or not they are major. For the YouTube algo you can just pretend they are major.
Thank you so much for these development videos!
They're really motivating to start working on contributing to the project.
Incredible work! Thank you and all the contributors. 💜✌🏼
Awesome work. Thank you for this. An emphatic yes to your question regarding making more of these types of videos down in the plumbing.
Oh cool, been wondering about the inference side of things
Yannic, the open-assistant project is amazing! I will help collect data in every free minute I have.
hi yannick, I rarely comment, but I have to give you credit here for your video and the project behind it - walking through the development with a practical example is real added value and very well executed
These advanced coding videos are helpful
LFG OpenAssistant, hype is real
awesome work!
Great video Yannic, today I was able to learn a lot about designing inference pipelines, but I also learned that you watch anime
Awesome! I wonder if streaming would also make sense for search applications...
Yes. more coding videos!
Amazing 👏🏻
Shut up and take my support 💙
Thank you
super awesome. Pretty amazing idea to have a "Folding@Home" style community-hosted ChatGPT.
Hey, nice video Yannic. Why does the Python server only generate one token at a time instead of producing the whole sequence directly? I could see how repeated calls to the model would be slower than requesting the whole sequence at once.
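For reference, here is roughly what token-by-token generation looks like; a minimal sketch with a Hugging Face causal LM (the model name is just a placeholder, not necessarily how the server itself is implemented). Generating one token per forward pass is exactly what lets the server stream partial output to the user right away instead of waiting for the whole sequence:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical small model purely for illustration; the real server loads whatever model it serves.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits                      # forward pass over the current prefix
    next_id = int(logits[0, -1].argmax())                     # greedy choice of the next token
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
    print(tokenizer.decode([next_id]), end="", flush=True)    # each token can be streamed out immediately
print()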
I love this content. Is it possible that we can recreate what you have done on a totally private server (including the inference api) and without depending on any cloud service?
this is amazing! I would only wish to ask for timestamps
Wow!
What's happening in the AI/ML world today? 😮 Yannic will keep me updated. 🥳😎👍
LLaMA just got released! What does everyone think, could it be used as OpenAssistant backend?
Is it out? No wait, this is still cool.
You could check the git log or look for tags.
😍
Apparently a 13B LLaMA is more capable than ChatGPT. I would like to see whether, with sparsity and distillation, Open Assistant can beat ChatGPT with even fewer parameters. Maybe 1B?
I have tried it without finetuning; it is not as good.
13B is pretty good for probably many use cases, but I think ChatGPT will just win due to the pure power they have backing it. The problem is that the power comes with high $$$, so you have to balance what you are willing to spend vs. accuracy. I have 13B LLaMA 2 running on my local PC on an RTX 3080 and it does very well, but it's not fast and can sometimes get a little weird.
Oh wow, I was just wondering about splitting the AI into local/server portions in the Discord.
Interesting work, Yannic, but wouldn't it be easier to use a Hugging Face pipeline and iterate autoregressively, requesting one token at a time and extending the prompt by one token at a time? If this is a personal assistant, why do you need a Redis server and remote procedure calls? Are you building a web server for the public to use? Will this live on more than one system?
I think the overall goal is to have it be more distributed but I could be wrong.
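For context, a minimal sketch of the queue pattern that Redis enables here: a lightweight API process pushes prompts onto a list, and GPU workers (possibly on other machines) pop them off. The key names and payload shape below are made up for illustration, not the project's actual schema:
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def submit(prompt: str) -> None:
    # API-facing process: enqueue the job (no GPU needed on this machine)
    r.rpush("work_queue", json.dumps({"prompt": prompt}))

def worker_loop() -> None:
    # GPU worker, possibly on a different machine: block until a job arrives
    while True:
        _, raw = r.blpop("work_queue")
        job = json.loads(raw)
        print("would run inference on:", job["prompt"])
        # ... run the model and publish tokens back, e.g. on a pub/sub channel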
Won't having the request in the data class not matter? The request still exists, and Python doesn't copy the object, does it? I thought it was ref counted.
Great video. Thanks! One quick tip: instead of navigating the entire Explorer tree, Shift+F12 will help to find all related folders, functions, classes etc. in VS Code. Just highlight the part you want to search and hit Shift+F12. More shortcuts: czcams.com/video/dI34jrEtmB0/video.html...
Why using Redis and not RabbitMQ?
What is the music at 10:30?
Would my Intel GPU work?
Use Julia. It’s easy like Python, Fast like Rust, and great for AI
😸🐣
lul 'websocket is very modern'. It is basically just a tweak on winsock (early 90s) which was based on berkeley socket from 1983!
😲
SSE?
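SSE = Server-Sent Events: a plain HTTP response held open, with each chunk prefixed by "data: ", which the video uses to push tokens to the browser. A minimal sketch of streaming tokens that way, assuming FastAPI and uvicorn (the actual repo may wire it up differently):
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream():
    for token in ["Hello", " ", "world", "!"]:
        yield f"data: {token}\n\n"          # SSE framing: "data: <payload>" followed by a blank line
        await asyncio.sleep(0.1)            # stand-in for per-token generation latency

@app.get("/stream")
async def stream():
    return StreamingResponse(token_stream(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)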
Why does your traceback look so nice??
Found it. It's a library called rich.
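For anyone else wondering, enabling it is a one-liner via rich's documented traceback hook (toy example):
from rich.traceback import install

install(show_locals=True)   # uncaught exceptions are now rendered with syntax highlighting and local variables

1 / 0                       # trigger an error just to see the pretty traceback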
These pipelines are rather convoluted.
Competition.
like_count_for_this_comment = 0
while True:
    for like in range(500):
        like_count_for_this_comment += 1
        print("Thank you Yannic")
    if like_count_for_this_comment == 500:
        print("I guess I have thanked you enough...")
        print("Even though I didn't understand shit about advanced programming :)")
        print("Keep up the good work!")
        break
What’s with the editing
Are you pretty much done with reviewing papers? There have been a few really good papers about reducing parameter counts without compromising performance using MultiModal models.
5:02 The model can't be written in Python, it's only declared in Python; it's probably TensorFlow or something under the hood.
When I needed to use DeepSpeech 2, I extracted the pre-compiled TensorFlow model and used it directly from C++, as I didn't want to embed Python into my library.
To 'embed Python' you would have to load the entire interpreter into your runtime and deal with Python's inherent limitations on executing efficiently in a multithreaded environment.
@@Sven_Dongle yeah, but that was not really the problem, as most of PyTorch is really control flow; TensorFlow itself runs directly on the hardware.
The thing is loading the precompiled model, so Python becomes kind of unnecessary. So there's that.
The real problem was a conflict of dependencies between what was needed to run Python and TensorFlow and the rest of my architecture.
So I started stripping down dependencies; Python had to go.
I also ended up moving from Bazel to Buck, because I was translating bzl/Python to CMake.
I could have gone for TensorFlow Lite, but I really didn't want to debug a machine learning model.
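The same "export the model so Python isn't needed at runtime" idea, sketched here with PyTorch/TorchScript rather than TensorFlow (an analogous path, closer to the video's stack; the tiny model is purely illustrative):
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
traced = torch.jit.trace(model, torch.randn(1, 4))   # record the computation graph for an example input
traced.save("tiny_model.pt")                          # loadable from C++ via libtorch, no Python interpreter needed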
lol dude
let me know if you need help lmao
The architecture looks too convoluted for passing chars around. Maybe it's a compromise between speed of development and good architecture, but it looks like it is becoming a mess. Cheers and thanks for the video, Yannic
yea honestly the pythonification and rustification of AI has made it hard to break into for a lot of people. Rust ain't so bad, but Python. Geez. Like powering a supercomputer with a potato
Explain, please
@@LucasDimoveo It can't even do multithreading. Python is forever cursed by the global interpreter lock; it can never scale to take advantage of modern multicore processors except by interfacing with native runtime libraries.
@@Sven_Dongle yes, and as a consequence the performance scores are abysmal. Python is great for people who are new to programming, or who only need programming as a tool. But clearly we are in need of a unified core that works for all languages; TensorFlow sadly does not cut it
@@ILLUMINATED-1 There are ways to make Python faster. Way faster. Yes, it will never be as fast as something written entirely in C but still, the ease of use and speed of development make it very attractive for scientific tasks, people say. Some Cython and Numba and most problems are gone. The GIL becomes... a little pebble in the shoe.
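For the curious, a minimal sketch of the Numba route mentioned above: a hot numeric loop compiled to machine code and parallelized across CPU cores, so the GIL stops being the bottleneck (toy example, not from the video):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):      # prange: iterations are distributed across CPU threads
        out[i] = x[i].sum()
    return out

print(row_sums(np.random.rand(1000, 1000))[:5])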
This all sounds like such an over-engineered and prematurely optimized solution. Why bother with huggingface and batching and stuff? I would kind of understand if you were building a closed/paid service, but OpenAssistant is supposed to be open source, right? So most people would self host it.
In my experience, "online" batching during inference is very rarely worth the engineering effort.
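For context, a rough sketch of the "online batching" idea under discussion: requests arriving within a short time window are grouped so one forward pass serves several users. Queue names, timings, and the omitted model call are all made up for illustration, not the project's actual code:
import queue
import threading
import time

requests_q: "queue.Queue[str]" = queue.Queue()

def batching_loop(max_batch: int = 8, max_wait: float = 0.05) -> None:
    while True:
        batch = [requests_q.get()]                      # block until at least one request exists
        deadline = time.time() + max_wait
        while len(batch) < max_batch and time.time() < deadline:
            try:
                batch.append(requests_q.get(timeout=max(deadline - time.time(), 0.0)))
            except queue.Empty:
                break
        print(f"one forward pass over a batch of {len(batch)} prompts")

threading.Thread(target=batching_loop, daemon=True).start()
for i in range(20):
    requests_q.put(f"prompt {i}")
    time.sleep(0.01)
time.sleep(0.2)                                         # give the batcher time to drain the queue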
Part of the reason could be that the next step of training will require a web server that runs inference on the model to generate output that humans can rank in order of quality.
He mentioned that people will be able to host their own thing, but I think there will certainly be massive multi-user hosts of this app. Also, most people don't even have the required hardware to run big LLMs.
Maybe because large language models are too large for consumer machines.
The large version of BLOOM requires something like 640 gigabytes of RAM.
@@joshuascholar3220 it takes 640GB if you want to train it. We are talking about inference here. You don't even really need a GPU for inference.
Can you run it on a phone?
Yuck, why not use languages that can do real multithreading and scale, like Java and C++?
@Chirpy Banana Must have been designed by a polyamorous black muslim quadriplegic Aspergers sufferer.
No one is stopping you from creating your own version of this with your preferred language
@@demolicous Never said anyone was, midwit.
@@Sven_Dongle I believe demolicious' point was "You can write it yourself if you don't like it".
@@fluffio2976 Ah, good, I needed another microbrain to state that more plainly. Your fellow traveler dooshbags salute you.