Open Assistant Inference Backend Development (Hands-On Coding)
- Added 23 Feb 2023
- #ai #huggingface #coding
Join me as I build streaming inference into the Hugging Face text generation server, going through CUDA, Python, Rust, gRPC, WebSockets, server-sent events, and more...
Original repo is here: github.com/huggingface/text-g...
OpenAssistant repo is here: github.com/LAION-AI/Open-Assi... (see inference/)
Check out www.wandb.courses/ for free MLOps courses!
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Incredibly useful stuff! Sadly, it probably won't generate crazy views, but for people who want to learn how to build stuff, and not just train models all day, this provides so much condensed knowledge. Thank you, man!
I want more coding videos; the more depth, the less I feel completely lost and left out. Frequent updates are very nice whether or not they are major. For the YouTube algo you can just pretend they are major.
Thank you so much for these development videos!
They're really motivating to start working on contributing to the project.
Incredible work! Thank you and all the contributors. 💜✌🏼
Awesome work. Thank you for this. An emphatic yes to your question regarding making more of these types of videos down in the plumbing.
Oh cool, been wondering about the inference side of things
Yannic, the open-assistant project is amazing! I will help collect data in every free minute I have.
hi yannick, I rarely comment, but I have to give you credit here for your video and the project behind it - walking through the development with a practical example is real added value and very well executed
These advanced coding videos are helpful
LFG OpenAssistant, hype is real
awesome work!
Great video Yannic, today I was able to learn a lot about designing inference pipelines, but I also learned that you watch anime
Awesome! I wonder if streaming would also make sense for search applications...
Yes. more coding videos!
Amazing 👏🏻
Shut up and take my support 💙
Thank you
super awesome. Pretty amazing idea to have a "Folding@Home" style community-hosted ChatGPT.
Hey, nice video Yannic. Why does the Python server only generate one token at a time instead of producing the whole sequence directly? I could see how repeated calls to the model would be slower than requesting the whole sequence at once.
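For reference, here is roughly what token-by-token generation looks like; a minimal sketch with a Hugging Face causal LM (the model name is just a placeholder, not necessarily how the server itself is implemented). Generating one token per forward pass is exactly what lets the server stream partial output to the user right away instead of waiting for the whole sequence:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical small model purely for illustration; the real server loads whatever model it serves.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits                      # forward pass over the current prefix
    next_id = int(logits[0, -1].argmax())                     # greedy choice of the next token
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
    print(tokenizer.decode([next_id]), end="", flush=True)    # each token can be streamed out immediately
print()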
I love this content. Is it possible that we can recreate what you have done on a totally private server (including the inference api) and without depending on any cloud service?
this is amazing! I would only wish to ask for timestamps
Wow!
What's happening in the AI/ML world today? 😮 Yannic will keep me updated. 🥳😎👍
LLaMA just got released! What does everyone think, could it be used as OpenAssistant backend?
Is it out? No wait, this is still cool.
You could check the git log or look for tags.
😍
Apparently a 13B LLaMA is more capable than ChatGPT. I would like to see whether, with sparsity and distillation, Open Assistant can beat ChatGPT with even fewer parameters. Maybe 1B?
I have tried it without finetuning; it is not as good.
13B is pretty good for probably many use cases, but I think ChatGPT will just win due to the pure power they have backing it. The problem is that the power comes with high $$$, so you have to balance what you are willing to spend vs. accuracy. I have 13B LLaMA 2 running on my local PC on an RTX 3080 and it does very well, but it's not fast and can sometimes get a little weird.
Oh wow, I was just wondering about splitting the AI into local/server portions in the Discord.
Interesting work, Yannic, but wouldn't it be easier to use a Hugging Face pipeline and iterate autoregressively, requesting one token at a time and extending the prompt by one token at a time? If this is a personal assistant, why do you need a Redis server and remote procedure calls? Are you building a web server for the public to use? Will this live on more than one system?
I think the overall goal is to have it be more distributed but I could be wrong.
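For context, a minimal sketch of the queue pattern that Redis enables here: a lightweight API process pushes prompts onto a list, and GPU workers (possibly on other machines) pop them off. The key names and payload shape below are made up for illustration, not the project's actual schema:
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def submit(prompt: str) -> None:
    # API-facing process: enqueue the job (no GPU needed on this machine)
    r.rpush("work_queue", json.dumps({"prompt": prompt}))

def worker_loop() -> None:
    # GPU worker, possibly on a different machine: block until a job arrives
    while True:
        _, raw = r.blpop("work_queue")
        job = json.loads(raw)
        print("would run inference on:", job["prompt"])
        # ... run the model and publish tokens back, e.g. on a pub/sub channel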
Won't having the request in the data class not matter? The request still exists, and Python doesn't copy the object, does it? I thought it was ref counted.
Great video. Thanks! One quick tip: instead of navigating the entire Explorer tree, Shift+F12 will help to find all related folders, functions, classes etc. in VS Code. Just highlight the part you want to search and hit Shift+F12. More shortcuts: czcams.com/video/dI34jrEtmB0/video.html...
Why using Redis and not RabbitMQ?
What is the music at 10:30?
Would my Intel GPU work?
Use Julia. It’s easy like Python, Fast like Rust, and great for AI
😸🐣
lul 'websocket is very modern'. It is basically just a tweak on winsock (early 90s) which was based on berkeley socket from 1983!
😲
SSE?
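SSE = Server-Sent Events: a plain HTTP response held open, with each chunk prefixed by "data: ", which the video uses to push tokens to the browser. A minimal sketch of streaming tokens that way, assuming FastAPI and uvicorn (the actual repo may wire it up differently):
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream():
    for token in ["Hello", " ", "world", "!"]:
        yield f"data: {token}\n\n"          # SSE framing: "data: <payload>" followed by a blank line
        await asyncio.sleep(0.1)            # stand-in for per-token generation latency

@app.get("/stream")
async def stream():
    return StreamingResponse(token_stream(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)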
Why does your traceback look so nice??
Found it. It's a library called rich.
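For anyone else wondering, enabling it is a one-liner via rich's documented traceback hook (toy example):
from rich.traceback import install

install(show_locals=True)   # uncaught exceptions are now rendered with syntax highlighting and local variables

1 / 0                       # trigger an error just to see the pretty traceback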
These pipelines are rather convoluted.
Competition.
like_count_for_this_comment = 0
while True:
    for like in range(500):
        like_count_for_this_comment += 1
        print("Thank you Yannic")
    if like_count_for_this_comment == 500:
        print("I guess I have thanked you enough...")
        print("Even though I didn't understand shit about advanced programming :)")
        print("Keep up the good work!")
        break
What’s with the editing
Are you pretty much done with reviewing papers? There have been a few really good papers about reducing parameter counts without compromising performance using MultiModal models.
5:02 The model can't be written in Python, it's only declared in Python; it's probably TensorFlow or something under the hood.
When I needed to use DeepSpeech 2, I extracted the pre-compiled TensorFlow model and used it directly from C++, as I didn't want to embed Python into my library.
To 'embed Python' you would have to load the entire interpreter into your runtime and deal with Python's inherent limitations on executing efficiently in a multithreaded environment.
@@Sven_Dongle yeah, but that was not really the problem, as most of PyTorch is really control flow; TensorFlow itself runs directly on the hardware.
The thing is loading the precompiled model, so Python becomes kind of unnecessary. So there's that.
The real problem was a conflict of dependencies between what was needed to run Python and TensorFlow and the rest of my architecture.
So I started stripping down dependencies; Python had to go.
I also ended up moving from Bazel to Buck, because I was translating bzl/Python to CMake.
I could have gone for TensorFlow Lite, but I really didn't want to debug a machine learning model.
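The same "export the model so Python isn't needed at runtime" idea, sketched here with PyTorch/TorchScript rather than TensorFlow (an analogous path, closer to the video's stack; the tiny model is purely illustrative):
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
traced = torch.jit.trace(model, torch.randn(1, 4))   # record the computation graph for an example input
traced.save("tiny_model.pt")                          # loadable from C++ via libtorch, no Python interpreter needed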
lol dude
let me know if you need help lmao
The architecture looks too convoluted for passing chars around. Maybe it's a compromise between speed of development and good architecture, but it looks like it is becoming a mess. Cheers and thanks for the video, Yannic
yea honestly the pythonification and rustification of AI has made it hard to break into for a lot of people. Rust ain't so bad, but Python. Geez. Like powering a supercomputer with a potato
Explain, please
@@LucasDimoveo It can't even do multithreading. Python is forever cursed by the global interpreter lock; it can never scale to take advantage of modern multicore processors except by interfacing with native runtime libraries.
@@Sven_Dongle yes, and as a consequence the performance scores are abysmal. Python is great for people who are new to programming, or who only need programming as a tool. But clearly we are in need of a unified core that works for all languages; TensorFlow sadly does not cut it
@@ILLUMINATED-1 There are ways to make Python faster. Way faster. Yes, it will never be as fast as something written entirely in C but still, the ease of use and speed of development make it very attractive for scientific tasks, people say. Some Cython and Numba and most problems are gone. The GIL becomes... a little pebble in the shoe.
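For the curious, a minimal sketch of the Numba route mentioned above: a hot numeric loop compiled to machine code and parallelized across CPU cores, so the GIL stops being the bottleneck (toy example, not from the video):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(x):
    out = np.empty(x.shape[0])
    for i in prange(x.shape[0]):      # prange: iterations are distributed across CPU threads
        out[i] = x[i].sum()
    return out

print(row_sums(np.random.rand(1000, 1000))[:5])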
This all sounds like such an over-engineered and prematurely optimized solution. Why bother with huggingface and batching and stuff? I would kind of understand if you were building a closed/paid service, but OpenAssistant is supposed to be open source, right? So most people would self host it.
In my experience, "online" batching during inference is very rarely worth the engineering effort.
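For context, a rough sketch of the "online batching" idea under discussion: requests arriving within a short time window are grouped so one forward pass serves several users. Queue names, timings, and the omitted model call are all made up for illustration, not the project's actual code:
import queue
import threading
import time

requests_q: "queue.Queue[str]" = queue.Queue()

def batching_loop(max_batch: int = 8, max_wait: float = 0.05) -> None:
    while True:
        batch = [requests_q.get()]                      # block until at least one request exists
        deadline = time.time() + max_wait
        while len(batch) < max_batch and time.time() < deadline:
            try:
                batch.append(requests_q.get(timeout=max(deadline - time.time(), 0.0)))
            except queue.Empty:
                break
        print(f"one forward pass over a batch of {len(batch)} prompts")

threading.Thread(target=batching_loop, daemon=True).start()
for i in range(20):
    requests_q.put(f"prompt {i}")
    time.sleep(0.01)
time.sleep(0.2)                                         # give the batcher time to drain the queue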
Part of the reason could be that the next step of training will require a web server that runs inference on the model to generate output that humans can rank in order of quality.
He mentioned that people will be able to host their own thing, but I think there will certainly be massive multi-user hosts of this app. Also, most people don't even have the required hardware to run big LLMs.
Maybe because large language models are too large for consumer machines.
The large version of BLOOM requires something like 640 gigabytes of RAM.
@@joshuascholar3220 it takes 640GB if you want to train it. We are talking about inference here. You don't even really need a GPU for inference.
Can you run it on a phone?
Yuck, why not use languages that can do real multithreading and scale, like Java and C++?
@Chirpy Banana Must have been designed by a polyamorous black muslim quadriplegic Aspergers sufferer.
No one is stopping you from creating your own version of this with your preferred language
@@demolicous Never said anyone was, midwit.
@@Sven_Dongle I believe demolicious' point was "You can write it yourself if you don't like it".
@@fluffio2976 Ah, good, I needed another microbrain to state that more plainly. Your fellow traveler dooshbags salute you.