Open Assistant Inference Backend Development (Hands-On Coding)

  • Uploaded Feb 23, 2023
  • #ai #huggingface #coding
    Join me as I build streaming inference into the Hugging Face text generation server, going through CUDA, Python, Rust, gRPC, WebSockets, server-sent events, and more...
    Original repo is here: github.com/huggingface/text-g...
    OpenAssistant repo is here: github.com/LAION-AI/Open-Assi... (see inference/)
    Check out www.wandb.courses/ for free MLOps courses!
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 85

  • @panzach 1 year ago +9

    Incredibly useful stuff! Sadly, this probably won't generate crazy views, but for people who actually want to learn how to build things, and not just train models all day, it provides so much condensed knowledge. Thank you, man!

  • @flxp 1 year ago +25

    I want more coding videos. The more depth they have, the less I feel completely lost and left out. Frequent updates are very nice whether or not they're major; for the YouTube algorithm you can just pretend they are major.

  • @firefly777a8 1 year ago +32

    Thank you so much for these development videos!
    They're really motivating to start working on contributing to the project.

  • @sozenoz 1 year ago +28

    Incredible work! Thank you and all the contributors. 💜✌🏼

  • @johnnypeck 1 year ago

    Awesome work. Thank you for this. An emphatic yes to your question regarding making more of these types of videos down in the plumbing.

  • @andrewm4894 1 year ago +7

    Oh cool, been wondering about the inference side of things

  • @user-ey2vv1dl3n 1 year ago +2

    Yannic, the Open Assistant project is amazing! I will help collect data in every free minute.

  • @dingdong6332 1 year ago +1

    Hi Yannick, I rarely comment, but I have to give you praise here for your video and the project behind it. Working through the development process with a practical example provides real added value, and it is very well executed.

  • @adithyashetty8717 1 year ago

    These advanced coding videos are helpful

  • @d33w 1 year ago +3

    LFG OpenAssistant, hype is real

  • @oncedidactic 1 year ago

    awesome work!

  • @adityay525125 1 year ago

    Great video Yannic. Today I learned a lot about designing inference pipelines, but I also learned that you watch anime.

  • @florianhonicke5448 1 year ago +3

    Awesome! I wonder if streaming would also make sense for search applications...

  • @arthdh5222 1 year ago

    Yes, more coding videos!

  • @vivienseguy 1 year ago

    Amazing 👏🏻

  • @Theguywithspectacles 1 year ago +3

    Shut up and take my support 💙

  • @axtaxt2964 1 year ago

    Thank you

  • @KastanDay 1 year ago +1

    Super awesome. Pretty amazing idea to have a "Folding@Home"-style community-hosted ChatGPT.

  • @manuelplank5406 1 year ago +3

    Hey, nice video Yannic. Why does the Python server only generate one token at a time instead of producing the whole sequence directly? I would expect repeated calls to the model to be slower than requesting the whole sequence at once.
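
    A causal LM can only ever produce one token per forward pass, since each new token is conditioned on everything before it; exposing that per-token loop is also what makes streaming (and batching across users) possible. A minimal sketch of such a loop with transformers, using "gpt2" as a stand-in model rather than anything the actual server runs:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def stream_tokens(prompt, max_new_tokens=20):
        # Yield one decoded token at a time, the way a streaming server would.
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        past = None  # KV cache: later steps only process the newest token
        for _ in range(max_new_tokens):
            with torch.no_grad():
                out = model(input_ids if past is None else input_ids[:, -1:],
                            past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy
            input_ids = torch.cat([input_ids, next_id], dim=-1)
            yield tokenizer.decode(next_id[0])
            if next_id.item() == tokenizer.eos_token_id:
                break

    for token in stream_tokens("The quick brown fox"):
        print(token, end="", flush=True)

    With the cache, each step is a single-token forward pass, so the repeated calls cost little more than generating the whole sequence in one go.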

  • @glorified3142 11 months ago

    I love this content. Is it possible to recreate what you have done on a totally private server (including the inference API), without depending on any cloud service?

  • @sortysciaofiscia 1 year ago +5

    This is amazing! My only wish would be timestamps.

  • @marilynlucas5128 1 year ago

    Wow!

  • @digitalcontent1870 1 year ago

    What's happening in the AI/ML world today? 😮 Yannic will keep me updated. 🥳😎👍

  • @MenkoDany 1 year ago +4

    LLaMA just got released! What does everyone think, could it be used as the OpenAssistant backend?

  • @jackryan8588 1 year ago

    Is it out? No wait, this is still cool.

  • @productprofessor6080 1 year ago

    😍

  • @vitorbortolin6810 1 year ago +11

    Apparently the 13B LLaMA is more capable than ChatGPT. I would like to see whether, with sparsity and distillation, Open Assistant can beat ChatGPT with even fewer parameters. Maybe 1B?

    • @quantumjun 1 year ago +3

      I have tried it without finetuning; it is not as good.

    • @Thrashmetalman 8 months ago

      13B is probably good enough for many use cases, but I think ChatGPT will just win due to the sheer compute they have backing it. The problem is that the power comes at a high cost, so you have to balance what you are willing to spend against accuracy. I have 13B LLaMA 2 running on my local PC on an RTX 3080 and it does very well, but it's not fast and can sometimes get a little weird.

  • @joe_limon 1 year ago +3

    Oh wow, I was just wondering in the Discord about splitting the AI into local/server portions.

  • @dr.mikeybee 1 year ago +2

    Interesting work, Yannic, but wouldn't it be easier to use a Hugging Face pipeline and iterate autoregressively, requesting one token at a time and extending the prompt by one token each step? If this is a personal assistant, why do you need a Redis server and remote procedure calls? Are you building a web server for the public to use? Will this live on more than one system?

    • @Thrashmetalman 8 months ago

      I think the overall goal is to have it be more distributed, but I could be wrong.
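
    A sketch of the simpler loop @dr.mikeybee describes above, using the high-level generate() API; "gpt2" is again just a stand-in model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Extend the prompt by one token per generate() call.
    input_ids = tokenizer("Hello, world", return_tensors="pt").input_ids
    for _ in range(10):
        input_ids = model.generate(input_ids, max_new_tokens=1,
                                   pad_token_id=tokenizer.eos_token_id)
        print(tokenizer.decode(input_ids[0, -1].item()), end="", flush=True)

    This works for a single user, but without a shared cache it re-runs the whole prefix on every call; the Redis queue and gRPC in the video exist because OpenAssistant is being built as a multi-user web service rather than a single-machine personal assistant.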

  • @suvalaki 1 year ago

    Won't having the request in the dataclass not matter? The request still exists, and Python doesn't copy the object, does it? I thought it was refcounted.
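
    The intuition is right: Python assignment never copies. A dataclass field just holds another reference to the same object, which stays alive as long as any reference to it does. A tiny illustration with made-up names:

    from dataclasses import dataclass

    class Request:
        def __init__(self):
            self.cancelled = False

    @dataclass
    class QueueEntry:
        request: Request  # stores a reference, not a copy

    req = Request()
    entry = QueueEntry(request=req)
    req.cancelled = True
    print(entry.request is req)     # True: the same object
    print(entry.request.cancelled)  # True: mutation is visible through both names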

  • @DrJohnnyStalker 1 year ago +2

    Great video, thanks! One quick tip: instead of navigating the entire Explorer tree, Shift+F12 will help you find all related functions, classes, etc. in VS Code. Just highlight the part you want to search and hit Shift+F12. More shortcuts: czcams.com/video/dI34jrEtmB0/video.html...

  • @dominicbaumer6038 1 year ago

    Why use Redis and not RabbitMQ?

  • @HUEHUEUHEPony 1 year ago

    What is the music at 10:30?

  • @Veptis 1 year ago

    Would my Intel GPU work?

  • @NickWindham 1 year ago

    Use Julia. It's easy like Python, fast like Rust, and great for AI.

  • @LordNementon 1 year ago

    😸🐣

  • @Idiomatick 1 year ago

    Lol, "WebSocket is very modern". It is basically just a tweak on Winsock (early '90s), which was based on Berkeley sockets from 1983!

  • @michaelparis6039 1 year ago

    😲

  • @greendsnow 1 year ago

    SSE?

  • @azmilenario 1 year ago

    Why does your traceback look so nice??

    • @azmilenario 1 year ago +2

      Found it. It's a library called rich.
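
    For anyone else wondering, rich can take over traceback rendering with one call (a minimal sketch; show_locals is optional):

    from rich.traceback import install

    install(show_locals=True)  # replaces sys.excepthook with rich's renderer

    1 / 0  # now rendered as a syntax-highlighted traceback with local variables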

  • @zyxwvutsrqponmlkh 1 year ago

    These pipelines are rather convoluted.

  • @shadow.banned 1 year ago

    Competition.

  • @bloodywolftr 1 year ago

    like_count_for_this_comment = 0
    while True:
        for like in range(500):
            like_count_for_this_comment += 1
            print("Thank you Yannic")
        if like_count_for_this_comment == 500:
            print("I guess I have thanked you enough...\n"
                  "Even though I didn't understand shit about advanced programming :)")
            print("\nKeep up the good work!")
            break

  • @BombDog_ 1 year ago

    What's with the editing?

  • @halocemagnum8351 1 year ago

    Are you pretty much done with reviewing papers? There have been a few really good papers about reducing parameter counts without compromising performance using multimodal models.

  • @monad_tcp 1 year ago

    5:02 The model can't be written in Python, it's only declared in Python; it's probably TensorFlow or something under the hood.
    When I needed to use DeepSpeech 2, I extracted the pre-compiled TensorFlow model and used it directly from C++, as I didn't want to embed Python into my library.

    • @Sven_Dongle 1 year ago

      To "embed Python" you would have to load the entire interpreter into your runtime and deal with Python's inherent limitations at executing efficiently in a multithreaded environment.

    • @monad_tcp 1 year ago +1

      @@Sven_Dongle Yeah, but that was not really the problem. Most of PyTorch is really control flow; TensorFlow itself runs directly on hardware.
      The point is that once you load the precompiled model, Python becomes kind of unnecessary. So there's that.
      The real problem was a conflict of dependencies between what was needed to run Python and TensorFlow and the rest of my architecture.
      So I started stripping down dependencies, and Python had to go.
      I also ended up moving from Bazel to Buck, because I was translating bzl/Python to CMake.
      I could have gone for TensorFlow Lite, but I really didn't want to debug a machine learning model.

  • @1PercentPure 1 year ago

    lol dude
    let me know if you need help lmao

  • @StoianAtanasov 1 year ago

    The architecture looks too convoluted for passing characters around. Maybe it's a compromise between speed of development and good architecture, but it looks like it is becoming a mess. Cheers and thanks for the video, Yannic.

  • @ILLUMINATED-1 1 year ago +1

    Yeah, honestly the Pythonification and Rustification of AI has made it hard to break into for a lot of people. Rust ain't so bad, but Python? Geez. Like powering a supercomputer with a potato.

    • @LucasDimoveo 1 year ago +1

      Explain, please

    • @Sven_Dongle 1 year ago +1

      @@LucasDimoveo It can't even do real multithreading. Python is forever cursed by the global interpreter lock; it can never scale to take advantage of modern multicore processors except by interfacing with native runtime libraries.

    • @ILLUMINATED-1 1 year ago +1

      @@Sven_Dongle Yes, and as a consequence the performance numbers are abysmal. Python is great for people who are new to programming or only need programming as a tool, but clearly we need a unified core that works across languages. TensorFlow sadly does not cut it.

    • @axolotron1298 1 year ago

      @@ILLUMINATED-1 There are ways to make Python faster. Way faster. Yes, it will never be as fast as something written entirely in C, but still, the ease of use and speed of development make it very attractive for scientific tasks, or so people say. Some Cython and Numba and most problems are gone. The GIL becomes... a little pebble in the shoe.
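
    A small self-contained way to see the trade-off discussed in this thread: the same CPU-bound function run in a thread pool (serialized by the GIL) versus a process pool (true parallelism). Timings vary by machine:

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        # Pure-Python CPU-bound work; holds the GIL the whole time.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        work = [5_000_000] * 4
        for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
            start = time.perf_counter()
            with pool_cls(max_workers=4) as pool:
                list(pool.map(burn, work))
            # Threads finish in roughly serial time; processes split the work.
            print(pool_cls.__name__, round(time.perf_counter() - start, 2), "s")

    NumPy, Cython's nogil blocks, and the CUDA kernels behind PyTorch all release the GIL during native work, which is how ML servers get away with Python at the top.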

  • @ruroruro 1 year ago +7

    This all sounds like such an over-engineered and prematurely optimized solution. Why bother with Hugging Face and batching and all that? I would kind of understand if you were building a closed/paid service, but OpenAssistant is supposed to be open source, right? So most people would self-host it.
    In my experience, "online" batching during inference is very rarely worth the engineering effort.

    • @stephengriffith140 1 year ago +1

      Part of the reason could be that the next step of training will require a web server that runs inference on the model to generate output that humans can rank in order of quality.

    • @alpers.2123 1 year ago

      He mentioned that people will be able to host their own instance, but I think there will certainly be massive multi-user hosts of this app. Also, most people don't even have the hardware required to run big LLMs.

    • @joshuascholar3220 1 year ago +1

      Maybe because large language models are too large for consumer machines.
      The large version of BLOOM requires something like 640 gigabytes of RAM.

    • @ruroruro 1 year ago

      @@joshuascholar3220 it takes 640GB if you want to train it. We are talking about inference here. You don't even really need a GPU for inference.

    • @alpers.2123 1 year ago

      Can you run it on a phone?
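
    For context on the "online batching" debate: the idea in servers like text-generation-inference is to wait a few milliseconds so that concurrent requests can share one forward pass, since GPU throughput scales well with batch size. A toy sketch of the pattern; all names and timings are illustrative, not the actual server code:

    import queue
    import threading
    import time

    requests = queue.Queue()  # incoming prompts from many users

    def run_model(batch):
        # Stand-in for one batched forward pass over all collected prompts.
        return ["response to " + prompt for prompt in batch]

    def batching_loop(max_batch=8, window_s=0.01):
        while True:
            batch = [requests.get()]  # block until at least one request arrives
            deadline = time.monotonic() + window_s
            # Collect whatever else shows up inside the batching window.
            while len(batch) < max_batch and time.monotonic() < deadline:
                try:
                    timeout = max(0.0, deadline - time.monotonic())
                    batch.append(requests.get(timeout=timeout))
                except queue.Empty:
                    break
            for response in run_model(batch):
                print(response)

    threading.Thread(target=batching_loop, daemon=True).start()
    for prompt in ["hi", "tell me a joke", "weather?"]:
        requests.put(prompt)
    time.sleep(0.2)  # give the daemon thread time to drain the queue

    Whether that effort pays off depends on concurrency: for a single self-hosted user it buys nothing, which is essentially the point made above; for a shared public deployment it multiplies throughput.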

  • @Sven_Dongle 1 year ago +2

    Yuck, why not use languages that can do real multithreading and scale, like Java and C++?

    • @Sven_Dongle 1 year ago

      @Chirpy Banana Must have been designed by a polyamorous black muslim quadriplegic Aspergers sufferer.

    • @demolicous 1 year ago +6

      No one is stopping you from creating your own version of this with your preferred language

    • @Sven_Dongle 1 year ago

      @@demolicous Never said anyone was, midwit.

    • @fluffio2976 1 year ago

      @@Sven_Dongle I believe demolicous' point was "You can write it yourself if you don't like it".

    • @Sven_Dongle 1 year ago

      @@fluffio2976 Ah, good, I needed another microbrain to state that more plainly. Your fellow traveler dooshbags salute you.