Function Calling with Local Models & LangChain - Ollama, Llama3 & Phi-3

Sdílet
Vložit
  • čas přidán 1. 06. 2024
  • Code : github.com/samwit/agent_tutor...
    🕵️ Interested in building LLM Agents? Fill out the form below
    Building LLM Agents Form: drp.li/dIMes
    👨‍💻Github:
    github.com/samwit/langchain-t... (updated)
    github.com/samwit/llm-tutorials
    ⏱️Time Stamps:
    00:00 Intro
    01:27 Phi-3 Model Blog
    02:01 Gorilla Paper
    02:12 Function Calling Leaderboard
    03:00 Code Time
    03:02 Set up Llama 3 with Ollama
    03:45 Set up Prompt Template
    05:31 Get a JSON Output from Llama 3
    08:50 Get a Structured Responses using Ollama Functions
    11:44 Phi-3 Model Demo
    12:57 Tool Use and Function Calling Sample
    15:09 Trying Tool Use and Function Calling with Phi-3
  • Věda a technologie

Komentáře • 65

  • @jonm691
    @jonm691 Před 6 hodinami

    Great video, very informative, and filled some gaps. Thank you

  • @OliNorwell
    @OliNorwell Před 24 dny +14

    I would recommend adding 'Langchain' to the title of the video, most of this is very langchain specific, for those specifically searching for that.

    • @samwitteveenai
      @samwitteveenai  Před 24 dny +5

      Very good point. Added! Thanks!

    • @marilynlucas5128
      @marilynlucas5128 Před 23 dny

      @@samwitteveenai Great experiment you're running there but please consider using lm studio's new cli as well in your subsequent videos instead of ollama all the time. Also can you try using Anima's air llm library so you can run the llama 3 70B locally using layered inference?

    • @samwitteveenai
      @samwitteveenai  Před 23 dny +1

      I haven't heard of Anima's air llm library but will check it out

    • @theh1ve
      @theh1ve Před 23 dny

      Lm studio isn't as 'open' as ollama so would restrict the use cases to just personal use.

  • @jiyuhen
    @jiyuhen Před 24 dny +3

    Thank you for doing this with Ollama, this was an really good explanation and helped me a lot!

  • @rickymehra1104
    @rickymehra1104 Před 22 dny +3

    Thank u sharing video of ollama with phi3 to run locally, hope u would come up wid more such videos to use ollama locally for different tasks. Pls mk more videos on phi3, llama3 with ollama.

  •  Před 24 dny +4

    Excellent as usual! For Phi3 3.8b latest it works fine with:
    prompt_phi = PromptTemplate.from_template(
    """{context}
    Human: {question}
    AI:"""
    )
    Otherwise you will get validation errors.
    All the best Sam!

    • @gregorychatelier2950
      @gregorychatelier2950 Před 9 dny

      I got validation errors both with llama3 and phi3. Worse, the LLM was answering wrong, it returned Alex.
      Changing the prompt solved it. I tried Mistral v0.3, and it works too.
      Sam, I wonder where you found the recommanded prompt formats ?
      Also I would appreciate a video on how you handle validation errors as they may occur from time to time.

    •  Před 8 dny

      @@gregorychatelier2950 Hello it seems that the Llama3 prompt format had been altered a bit a few weeks after model's release (Reddit). To be double checked...

  • @chriskingston1981
    @chriskingston1981 Před 24 dny

    Ah really needed this, I kept feeling, I want to learn function calling with llama3. Feels so good to use a local model with function calling, and langchain made it really easy to do. Love to experiment with it now, thank you so much for this video❤️❤️❤️ and thanks to langchain for making it easy to do function calling ❤️❤️❤️

  • @andyma1146
    @andyma1146 Před 24 dny +1

    Thanks for the video! I'd like to see an example of using DSPy to optimize a local model so that it can use tools more reliably. I'm actually not sure if this would work but I'd like to find out. 😃

  • @sven262
    @sven262 Před 23 dny

    Thank you so much. Super helpful.

  • @aa-xn5hc
    @aa-xn5hc Před 24 dny

    Amazing video 🙏🏻
    Currently using crewai

  • @mshonle
    @mshonle Před 24 dny

    For local models I’ve found it’s helpful to at extra context at the very end of the prompt, in the assistant reply section (not the instruction section), kicking things off with “Sure, here is your JSON:” and then adding markdown syntax for preformatted text and then letting one of the end symbols be the final three backticks to close the markdown. It’s also helpful to write a custom grammar (like with llama.cpp) to constrain output to a specific schema even. (Depending on your setup this could slowdown inference if the constrained generation part isn’t running on the GPU.)

  •  Před 13 dny

    Very useful, thanks!

  • @eduardovernier7628
    @eduardovernier7628 Před 24 dny

    Very cool! I've been using the instructor library with pydantic for structured output and had a lot of success on openai models, but it didn't work very well with local llms. I'll definitely try out your approach!

  • @hienngo6730
    @hienngo6730 Před 24 dny +1

    Thank you for the informative videos as always. One note: if you want to run things all locally and want a lot better throughput, running the models using vLLM and serving the API with vLLM's OpenAI-compatible server is definitely the way to go. If you have a 24 GB VRAM GPU like a 3090 or 4090, you can run a GPTQ or AWQ quantized model, or just the full FP16 model and serve a large number of concurrent clients. With batching, you can get thousands of tokens per second in aggregate for responses if you run a lot of parallel clients.

    • @jay-dj4ui
      @jay-dj4ui Před 24 dny

      linux only, and I am not sure it has enough performance like that. Multiple API calling contiusely sounds great. just not sure....

    • @marilynlucas5128
      @marilynlucas5128 Před 23 dny

      You can run the llama 3 70b model with as little as 4gb gpu using Anima's air llm library which enables layered inference.

    • @hienngo6730
      @hienngo6730 Před 23 dny

      @@marilynlucas5128 I've never used this library before, what kind of tokens per second speed can you get? For reference, using LLaMA-3 70B with exllamav2 quantization at 2.4bpw on a single 4090, you can get around 36 tokens/second. With 2x4090s and 5.0bpw quantization, you get around 18 t/s.

  • @alx8439
    @alx8439 Před 24 dny +2

    The biggest issue with function calling is that the way everyone suggests to use it is not very viable / economical, if you want your model to choose one out of many functions to call. I'll elaborate: in order for LLM to pick a function to use, you need to announce all those tools in advance and make sure it hasn't forgotten them, if you're going into multy turn chat. This means more context will be used just to make model aware about all these extra tools you want it to use and less context will be available for responses. There's probably some semantic router needs to be introduced in-between to give model only those tools which might be relevant to current question

    • @brianmorin5547
      @brianmorin5547 Před 19 dny

      100% my experience as well. In fact, I’ve only had success doing function calling by putting it at the individual run level rather than at the model level and only calling a single function that will be needed

    • @tonyrungeetech
      @tonyrungeetech Před 18 dny

      I have a video doing exactly this with a library called semantic router and crew-ai!

  • @MeirMichanie
    @MeirMichanie Před 22 dny

    Thanks for the code and the explanation. In order to be usable, you should be able to execute the function feed the info back into the history of the conversation with the result of the function and then the llm should be able to use the results from the function to write the last message.
    For instance, lets say that the weather tool responds with just the temperature and nothing else, then the LLM should be able to respond back 'in Singapore the current temperature is ..." and in the same language as it was asked from the user.

    • @superstippi
      @superstippi Před 11 dny

      Absolutely agreed. It seems to be very hard to find information on how to do exactly that. The Phi-3 chat template doesn't seem to introduce a dedicated role for a function call result. So if it seems to be the "user" replying with a function call result, why would the model figure that it needs to phrase that into a coherent message? Also, I fail to get sensible output when there is more than one function declared and the model is supposed to be free to use a tool or reply directly. Often, I get long chunks of what appears to be training data appended to the initial reply.

  • @kenfink9997
    @kenfink9997 Před 23 dny

    Great video as always! In future videos, could you please show how to do this with Ollama and langchain running on separate computers? I'd like to develop on Laptop or Colab with just inference running on my Desktop PC. And since Ollama doesnt currently do API keys, how do we secure the inference server and access it from a Colab notebook?
    Thanks!!

  • @Shiroikage98
    @Shiroikage98 Před 24 dny

    would love if you can explain this using the ollama python package. As someone else said this is very specific to langchain and i just cant find good information on how to use function calling with ollama.

  • @comfixit
    @comfixit Před 24 dny

    I have found Phi-3 truly impressive for its size, getting good results even for general inquiries. I almost wonder if you could just use Phi-3 if you don't need a super refined response. It's so light on resources comparatively for an LLM.

    • @samwitteveenai
      @samwitteveenai  Před 24 dny

      Agree it is a nice model especially when you consider its size

  • @CraftPit
    @CraftPit Před 24 dny

    Phi3 excels at creative language tasks, surpassing even GPT-4 in my tests. GPT-4 itself ranks Phi3's lyrics higher :)

  • @Carnivore69
    @Carnivore69 Před 22 dny

    Great video. I was hoping this would give me a reason to try LangChain vs my own prompt/post-parsing for a web ui, but I'm actually getting better results than this demonstrates. I'm using llama3-8B via LM Studio. I think until these guys get their sh*t together and create a standard for output, this is going to be similar to the browser wars (standards). At the very least, they should all conform to current markdown standards or accept a config/spec for default output. Whoever comes out with an open source competitive model that does this is going to be the clear leader... for me anyway.
    ...And if such a model exists, please point me to it!! :)

  • @MukulYadav-pw9se
    @MukulYadav-pw9se Před 23 dny

    wow Sam!!!, this video is really helpful but i am facing challenge in running it on server as the response is not coming within 1 min and i am getting 504 Gateway Timeout error, i have used ollama docker image to install ollama but i am not able to find how to increase gateway timeout to 10 mins instead of default 1 min.
    Can you please help if you have faced such issue?

  • @svenvanwier7196
    @svenvanwier7196 Před 10 dny

    I see you use a mac mini, could you talk more about what model and OS setup?
    Thinking of fun things to do with my 2011 2ghz i7 16gb ddr3 ram, a local something on my network if I could.

  • @sumanthbalakrishnan285

    How do I incorporate function calling with follow up questions and memory. Say a user asks “what is the weather”. The model should be able ask “what place are you requesting for” and say the user replies “California”
    It should then make the function call with the mentioned arguments. Please let me know which direction I should look in order to achieve this.

  • @user-iu5ue4bv8q
    @user-iu5ue4bv8q Před 22 dny

    Thanks for sharing this, how can I use this json output funcution call format to combine the langchian agent functuion call framework , which. Use the llm.blind_tool to replace the llm=ChatOpenAI()? Will this work? Thanks

  • @jay-dj4ui
    @jay-dj4ui Před 24 dny

    Hi<
    Is that because we try to give it as much more accurate and better machine-readable input, so the model does not have to 'think' too much that it can follow the correct format like JSON and some basic function, and it can meet some complex requirements also. The way is more efficient and energy-saving.

  • @kallebysantos5167
    @kallebysantos5167 Před 24 dny

    Is possible to fine tune a small language model for function call?
    For example, if we look to BERT models that perform zero-shot classification we can pass a set of labels to it, so maybe is possible to use a similar approach to get a very performatic model just for function calling, since LLMs are very huge and almost every time requires a GPU. I know that phi3 is very small but in my machine it takes like 3Gb of GPU.

    • @samwitteveenai
      @samwitteveenai  Před 24 dny

      Yes very possible to do the key is getting the dataset and most people aren't making their datasets for this public.

  • @pensiveintrovert4318
    @pensiveintrovert4318 Před 24 dny

    I have been running gpt-pilot with Llama3-70b-instruct.Q5_K_M for a couple of weeks. The biggest problem I have, as far as I understand, is not function calling but rather the stability of the framework. It starts developing a bunch of files, but when I provide feedback, it may abandon the old files instead of correcting them, and starts creating a new set of files. Basically makes a mess.

  • @harshkesharwani8730
    @harshkesharwani8730 Před 12 dny

    How to use chatOllama along with function calling. i want to pass messages along with functions same as open ai v1/chat/completions api provides.

  • @kaushiklade
    @kaushiklade Před 20 dny

    Hey, thats very helpful to understand how to run these models locally.
    Can u/anyone tell, how to actually do actual function call and pass that response to llm? Is it possible without LangGraph???
    I want llm to decide which tool to call, once he decide that, llm should do entity extraction and then invoke tool, then returns ans back to llm and gives it to user. This was easy with AgentExecutor in OpenAI examples.
    Similar thing possible in Ollama?

  • @peterdecrem5872
    @peterdecrem5872 Před 24 dny

    What was the name of the paper that shifts the probabilities to get json as response more likely?

  • @alx8439
    @alx8439 Před 24 dny

    At last someone finds a good use for agents - to give them some tasks you want accomplished and give loose it free overnight to use internet :)

    • @willjohnston8216
      @willjohnston8216 Před 24 dny +1

      I don't understand how this demonstrated using agents overnight on the Internet? I'd really like to know how to do that. What did I miss?

    • @alx8439
      @alx8439 Před 24 dny

      @@willjohnston8216 Mr. Witteven just mentioned this as a possible implication. I was just glad more people to turn their minds into some real world use cases for agentic flows - like giving a topic for your agent and let it research it, find products / software, which you would never find in ads, do some data gathering and processing for you, providing helpful summaries on a hot topics you never have time to investigate properly yourself, etc etc etc

  • @madhudson1
    @madhudson1 Před 23 dny

    all looked well and good until you try feeding a question into the 'agent' that doesn't relate directly to: "get the current weather in a given location".
    I thought the whole point of function calling/tooling was to present the LLM with the opportunity to use tooling if necessary.

  • @AIvetmed
    @AIvetmed Před 24 dny

    has someone tried to load the models other than using ollama like the huggingface transformer pipeline or in other words I would love to know how torun these models in Linux based servers like databricks where I am unable run ollama application in the background like in my windows PC?

    • @MavVRX
      @MavVRX Před 24 dny

      Ollama already supports windows

    • @AIvetmed
      @AIvetmed Před 24 dny

      @@MavVRX for Linux based servers like databricks server

    • @samwitteveenai
      @samwitteveenai  Před 24 dny

      I made a Llama3 review deep dive video and show loading that in HF Transformers there in a colab

  • @RobBominaar
    @RobBominaar Před 24 dny

    Well, actually, where are the functions? I only see a Json string.

  • @harshkesharwani5621
    @harshkesharwani5621 Před 24 dny

    Can I use function calling with llama.cpp?

    • @samwitteveenai
      @samwitteveenai  Před 24 dny

      in theory yes but might need to mess with how to get it accept them etc.

    • @harshkesharwani5621
      @harshkesharwani5621 Před 24 dny

      How one can pass multiple functions and let model decide to use particular one. Does it supports multiple functions

    • @MavVRX
      @MavVRX Před 24 dny

      The bind function takes in an array of functions so you can simply add the additional functions to the array separated by commas. E.g. [f1, f2]

    • @harshkesharwani5621
      @harshkesharwani5621 Před 8 dny

      But how to use function calling along with chat message like user, system and assistance role

  • @StephenRayner
    @StephenRayner Před 24 dny

    You are not using latest version. It’s now called “bind” not bind_tools

    • @samwitteveenai
      @samwitteveenai  Před 24 dny

      I am using the latest langchain-experimental 0.58 the bind is used in the main function calling with prop models for the OllamaFunction they still have it as bind_tools. If I am missing something send me a link.

  • @Anthony-dj4nd
    @Anthony-dj4nd Před 9 dny

    This is like the reverse of crypto mining. Lol😅

  • @hightidesed
    @hightidesed Před 21 dnem

    very cool, but this is kind of useless unless you can mix text responses and function calling with the same prompt

  • @meca_p
    @meca_p Před 23 dny

    I hope you to make react agent tutorial with ollamafunction..!