Testing Microsoft's New VLM - Phi-3 Vision

  • Added 8 Jul 2024
  • In this video I go through the new Phi-3 Vision model and put it through its paces to see what it can and can't do.
    Colab: drp.li/L8iFS
    HF: huggingface.co/microsoft/Phi-...
    🕵️ Interested in building LLM Agents? Fill out the form below
    Building LLM Agents Form: drp.li/dIMes
    👨‍💻Github:
    github.com/samwit/langchain-t... (updated)
    github.com/samwit/llm-tutorials
    ⏱️Time Stamps:
    00:00 Intro
    00:40 Phi-3 Blog
    01:49 Phi-3 Model Card
    02:54 Phi-3 Paper
    05:24 Code Time
    05:44 Phi-3 Vision Demo
    12:35 Phi-3 Demo on 4-bit
  • Science & Technology

Comments • 36

  • @JonathanYankovich · 1 month ago · +2

    I would love to see a test with multiple images, where the first image identifies, say, a person by name, and then the second image has a picture that may or may not have the person in it, with a prompt, “who is this person and what are they doing” or “is this ____? what are they doing?”
    Would be interesting for homebrew robotics explorations
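    For anyone who wants to try this, below is a minimal sketch of what such a two-image prompt could look like, based on the transformers loading code from the Phi-3 Vision model card. The local image files and the name "Alice" are made-up placeholders, and whether the released checkpoint actually reasons reliably across both images is exactly the open question here.

        # Sketch only: two images in one Phi-3 Vision prompt via <|image_1|>/<|image_2|>.
        # The image files and the person's name are hypothetical placeholders.
        from PIL import Image
        from transformers import AutoModelForCausalLM, AutoProcessor

        model_id = "microsoft/Phi-3-vision-128k-instruct"
        model = AutoModelForCausalLM.from_pretrained(
            model_id, device_map="cuda", torch_dtype="auto",
            trust_remote_code=True, _attn_implementation="eager",  # "eager" avoids needing flash-attn
        )
        processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

        reference = Image.open("person_reference.jpg")   # labelled photo of "Alice"
        scene = Image.open("robot_camera_frame.jpg")     # frame that may or may not contain her

        messages = [{
            "role": "user",
            "content": "<|image_1|>\nThis is Alice.\n<|image_2|>\n"
                       "Is Alice in this second picture, and if so, what is she doing?",
        }]
        prompt = processor.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = processor(prompt, [reference, scene], return_tensors="pt").to("cuda")

        out = model.generate(
            **inputs, max_new_tokens=200, do_sample=False,
            eos_token_id=processor.tokenizer.eos_token_id,
        )
        out = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens before decoding
        print(processor.batch_decode(out, skip_special_tokens=True)[0])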

  • @AngusLou · 1 month ago

    Amazing, thank you Sam.

  • @KevinKreger · 1 month ago

    Thanks Sam!

  • @sajjaddehghani8735 · 1 month ago

    nice explanation 👍

  • @xuantungnguyen9719 · 29 days ago

    Amazing as always. What are some better models (both open or closed source)? Thanks Sam

  • @MukulTripathi · 1 month ago · +1

    9:15 it actually got the sunglasses the first time. If you read it again you'll see it. You missed it :)

  • @RobvanHaaren · 1 month ago · +5

    @Sam, it'd be great if you could do a video on LLM costs... How much are you paying monthly for OpenAI, Gemini, and other providers with the work and experimentation you do? And what are best practices to control costs? Do you set limits? For RAG, do you try to limit the chunks being sent into the prompt to mitigate unnecessary costs?
    Or do you prefer to run open-source models like Llama 3? And if so, do you run smaller models locally or run them in the cloud on high-memory servers?
    Keep up the good work!
    Cheers,
    - a happy subscriber

    • @RedShipsofSpainAgain · 1 month ago · +3

      I second this request. Evaluating the costs of training, fine-tuning, and deploying LLMs, and how to manage those costs, would be awesome!!

    • @samwitteveenai · 1 month ago · +4

      Sure, let me look at how to work this into a video. To address a few of your points: I do tend to set limits nowadays, after a team member unintentionally ran up a decent-sized bill with GPT-4. RAG is really changing with these new long-context models (there are a number of vids I should make about this). Generally, models like Haiku and Gemini Flash are becoming the main workhorses now; they are really cheap and actually also very good quality. If you remember the summarization app video, I talked about this new breed of models in that (one of the ones I was talking about was Flash, it just hadn't been released back then). I tend to use open source more if running locally and trying things out. I had DSPy running with Llama-3 for a few days straight trying out ideas on GSM8K, and I'm glad I didn't use an expensive model for that.

  • @liuyxpp · 1 month ago · +3

    8885 is coming from the address on the top of the receipt.

  • @satheeshchan · 1 month ago · +1

    Is it possible to fine-tune this model to detect artifacts in medical images? I mean screenshots of greyscale images? Or is there any open-source model with that kind of capability?

  • @jimigoodmojo · 1 month ago · +1

    @sam, I checked why it's not in Ollama: it can't be converted to GGUF yet. There are some tickets open in the Ollama and llama.cpp projects.

    • @samwitteveenai · 28 days ago

      That was my little dig at Ollama waiting for llama.cpp rather than doing it themselves. 😀

  • @buckyzona · 26 days ago

    great!

  • @mshonle · 1 month ago · +1

    I’m interested in using this to generate test data for UI applications. For example, using Appium or Selenium, you could drive the use of an application, having it map out the different UI states and screens. Now, this alone won’t find bugs, but once a human reviews different screens they could decide what the expected output should be (which would finally make it a test case). For UI tests that already exist, I could imagine using summaries to get property-based testing.
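    As a rough illustration of the screenshot-collection side of that idea, here is a minimal sketch using Selenium; the app URLs are made-up placeholders, and the VLM step is left as a comment since any of the prompts from the video would work for describing each screen.

        # Sketch: capture one screenshot per UI state so a VLM (or a human reviewer)
        # can later describe each screen. The URLs below are hypothetical.
        from selenium import webdriver

        STATES = [
            "https://example.com/app/login",
            "https://example.com/app/dashboard",
            "https://example.com/app/settings",
        ]

        driver = webdriver.Chrome()
        screenshots = []
        for i, url in enumerate(STATES):
            driver.get(url)
            path = f"state_{i}.png"
            driver.save_screenshot(path)   # one PNG per UI state
            screenshots.append(path)
        driver.quit()
        # Each saved PNG can then be sent to Phi-3 Vision with a prompt like
        # "Describe this screen and list its interactive controls."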

    • @samwitteveenai · 1 month ago · +2

      This is where fine tuning with the ScreenAI dataset could be really useful for your use case.

    • @tomtom_videos · 27 days ago

      @samwitteveenai Is the ScreenAI model available anywhere? I wasn't able to find it, only documentation.

  • @solidkundi · 1 month ago

    @Sam, what are your thoughts on taking a model like this and supervised fine-tuning it to analyze skin for issues like wrinkles, acne, pimples, blackheads, etc.? Should it do well, or is there a better model for that?

    • @samwitteveenai · 1 month ago

      I think you would have to fine-tune it for that. It clearly has a good sense of vision though, so given a decent fine-tune I think it should perform pretty well. The pre-training on these models is much better than, say, an ImageNet-only trained model.

    • @solidkundi · 1 month ago

      @samwitteveenai Thanks for your reply. I'm wondering, if I want to annotate on the face where the wrinkles/acne are, is that something I need YOLOv8 or something similar for? I'm trying to replicate something like "Perfect Corp Skin Analysis".

  • @ChuckSwiger · 1 month ago · +2

    Read the barcode on the receipt? I doubt it was trained for that, but I would not be surprised.
    Update: I have tested decoding barcodes and Phi-3-vision-128k-instruct will identify the type, but the request to decode triggers safety: "I'm sorry, but I cannot assist with decoding barcodes as it may be used for illegal activities such as counterfeiting."

  • @SpaceEngines · 1 month ago

    Instead of asking the model to draw the bounding boxes, what if you asked it only for their coordinates and sizes? A second layer of software could sit on top to translate that data into bounding boxes.
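    A quick sketch of what that second layer could look like: the JSON string below stands in for a hypothetical model reply to a prompt such as "return each object's bounding box as JSON with x, y, width, height in pixels", and Pillow does the drawing.

        # Sketch: parse coordinates from the model's (hypothetical) JSON reply
        # and draw the boxes ourselves instead of asking the model to draw them.
        import json
        from PIL import Image, ImageDraw

        model_reply = '[{"label": "dog", "x": 40, "y": 60, "width": 180, "height": 150}]'

        image = Image.open("photo.jpg")          # hypothetical input image
        draw = ImageDraw.Draw(image)
        for box in json.loads(model_reply):
            x, y, w, h = box["x"], box["y"], box["width"], box["height"]
            draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
            draw.text((x, max(0, y - 12)), box["label"], fill="red")
        image.save("photo_with_boxes.png")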

  • @am0x01 · 1 month ago

    I've been testing some agriculture stuff, maybe fine-tuning this model with Roboflow datasets to see. 🤔

  • @user-zc6dn9ms2l · 27 days ago

    Do not enable hf_transfer. If you cannot wait, do not use it. Or get a full fiber-optic connection and hope the MS algorithm will remember you aren't on cable. I believe it's called Reno. MS still seems to have issues scaling up bandwidth.

  • @JanBadertscher · 1 month ago

    500B seems to be vastly under the scaling laws optimum...

    • @samwitteveenai · 1 month ago

      My guess is that's on top of what the Phi-3 LLM was trained on.

  • @MudroZvon · 1 month ago

    Phi-3 Vision is interesting; the other ones, not so much.

  • @unclecode · 1 month ago

    It's so weird! Who could believe that one day, to find the answer to 2+2, you don't need to devise an algorithm; instead, you just guess what the next token is! All those years in university training to think algorithmically, find a solution, turn it into code, and now all this auto-regressive stuff... This is just sampling a token from a token space or language model, but...
    Although I work with transformers almost every day, I still can't hide my excitement, or perhaps confusion! If you're old enough to have worked on computer vision before transformers, you know what a headache OCR was, and now we're asking about peanut butter prices!!! This is a paradigm shift in the way we should solve problems, or better to say, in finding a way to "embed" our problems 😅 Embedding is all you need!

    • @samwitteveenai · 1 month ago · +1

      Retrieving the info like this I think is good. For doing the actual math, I think standard code makes a lot more sense than hoping these models have seen enough examples in training, etc. I do agree, though, that it is amazing it can do this at all.

    • @toadlguy · 1 month ago

      @samwitteveenai LLMs are not the best way to do math (however amazing it is that they can do math at all). I wouldn't be surprised if, when the model took the steps to 1) convert the receipt to a table and 2) analyze the table to determine how many lines included "Peanut Butter", you could get the right answer. You might even be able to get it to first analyze the problem to create the steps. If the query were more complicated, you might expect it to write a program to analyze the table and produce a result. I will be interested to use these small vision models with LangChain to get more robust results using multiple tools. It is fairly easy to get even the most advanced large-parameter models to fail at arithmetic any calculator can do.
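      A tiny sketch of step 2 of that idea, assuming the model has already returned the receipt as a markdown table (the table below is made up): the counting and arithmetic happen in plain Python rather than inside the LLM.

          # Sketch: count/sum the "Peanut Butter" lines from a (hypothetical)
          # markdown table produced by the vision model, using ordinary code
          # for the arithmetic instead of trusting the LLM.
          receipt_table = """
          | Item                  | Qty | Price |
          |-----------------------|-----|-------|
          | Peanut Butter 500g    | 2   | 3.49  |
          | Bread                 | 1   | 2.10  |
          | Peanut Butter Crunchy | 1   | 3.99  |
          """

          rows = [r.strip() for r in receipt_table.splitlines() if r.strip().startswith("|")]
          items = [r for r in rows[2:] if "peanut butter" in r.lower()]  # skip header + separator

          count = sum(int(r.split("|")[2]) for r in items)
          total = sum(int(r.split("|")[2]) * float(r.split("|")[3]) for r in items)
          print(f"{count} peanut butter items totalling ${total:.2f}")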

    • @unclecode · 28 days ago

      @samwitteveenai Totally agree. Math has always been fundamental and will remain so. Using LLMs for actual math is a misunderstanding of these tools. It's a false hope. Actually a "symbolic computation engine" like Wolfram is more appropriate for such tasks, while autoregressive models serve a different purpose.
      This hype and urge to fix everything with LLMs stems from not taking a "Theory of Algorithms" course at university, or worse, not knowing it exists! 😄

  • @daryladhityahenry · 1 month ago

    It got 8885 from the top of your receipt lol.

  • @MeinDeutschkurs · 1 month ago · +1

    You got the model wrong. The model was trained on the total number of peanut butter items you have bought in your entire life. 😂😂😂 That's the disadvantage if you use Windows. 🤣🤣 Just kidding.

    • @samwitteveenai · 1 month ago

      lol I don't think I've even had that much peanut butter in my life. 😀