Gemini Demo But With GPT-4 Vision API
- Published 27. 07. 2024
- Github: github.com/unconv/gpt4v-gemini
In today's video I showcase a Python program I made using OpenAI's GPT-4 Vision API, Text-to-Speech API and Whisper, which attempts to accomplish what the Google Gemini multimodal demo shows.
More information on the project coming up in future videos.
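(For the curious, here is a minimal sketch of the vision part: sending one captured frame to the GPT-4 Vision API with the official openai package. `describe_frame` is an illustrative name and this is not necessarily how the repo does it.)

```python
# Illustrative sketch, not the repo's actual code; assumes the official
# openai Python package and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def describe_frame(image_path, question):
    # Encode the captured frame as base64 so it can be inlined in the request
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```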
Support: buymeacoffee.com/unconv
Consultations: www.buymeacoffee.com/unconv/e...
Memberships: www.buymeacoffee.com/unconv/m...
00:00 Demo
03:25 Bloopers
05:26 Unedited Version
Better Demo than Google 🙂
Your demo is better than the one from Google. It looks like they hand-selected screenshots to send and gave more hints in their prompts, but didn't include the whole prompt in the demo.
Wow! Looks impressive!
❤
that last blooper was funny :D
nice demo!! the most interesting part here I'd say is when to capture the screenshot - maybe when you pause talking? 🤔 and maybe you can add multiple captures when there's movement, or diff images every second.
When it detects movement, it starts saving all the frames until movement stops. Then it splits the list of frames into six equal parts. Then it takes the sharpest frame from each part and makes a collage from them and sends that to ChatGPT. And yes, when talking stops, it sends a screenshot. If there was movement during talking, it sends the collage as well.
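(A minimal sketch of what that sharpest-frame collage step could look like, assuming OpenCV and NumPy and same-size frames; `sharpness` and `make_collage` are illustrative names, not the repo's actual code.)

```python
# Illustrative sketch of the collage step described above.
import cv2
import numpy as np

def sharpness(frame):
    # Variance of the Laplacian is a common focus/sharpness measure
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def make_collage(frames, parts=6):
    # Split the frames captured during movement into `parts` equal chunks
    # (assumes at least `parts` frames were captured)
    chunks = np.array_split(np.asarray(frames), parts)
    # Keep the sharpest frame from each chunk
    best = [max(chunk, key=sharpness) for chunk in chunks]
    # Tile the six picks into a 2x3 grid so only one image goes to the API
    rows = [np.hstack(best[i:i + 3]) for i in range(0, len(best), 3)]
    return np.vstack(rows)
```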
@unconv awesome!! great job!! I love the idea of making a collage - I didn't see that coming. keep it up bro!!
Are you sending all the frames to GPT-4V? I have a function which compares subsequent frames in a video and only extracts the ones that meet a difference threshold, so for example out of a 25-second video it might pull out 7 frames that differ enough to be significant enough to send to the API.
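(A hedged sketch of such a frame-difference filter, assuming OpenCV and NumPy; `extract_key_frames` and the threshold value are illustrative, not the commenter's actual function.)

```python
# Illustrative sketch: keep only frames that differ enough
# from the last frame that was kept.
import cv2
import numpy as np

def extract_key_frames(video_path, threshold=0.1):
    cap = cv2.VideoCapture(video_path)
    kept, last = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Mean absolute pixel difference, normalized to 0..1
        if last is None or np.mean(cv2.absdiff(gray, last)) / 255.0 > threshold:
            kept.append(frame)
            last = gray
    cap.release()
    return kept
```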
you can even diff screenshots every second, I think that would be sufficient.
@robrita it depends - where scrolling text is involved, for example, it is not, and there's no point introducing potential loss where there is no cost. A lot can happen in a second.
@avi7278 of course there's a cost, why assume not?? more images take longer to respond, and it's unnecessary resource usage for most use cases.. it's not like running YOLOv8 on your own PC 😆😆😆
During movement it only sends a collage of 6 "strategically selected" frames (a single image), plus one screenshot after talking stops.
Will it work if I try to change the model to Gemini Vision with all the parameters?
Great demo!
Unrelated to this video... I tried your "ok-gpt" code with whisper (the tiny model) on a Pi 4. Recognition works fine, but latency is kind of a deal breaker :( I guess I have a reason to get a Pi 5 now!
Thanks! Good to know. Maybe I'll try to switch it to using the Whisper API (and leak all conversations to OpenAI lol)
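(If that switch happens, the hosted Whisper API call with the openai package could look roughly like this; `transcribe` is an illustrative name, not the ok-gpt code.)

```python
# Illustrative sketch of swapping local Whisper for OpenAI's hosted Whisper API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(audio_path):
    # Upload the recorded audio and get back the transcription text
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```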