Gemini Demo But With GPT-4 Vision API

  • Date added: 27. 07. 2024
  • GitHub: github.com/unconv/gpt4v-gemini
    In today's video I showcase a Python program I made using OpenAI's GPT-4 Vision API, Speech-to-Text API and Whisper, which attempts to accomplish what the Google Gemini multimodal demo shows.
    More information on the project coming up in future videos.
    Support: buymeacoffee.com/unconv
    Consultations: www.buymeacoffee.com/unconv/e...
    Memberships: www.buymeacoffee.com/unconv/m...
    00:00 Demo
    03:25 Bloopers
    05:26 Unedited Version
  • Science & Technology

Comments • 15

  • @mallardlane8965 · 7 months ago · +3

    Better Demo than Google 🙂

  • @corvo1068 · 7 months ago · +1

    Your demo is better than the one from Google. It looks like they hand-selected the screenshots to send and gave more hints in their prompts, but didn't include the whole prompt in the demo.

  • @YuraL88 · 7 months ago

    Wow! Looks impressive!

  • @Crovea · 7 months ago

    that last blooper was funny :D

  • @robrita · 7 months ago · +1

    Nice demo!! The most interesting part here, I'd say, is when to capture the screenshot - maybe when you pause talking? 🤔 And maybe you could add multiple captures when there's movement, or diff the images every second.

    • @unconv · 7 months ago · +5

      When it detects movement, it starts saving all the frames until the movement stops. Then it splits the list of frames into six equal parts, takes the sharpest frame from each part, makes a collage from them and sends that to ChatGPT. And yes, when talking stops, it sends a screenshot. If there was movement during talking, it sends the collage as well.
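
      A minimal sketch of that frame-selection step, assuming OpenCV and NumPy are available; the variance-of-Laplacian sharpness measure, the helper names and the 2x3 grid layout are illustrative assumptions, not the exact code from github.com/unconv/gpt4v-gemini:

```python
import cv2
import numpy as np

def sharpness(frame: np.ndarray) -> float:
    # Variance of the Laplacian: higher means a sharper (less blurry) frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def make_collage(frames: list[np.ndarray], parts: int = 6) -> np.ndarray:
    if not frames:
        raise ValueError("no movement frames recorded")
    # Split the recorded movement frames into equal parts and keep the sharpest of each
    chunks = np.array_split(np.arange(len(frames)), parts)
    picks = [max((frames[i] for i in chunk), key=sharpness) for chunk in chunks if len(chunk)]
    # Pad with black frames if the movement was too short to fill all parts
    while len(picks) < parts:
        picks.append(np.zeros_like(picks[0]))
    # Arrange the six picks in a 2x3 grid and return a single image to send to the API
    rows = [np.hstack(picks[i:i + 3]) for i in range(0, parts, 3)]
    return np.vstack(rows)
```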

    • @robrita · 7 months ago · +1

      @unconv Awesome!! Great job!! I love the idea of making a collage - I didn't see that coming. Keep it up, bro!!

  • @avi7278 · 7 months ago · +2

    Are you sending all the frames to GPT-4V? I have a function which compares subsequent frames in a video and only extracts the ones that meet a difference threshold - so, for example, out of a 25-second video it might pull out 7 frames that differ enough to be worth sending to the API.
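
    A rough sketch of that kind of difference-threshold filter, assuming OpenCV is available; the mean-absolute-difference metric and the threshold value are placeholder choices, not the commenter's actual function:

```python
import cv2
import numpy as np

def extract_keyframes(video_path: str, threshold: float = 15.0) -> list[np.ndarray]:
    # Keep only frames that differ enough from the last frame that was kept
    cap = cv2.VideoCapture(video_path)
    kept, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Mean absolute pixel difference against the previously kept frame
        if last_gray is None or np.mean(cv2.absdiff(gray, last_gray)) > threshold:
            kept.append(frame)
            last_gray = gray
    cap.release()
    return kept
```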

    • @robrita · 7 months ago

      You can even diff screenshots every second - I think that would be sufficient.

    • @avi7278 · 7 months ago

      @robrita It depends - where scrolling text is involved, for example, it is not, and there is no point in introducing the potential for loss where there is no cost. A lot can happen in a second.

    • @robrita · 7 months ago

      @avi7278 Of course there's a cost - why assume there isn't? More images take longer to respond to, and it's unnecessary resource grabbing for most use cases. It's not like running YOLOv8 on your own PC 😆

    • @unconv · 7 months ago · +1

      It only sends a collage of 6 "strategically selected" frames during movement (one image). And one image after talking stops.

  • @DarkNetDragoon · 5 months ago

    Will it work if I try to change the model to Gemini Vision with all the parameters?

  • @thenoblerot · 7 months ago

    Great demo!
    Unrelated to this video... I tried your "ok-gpt" code with Whisper (the tiny model) on a Pi 4. Recognition works fine, but the latency is kind of a deal-breaker :( I guess I have a reason to get a Pi 5 now!

    • @unconv · 7 months ago

      Thanks! Good to know. Maybe I'll try to switch it to using the Whisper API (and leak all conversations to OpenAI lol)
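
      For reference, swapping local Whisper for the hosted API is roughly a one-call change with the openai Python client; the audio file name here is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a recorded clip to the hosted Whisper model instead of transcribing locally
with open("recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```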