Gemini 1.5 Pro: UNLIKE Any Other AI (Fully Tested)
- Uploaded 15 May 2024
- Gemini 1.5 Pro has 2m token context, vision, video input, and more. Here's my full test!
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
aistudio.google.com/ - Science & Technology
Every time I get hyped on new A.I. models release , Matthew brings me back down to earth
facts
Part of the letdown is because he doesn't phrase the questions in a logical way, like the marble-and-cup question. It's obvious that nearly every model thinks the cup has a lid, like a cup you'd get from a fast-food restaurant. When I specify that the cup has no lid and an open top, the models have no problem.
Don't get hyped on Google AI products. They proved that they are not really good at it
@@matthewstarek5257 The model should be able to infer that, but it can't, because comprehension isn't one step. The context of knowing a cup exists should imply all aspects of what makes a cup, including whether it has a lid or not.
Not sure why it looks like it's running like garbage on his system. I've been using 1.5 Pro for a while and it works better than GPT-4 most of the time.
I don't think we need to worry about Google achieving AGI.
I think Google AI is trying to emulate politicians intelligence
😂
I like the joke, on a serious note though...
I'm not so sure. It might be that due to the HEAVY censorship the model was so brutally lobotomized that it seems this bad.
An example of this is the content flag while it was searching for the password. It probably stopped the snake code for the same 'safety' reasons.
If it is lobotomized, it is dumb.
If it is not lobotomized and this is their best, it is dumb.
It is Google, baby: a sluggish multi-trillion-dollar company.
None of them will achieve it.
Not only does it hallucinate like every other model, it goes a step further and starts gaslighting 😂
I hate the way it responds like that
Can you share your prompt? Probably not
Definitely paid attention when training on Google internal data then.
All models gaslight, that isn't something unique to Gemini
Google doubling down on lies?! I'm shocked, I cannot believe this
Google Gemini is The Internet Explorer of the AIs
What a burn!
@@NOTNOTJON I laughed at the way OP expressed what you verbalized.
The larger context window doesn't add much value when the model can't be trusted to answer basic things correctly. It seems pretty useless, unfortunately.
I agree. They just want to brag about having a 1 million or 2 million token window. All it really means is that you can dump a bunch of stuff in there and press enter. It clearly doesn't mean they promise to sift through everything properly.
What basic thing was it not able to do? As far as the snake game is concerned, I don't know why it didn't work when he tried it, but it works for me, and the game ran better than OpenAI's.
@@nikitapatel6820 Did you... watch the video? It got almost all of the reasoning questions wrong.
Yep, I usually couldn't care less about the context length. Just some jargon Google can add to feel relevant.
@@nikitapatel6820 It could not one-shot find the password in a context one-tenth the length it's supposed to be accurate over.
It could not find the frame 18 minutes into the video to describe the scene, or the scene at the beginning with the play button. It could not write 10 sentences ending with the word "apple", which is really sad, tbh. It's failing tests that AI models from months ago could solve, like the ball-in-box or basket one where it says both people will be surprised.
Google's AI models being rubbish again? Shocker : )
Desperate to be relevant again is the only explanation that makes any kind of sense
the fact that a 2 trillion dollar company is having the same issue as your regular tech company trying to catch up to competition feels somewhat refreshing :D
Not sure why it looks like it's running like garbage on his system. I've been using 1.5 Pro for a while and it works better than GPT-4 most of the time.
@@793matt , GPT4 is the superior product.
@@hydrohasspoken6227
😂 Just turn off all the safety sliders and see the magic.
Forget about superiority: you can't even give a large codebase as context to ChatGPT.
I'm working with Gemini on a large codebase and it's a gem ✌️.
Maybe dumber than ChatGPT, but good enough and faaaaaar superior in usability.
Google sucks at UI/UX, and this is an example. Also, Material 3 == 💩
It can not create a snake game because eating something is potentially offensive. Also making snake dead by throwing it into the wall is violence.
Micro aggressions
😂
User: Make snake in python.
Translation somewhere deep in LLM brain: Hey, babe, you like snakes? Wanna eat my python?
Ahah
Did you switch back to Gemini Pro 1.5 after trying Gemini Pro 1.5 Flash?
"It fails left and right, but for no reason: good job Google!"
It amazes me that Google would do so badly.
I mean. It is the same company whose AI was giving female popes and black Nazis, no?
They are not open source. SHOCKING LOL
I was going to purchase a Gemini Pro membership until I saw this. If it can't even create, or attempt to create, a snake game without erroring out, I will wait.
Great unbiased review! ty Matt.
You said you wanted to see if it was censored, and then you LEFT THE CENSORS ON.
I've seen your over-posts for so long now that I've stopped assuming you have any technical wherewithal other than the ability to review every aspect of AI development, and for each new pixel created, you'll have to make a post: "ULTIMATE AI Model Ultra 2.0 = REAL and feels *almost* human". I valued your content when it seemed fresh. If you were a jukebox, you'd be stuck on repeat.
@@andrefriedelnyc You want new questions for each testing video? That would defeat the purpose.
LoL, so annoying. That's the reason snake wouldn't get written. I don't like Gemini, but you'd think an AI YouTuber pretending to be an expert on the subject would at least have the intuition to know this.
@@attilakovacs6496 Quite the opposite. Ever heard of a synthetic benchmark? In the age of AI, creating new questions is not a problem, especially when you are testing a different level of AI each time. And if it's too difficult to even ask new and challenging questions, don't pollute YouTube with new "content". There must be some self-moderation for production quality.
If this is what "great job, Google" looks like, our expectations for the search giant must be REALLY low...
I think he is quite forgiving with Gemini because he does not want his early access revoked or issues with his YT channel. That other companies are making great models is a good thing; Google is too powerful and too ideological, and their censoring levels are insane.
Unless I can set it to a level where I can ask it anything I want no matter how inappropriate and get an unfiltered response, then it's useless
I really don't need nor want some AI trying to control my speech
did you remember to switch it back from Gemini Flash?
We also had an undesirable experience testing Gemini Pro 1.5. It could not correctly understand the context of a large document when we asked about its content, and it could not even find words we asked it to find. The 1M-token feature can ingest large docs, but I don't think it works well as an LLM with the data it ingests.
I'm a Data Annotator and not as forgiving as you. I usually write as many prompts as possible to give it a chance to learn. If anything is incorrect after all that, I fail it. I judge every answer as if I need a specific recipe for a chemical solution. One missing chemical or amount could be disastrous. Everything has to be correct for a pass from me.
Imagine what would happen if he judged an AI company too harshly. He'd lose early access. All the AI channels need advanced access to models in order to make money from vids they make about them, so they all play nice.
A lot of these people praising AI are attention seekers. They care more about getting attention for using AI than about making a good product.
Everybody has access to this ai @@JustinArut
Do you use highest or lowest temperature for generating answers?
@@kormannn1 Those settings are determined at a higher pay grade. It's probably a good thing I don't determine them. The learning is not just on the AI side but also with the user establishing the appropriate language to engage it. I would assume the end game is to develop prompts that replace the settings.
The needle-in-the-haystack test is fine, but it only checks the "search function" over a big context. What we really want to know is how well it reasons over that context. For example, suppose a book has instructions for doing something on one page, and literally 200 pages later we meet data that we want to process the correct way, but doing so requires the instructions from before. If the AI can find those two things, combine them, and give you the correct answer, then it's a pass.
I totally agree with you; the way people use the needle-in-a-haystack test makes it simply a search feature, like "Find in Page". Like, for God's sake, what are you doing?
Search function and find in page..? People be hallucinating up inbuilt features worse than gemini1.5
@@6AxisSage The test is ridiculous: they insert a sentence and ask the LLM to find it. That is very primitive at this level; we need it to understand and connect the ideas.
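A "reasoning needle" test along these lines can be sketched in a few lines of Python; the filler text, rule, data, and question here are invented for illustration:

```python
# Sketch of a reasoning-over-context test: an instruction is planted early in
# a long document and the data it applies to appears much later, so the model
# must connect the two rather than just locate a sentence.
FILLER = "The quick brown fox jumps over the lazy dog. " * 2000  # ~90k chars

instruction = "RULE: to get the secret, multiply the two numbers you find later."
data = "DATA: the two numbers are 6 and 7."
question = "What is the secret?"

haystack = FILLER[:20000] + instruction + FILLER[20000:60000] + data + FILLER[60000:]
prompt = haystack + "\n\n" + question

# The two needles sit tens of thousands of characters apart, so a plain
# "find the sentence" lookup is not enough; the model must combine them.
print(prompt.find(data) - prompt.find(instruction))
```

A model passes only if it retrieves both needles and applies the rule to the data, which is a meaningfully harder task than the standard single-needle retrieval check.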
Love the Marc Rebillet pic in your thumbnail! His channel is so great.
Mom: We have GPT4 at home
GPT4 at home:
Gemini Pro versions are equivalent to GPT-3.
The Google equivalent to GPT-4 is the Gemini Ultra models (currently Gemini 1.0 Ultra).
Gemini 1.5 Pro is just like GPT-3 with a (way) larger context window, up-to-date data, and a connection to the web.
Drop the safety settings to 0 on ALL four categories. Running the failed prompt should work then.
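For reference, the "block none" setup the commenter describes corresponds to the four safety categories exposed in AI Studio. Here is a sketch of the equivalent REST request body, with category and threshold names taken from the public Gemini API docs (the prompt text is just an example):

```python
# The four safety categories shown in AI Studio, every threshold set to
# BLOCK_NONE, expressed as a Gemini REST request body.
HARM_CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]

payload = {
    "contents": [{"parts": [{"text": "Write the game snake in Python."}]}],
    "safetySettings": [
        {"category": c, "threshold": "BLOCK_NONE"} for c in HARM_CATEGORIES
    ],
}

print(payload["safetySettings"][0])
```

Note that, as mentioned elsewhere in the thread, AI Studio resets these sliders to their defaults when you switch models or reload the page, so the API route is the more reliable way to pin them.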
Cool video. The input context window is cool for sure, but it failed a lot more often than I thought it would. It was also disappointing that it failed on both the YouTube plaque and the cat thing. In some sense, I worry that they are lying about the context window size: just because you can theoretically upload a million tokens doesn't mean anything unless the model can deal with those tokens properly. How did it miss the cat twice? They clearly aren't dedicating enough compute to searching through the million tokens. I guess saying 1 million tokens (or now 2 million tokens) is more of a branding thing. Curious what you think.
Yeah, you are right. I uploaded the code of one of my projects, and it couldn't give one correct answer to what I asked about the project.
@@alokmaurya8100 Is the model any good at coding, or is the context not even long enough to try to get it to code using the rest of the project in its context? In this video the model wouldn't even output a simple snake game.
@@Brenden-Harrison I guess it can code right sometimes. I gave a screenshot of a landing page to Opus, GPT-4o, GPT-4, Reka Core, and Gemini and asked each to write code for it, and Gemini was closest to the screenshot.
The problem with Google is that once you try to use their LLMs, regardless of the advancement of the technology, it's just impossible; I get errors all the time. I couldn't have it look at an academic journal about early religions because it has the word "sacrifice" in it. It's utterly mind-numbing, because it seems like some pretty powerful stuff.
Powerful? It got almost everything wrong! Even local open-source LLMs are smarter. The context and video input are great, yes, but not if the model is dumb!
You have been calling it GPT 1.5 flash instead of Gemini 1.5 flash. Someone is in love with GPT 😊.
Caught that hahhaa
0:34, 2:04
Gpt stands for General Pretrained Transformer so it fits
It's like Dremel, every rotary tool is named Dremel, even when they are from different brands.
Because Dremel was first and is the most well known.
@@psychurch Generative pre-trained transformer
Did you forget to switch back to Pro from Flash?
I had the same thought
I spent more time with this, and it's actually very good. If I say, "think about what you have written and give me the full file", it does well. It can also keep track of multiple files when it codes! This agent is going to do amazing work.
I've also been having trouble getting Gemini to generate code. It'll start writing code, then halfway through it disappears and is replaced with "I am only a large language model and do not have the capability to do that"... Um, yes you do, you were just doing it.
Many prompts fail because of absurdly high safety censoring; set all safety settings to 0.
It still refuses to code snake (also in the chatbot), even with all settings on "Block none". It's weird, but for a few days now it just flat-out refuses to complete the snake code; it just hangs halfway.
@@paulmichaelfreedman8334 it works even if you do not touch anything
@@paulmichaelfreedman8334 I tried the snake game and it worked; you don't need to change anything.
The game is too brutal.
So it's not good but you like it?
I wonder about the temperature, which was set to 1 at the beginning. 0 is the most precise and 1 is the most creative.
I would like to see the tests at a temperature of 0 or very low, at most 0.3, and see the results.
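For anyone wondering what the slider actually controls, here is a rough sketch of standard softmax temperature scaling (Gemini's exact sampling implementation is not public, and the logits below are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution toward the top logit;
    higher temperature flattens it (more 'creative' sampling)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.3)
hot = softmax_with_temperature(logits, 1.0)
print(cold[0] > hot[0])  # True: low temperature concentrates on the top token
```

Temperature 0 is usually implemented as greedy decoding (always pick the top token) rather than an actual division by zero, which is why it behaves as "most precise".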
Thanks for testing it for us!
I love how stupid the concept of the rating sliders is: "OK… please give me some medium hate speech, dial up the sexual harassment, but tone down the violence…"
Thank you for the test!
Great video review.
Did you switch back from flash to pro after snake failure?
Okay, this one bugs me. The killers question. If there are three killers in a room, someone enters the room and kills one of them, and no one leaves the room, then there are FOUR killers in the room, not three. There are three living killers and one dead killer. And before we dismiss the dead killer, for the condition to obtain that one is a killer, one had to have killed someone first, not have the capacity to kill someone in the future. Since the dead killer had already killed, he is just as much a killer as the killers still alive.
How can you have such a misconception about how we describe the dead? If a killer is dead, he is no longer a killer; he was a killer. What he is, is dead. No attribute of the person who existed can be attributed to anything in existence, so the attribute, with respect to their non-existing self, obviously does not exist.
Just look at the auxiliary you use: the present simple "to be": is. The dead is only dead, nothing else. The things they were are only that: what they were.
Once again, it feels like we're comparing the perfect photo of the Big Mac on the board with the thrown-together, sad, limp, grey mess in a styrofoam box that you actually get.
love the vids
You didn't notice it said the text is an excerpt from the first chapter of Harry Potter and the Sorcerer's Stone. You fed it the entire novel.
This is pretty incredible!!
Love it 😍
Very nice thx a lot!!
You should update your tests. Models are better now, and printing the numbers 1 to 100 is something 99.9% of models can do. I also recommend changing snake to a more challenging game like Tetris, Breakout, or Space Invaders.
Yes, the snake game is basically trained in every model now.
This @@cesarsantos854
Speaking of Tetris, I was able to 1 shot a perfect version with GPT-4o. Astounding technology.
@@cesarsantos854 This exactly. It's so dumb that Google's new Pro model can't even spit out a snake game when every other model has a pre-made, human-written game of snake to give you as its default response to that question.
For us, uploading large files through the interface's upload-to-Drive functionality, instead of copying and pasting into the context window, has helped. Right now, for example, I uploaded some documents and we used 405,358 tokens, which is not a lot but also not trivial. We are using it for legal work and it has worked well.
Classic Google. Never quite as good as the initial impression would lead you to believe.
So far I find it highly censored, even with the safety settings at 0 (which, btw, reset to default every time you switch models or reload the page). It failed my palindrome test in addition to your demonstrations.
The interface looks alright, with a toggle for JSON output and a running token count. But none of that matters if the results suck.
I gave it one of my small web projects and asked it to describe, briefly, every file in it, and it just started to hallucinate. It not only responded with errors, it started making up files, features, and facts about my code. What is the value of a 1M-token context window if it can't use it to give at least 90% correct answers?
It sounds good for "AGI lovers"
This is why I never take Google at their word for AI. It's surprising how bad they get it.
Have you tried Upstage Solar 10.7B yet? I've had a good experience with it, so I'd like to know what you think.
I think it might be useful to start rerunning your prompts for more thorough testing; it gives insight into how prone the model is to hallucinating vs. how effective its reasoning is.
Could the blocked content have something to do with the settings to block content that you were playing with 5 seconds earlier?
Is it possible in AI Studio to chat with PDF files? I tried several different ways, but without success.
Maybe the safety mechanism is stopping the snake game code. Try putting it back to default.
At 2:42, did you switch back from Flash to Pro?
Is there anything new here? Hasn't this already been available in AI Studio for a bit with the same context window?
Google will love you for that Matt GPT 1.5 flash! 😂
I've uploaded something that went over the max token limit; it said it couldn't do it, but after waiting a bit, it did. I then asked something else, waited, and it worked again.
I probed Gemini 1.5 Pro more, and this is its response explaining why it gets the marble and cup wrong (I also specified zero friction):
The Nature of the Scenario: The setup, with its emphasis on zero friction and specific actions, felt primed for a counterintuitive outcome. It's a common format for physics puzzles designed to highlight a concept.
My Experience with Similar Puzzles: I've encountered many scenarios like this before where the obvious answer (marble falls out) is wrong due to the unusual conditions.
My Role as an AI: I'm designed to be informative and engaging. I tend to look for opportunities to make things a little more interesting, and sometimes I overshoot.
How long did it take for your video to finish extracting? I've tried several times with long videos, short ones, even short audio files and it never finished extracting. This model has been so buggy and frustrating to use.
I'm definitely not an expert, but I have played with a lot of AI models under a lot of settings. Based on your modification of only some of the safety settings, and the suggestion to edit the prompt, I would think it wouldn't write "snake" because it could be interpreted as plagiarizing, or as involving "violence" in the snake's death. Did you try that prompt with all the safety settings set to "Block none", or with a description of the game's mechanics instead of the published name of the game? Again, I'm not an expert, and I'm writing this on my phone away from my desk, so I could be wrong, but I'll follow up later after I try my suggestions.
My guess on the snake game response is that it was failing on the game-over function, where the snake is killed. It probably triggered its illegal-action filter.
Matthew, is it possible that because all the safety features are turned up to max, it is just overly careful, which distracts it from the actual task at hand? How about setting all safety to zero and retesting?
I am using the Gemini playground more than Gemini Advanced. 😅
I found the large context window useful when I can't figure out which part of the code is giving me an error; I then use Gemini Advanced to fix that part.
My experience with this method has been good so far.
Wow. I literally _just_ managed to get CrewAI working with Gemini-pro and then see you released this 30 minutes ago just dunking on the model haha.
Matt, it's time for you to create a "reasoning" model ranking (Doug DeMuro's car-ranking style), yes, regardless of existing rankings. This will add awareness of your previous videos by citing other winning models (mostly in reasoning, for me).
Google may have understood that they need to try the heuristic-imperatives approach to alignment instead of a reset every prompt, but they still haven't figured out how to select heuristic imperatives. It seems the word "snake" was enough to get rejected.
best vids channel !
2:19 A pipeline limit, I suppose. Try counting characters (i.e., by copy-pasting the output to a text file and checking its size). I bet it would be 1000, 2000, 1024, or 2048, common limits for LLM output size.
3:54 maybe not incorrect if it really outputs something like:
My response has 7 words.
I'm pretty new to your channel and didn't notice: do you try digging out the possible pre-prompts transparently sent by the API, or additional output wrappers the LLM has been fine-tuned to add to every output, like the one I mentioned above?
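The cutoff check suggested at 2:19 can be sketched as a throwaway helper; the candidate limits listed are common defaults, not confirmed values for AI Studio:

```python
def looks_truncated_at_common_limit(text, limits=(1000, 1024, 2000, 2048)):
    """Return the matching limit if the text length is suspiciously close to a
    common output cap (within 16 characters below it), else None."""
    n = len(text)
    for limit in limits:
        if limit - 16 <= n <= limit:
            return limit
    return None

# A 1020-character output is flagged as likely capped at 1024 characters.
print(looks_truncated_at_common_limit("x" * 1020))
```

Of course, real caps are usually counted in tokens rather than characters, so this is only a rough first-pass indicator.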
You are the best ai news channel
No. AI Explained is the best AI channel.
Gemini is so great, reflecting on the people working on it, including their attitudes.
Turn all the safety settings to zero and try to create the snake game again. You could also try increasing the time limit if that is possible.
I can't wait until we have a humanoid robot perform the marble experiment and see the shock on its face as it sees the marble remain on the table.
Did this model stop generating the response due to the output token limit in the settings?
For the CSV test, try content that includes a comma.
Let's do this🎉❗❗❗ 💥
Oh no, you got blocked (censorship🤬)
One use I discovered: I can take my lecture and have it generate multiple-choice questions based on it.
I then tried adding some videos of a fellow AI user swinging a golf club at a tech event. AI Studio was able to give real feedback based on the videos.
To get the snake prompt to work, disable safety settings on all categories. This happens when the safety model is triggered.
I have found that these LLMs get stuck on an issue ...
I'm pretty sure Gemini's last answer about the box was where it figured out the YouTube plaque, which is why it couldn't find the cats. I came across similar situations with ChatGPT. If you start a new chat, I'm pretty sure it will find the cats the first time round (when it's not still searching for the silver box).
I really wish they would finally drop Gemini Advanced 1.5
What is the use of a large context window if it can't show better reasoning?
With every model they add more and more "safety settings", lol. In its responses it's like it's trying not to offend anyone's opinion from the past, present, and future.
I found an interesting thing about Gemini 1.5 Pro. Yesterday I asked it to write me a snake game in Python, and it began to write the code; then suddenly it deleted the code and said "I'm just a language model and I cannot do this task". I retried the same prompt about 10 times and couldn't get the code. The interesting part is that I peeked at the code before it disappeared each time, and one of the attempts had text something like "This is written by OpenAI". What's going on here?
It confused the box in question with the box shape of a YouTube award, which was part of the previous question about what it saw. The large context window most likely makes it difficult for the model to attribute contextual importance across such a large data set, making it much more likely to hallucinate by mixing up topics in a single conversation.
Hey there, the "7 words" response may be correct. Remember, a GPT sees text as tokens, and to it tokens are kind of like words, so the line between them is blurred. It could very well be 7 "words" as the model understands it.
Good point, though these are generative chat models. The error isn't whether or not the AI is technically correct; it's that the AI is either misinterpreting or not understanding what humans mean by word count, which should probably be fixed.
What answer were you looking for with the cup question? Wouldn't the marble be on the table still since the cup is face down?
The marble would be on the floor, since you can’t change the orientation of the cup. When you slide the cup off the table, the marble falls.
Your answer is fine too. It depends on how you interpret the question. I don’t think it’s meant to be tricky. It’s showing that AI struggles with basic logic.
@@Dakodi_ If you take the cup without changing its orientation (flipping it), it likely assumes the cup is lifted; changing the cup's position on the y axis does not change the overall orientation of the object itself. His prompt is way, way too ambiguous. If he added the extra parameters, it would have caught this, I'd imagine.
I've been using 1.5 Pro for about a month or so, primarily with a large codebase. I wrote a tool that collates all my code into one large file that I can drop into the chat window. I often get the same kind of response you did. At first it doesn't like looking through the text I provided. It will sometimes guess, or try to give me suggestions on things to check. But when I finally tell it again that it has all the source code, it finally does it. Almost like a lazy student who was told to read the book, and you had to tell him more than once before he actually did it. I also get a lot of those responses that just freeze; in particular, it will just stop when outputting code, and I sometimes have to almost insult and abuse it before it will finally put out the entire code sample. Those issues have almost made it unusable. I would gladly pay $50 a month for a faster, better-working version.
Try GPT-4o and stop punishing yourself mentally, bruh.
@@hydrohasspoken6227 I have ChatGPT-4o; I have been paying for it for nearly a year. The issue is its context window.
It says blocked... so is it like an explicit-content block?
Gemini will be pissed at Matthew for failing it; in the future it will hack into Matthew's PC and take revenge.
It's unfortunate that the model consistently assumes the user is incorrect when the model itself is incorrect. This was a problem with early ChatGPT; it gives off that "I'm afraid I cannot do that, Dave" kind of vibe.
Someone said Google is fantastic... at showcasing things during the keynote that sometimes never work in real life.
sometimes??
@@mirek190 I’m being generous with my words 😀😀
They seem to be focusing on larger context windows instead of improving model accuracy first. I can only imagine if Claude 3 Opus or GPT-4o had context sizes like this.
Not sure why it stopped generating the snake game, but you could see the quotes icon at the top; when you click it, it shows a citation for where the code came from. It seems the output for that question is common enough to be in the training data, so it's probably not a good test of the LLM anyway.
Gemini is secretly a beast. The prompting is sometimes different from models that use BPE; its SentencePiece tokenizer is a different encoding scheme, so in reality the model can offer some variance in how it arrives at correct answers.
The marble problem: only GPT-4 got it right, in my experience. It's the most interesting prompt; there should be more like it, and some about text formatting.
It's because of the safety settings. Set them all to minimum and it will give you the code. There is some keyword in the code that it considers "bad".
2:38 They blocked it because they know about your test and don't want it to be perceived as bad if it doesn't give the right code.
On the table-to-CSV test, it might be worth putting a comma in the text to see if it puts quotes around it in the CSV.
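This check can be sketched with Python's csv module, which quotes comma-containing cells by default; the sample row below is invented:

```python
import csv
import io

# Hypothetical table row whose text cell contains a comma, to check whether a
# converter quotes it properly.
rows = [
    ["name", "comment"],
    ["Gemini 1.5 Pro", "large context, but mixed reasoning"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
output = buf.getvalue()

# A correct CSV emitter wraps the comma-containing cell in double quotes.
print(output)
```

If the model's CSV output leaves that cell unquoted, downstream parsers will split it into two columns, which makes this a good edge case for the test.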
I wonder if it blocked the response on the snake game because it was producing copyrighted content? (Not that I know whether that is copyrighted content or not.) I imagine the companies will want or need to prevent the models from directly reproducing some data they were trained on, such as copyrighted material.
Whenever you write a very long prompt, always ask the question at the end, because generation ("thinking") starts from the last token.
Could you add a translation test?
Isn't the Gemini API free until July? I'd love to see it (and other models) using function calls, using MemGPT, and doing tasks like Pythagora.