I pressure tested GPT-4's 128K context retrieval

  • Published 23 Jul 2024
  • Get updates from me: mail.gregkamradt.com/
    FullStackRetrieval.com
    Tweet write up: / 1722386725635580292
    Code: github.com/gkamradt/LLMTest_N...
    Check out how GPT-4 does at retrieval with 128K tokens worth of context.
    Lost In The Middle: www-cs.stanford.edu/~nfliu/pa...
    Greg’s Info:
    - Twitter: / gregkamradt
    - Newsletter: mail.gregkamradt.com/
    - Website: gregkamradt.com/
    - LinkedIn: / gregkamradt
    - Work with me: tiny.one/TEi2HhN
    - Contact Me: Twitter DM, LinkedIn Message, or contact@dataindependent.com
  • Science & Technology
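The linked repo contains the actual test harness; as a rough, self-contained sketch of the core idea (all function names and the filler text below are illustrative, not the repo's API), the test inserts a "needle" sentence at a given depth of a long context and asks the model to retrieve it:

```python
# Minimal sketch of the "needle in a haystack" setup (illustrative only;
# the real harness lives in the linked LLMTest_NeedleInAHaystack repo).

def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Place the needle roughly depth_percent of the way into the text."""
    cut = int(len(haystack) * depth_percent / 100)
    # Back up to a sentence boundary so the needle lands between sentences.
    while cut > 0 and haystack[cut - 1] != ".":
        cut -= 1
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

def build_prompt(context: str, question: str) -> str:
    return (
        f"{context}\n\n"
        "Answer the following question using only the context above.\n"
        f"Question: {question}"
    )

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
doc = "Paul Graham essay text would go here. " * 1000  # stand-in haystack
prompt = build_prompt(insert_needle(doc, needle, 50),
                      "What is the best thing to do in San Francisco?")
```

The harness then sweeps context length and needle depth, producing the heatmap shown in the video.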

Comments • 73

  • @gardnmi · 8 months ago · +21

    OpenAI should be paying you $200 an hour for this type of in-depth analysis of their model.

    • @DataIndependent · 8 months ago · +4

      ha! I wish, but thank you. I'm sure their product analysts are doing the same (and more) as we speak

  • @SaileshB · 8 months ago

    What an amazing quality video! Thank you for this test

  • @Adhithya2003 · 8 months ago

    Superb test. Thanks for doing this.

  • @ultraprim · 2 months ago

    Brilliantly executed. That graph is incredibly intuitive and information dense.

  • @thenoblerot · 8 months ago · +8

    Sweet! Thank you for investing the $$$, time, and effort in this! It's great to have these data points. tbh I'm just a hobbyist with no actual use case for such a large context length at the moment, but I agree it's valuable to gain an intuition about how these models behave.
    I've been curious about how annotation and structure affect retrieval at such long context length. In your example the retrieved information was unrelated to the main text and placed randomly. I wonder, in a situation where context and query are related, would performance increase if the LLM was given a document formatted more like a book (or github repo), with a leading Table of Contents, and maybe even a trailing index? We know that the GPT-x models love markdown structure, would that make a difference? Anyway, there are endless experiments one could run, and I'm sure we'll be seeing more research papers soon. My instinct says vector, metadata, and even purely text-search-based retrieval will remain valuable regardless of how large context lengths get. Why wouldn't you try to increase signal to noise if you have the tool?

    • @DataIndependent · 8 months ago · +3

      I totally agree! I can't think of a use case of long context that wouldn't benefit from retrieval to increase the signal to noise ratio.
      There are so many variations of this to try, if I had $400K we could put together a pretty well researched test of tons of permutations to build up an intuition. But there are a lot of other ways to spend that kind of money which would leverage more value too....;)

  • @ShaidaMuhammad · 6 months ago

    You are doing great work brother, keep it up.

  • @andreyseas · 8 months ago · +6

    Nice! This is much needed content. Not enough people are talking about this. I'd be curious to see how Claude 2 100K context retrieval would compare.

    • @DataIndependent · 8 months ago

      Thanks Andre! It was a fun test to do

    • @scharlesworth93 · 8 months ago · +1

      Actually I heard of the 'lost in the middle' issue in the context of Claude 2 100K...so now we know it affects OpenAI too....

    • @andreyseas · 8 months ago

      Interesting! Will have to dig into that more myself. Thanks for letting me know! @scharlesworth93

  • @justjaeisfine · 8 months ago · +1

    Love the analysis Greg! It's great to see your action taken on these questions. Would love to see it on a new Claude 200k context

    • @DataIndependent · 7 months ago · +1

      Thanks Jon! Of course, here ya go: twitter.com/GregKamradt/status/1727018183608193393
      Same process, different model.
      BTW it was awesome working w/ ya in our prior lives

  • @jourdainlouis8553 · 5 months ago

    Great content and very solid reasoning! Will definitely try this approach on free-to-use models! The needle-in-the-haystack approach is great for challenging LLM retrieval abilities; however, it might be easier for the LLM to do well because the sandwich in San Francisco "semantically stands out" from the rest of the essay. It would be interesting to ask the LLM for a precise piece of information already contained in the given context (Graham's essay) to make the task closer to a user's real need.

  • @alexanderroodt5052 · 8 months ago · +1

    Love the enthusiasm. Most people I know think this stuff is boring.

    • @DataIndependent · 8 months ago · +1

      ha - totally, I can't tell if I'm brainwashed or what

  • @bvdlio · 8 months ago

    Very well made video, great research. Thanks for the investment!

  • @maof77 · 8 months ago · +1

    Great video. Would be interesting to see how the model performs if you place 2-3 "needles" in the text at different positions. Would help to know for ordering responses in RAG with large chunks.

    • @DataIndependent · 8 months ago

      Totally, that would be a solid test. There are so many variations I’d like to do but would cost a ton of $$

  • @raregear · 3 months ago

    This has become the STANDARD retrieval benchmark on every major model release

  • @PrimeMindAI · 7 months ago

    Good stuff!

  • @kai_s1985 · 8 months ago

    Thanks. Would be interesting to see the performance when the sentence is placed not as a new line, but as a continuation or inside of a paragraph. New line might be easier to detect.

    • @DataIndependent · 8 months ago

      yeah...I did sentence breaks with new lines so the results would definitely change

  • @adamgdev · 8 months ago

    Solid video. Would've loved to see some of the tweet replies discussed in this video.

    • @DataIndependent · 8 months ago

      Thanks Adam! Totally! I should have included that - I might do a recap video on it

  • @SuperYutubu · 8 months ago

    Awesome!

  • @hitalex07 · 5 months ago

    Excellent work! Do you plan to do this test with the new Gemini Pro 1.5 model?

    • @DataIndependent · 4 months ago

      Yep - totally, once it comes out and there's access

  • @jeffwads · 7 months ago

    Have you tried running this test on the great Mixtral MOE?

  • @MridulBanikcse · 3 months ago

    How are you evaluating the score?

  • @Li-rm2gj · 8 months ago

    It’s an interesting result, thanks for doing it. Any ideas why the conclusion seems to be different from the lost in the middle paper? Are the takeaways contradictory?

    • @DataIndependent · 7 months ago

      There is so much variability with these tests it is tough to pin down what it would be

  • @DannyGerst · 8 months ago · +1

    Would you like to make the script public? It would be awesome to test other models with. I am on your mailing list, but where do I need to sign to get that on a silver platter ;-)
    Oooh, I found it. Thanks for putting it in your repo. Very valuable.

    • @DataIndependent · 8 months ago · +2

      I just put the code in the description! Thanks for the call out

  • @dcrebbin · 8 months ago

    absolute mad lad

  • @hidroman1993 · 8 months ago · +1

    I guess a needle should be something the model can't make up, e.g. "The answer to the question that you are going to be asked is '9h550klz2a6'"

    • @DataIndependent · 8 months ago · +2

      When I was getting feedback on this after the test I was told that uuid key value pair retrieval is the standardized test. Makes sense.
      I went for relatable in this version

  • @Joao-pm8je · 8 months ago

    Awesome test. Really clean and concise. I would recommend getting rid of the side camera, since it adds nothing of value and looks a bit weird. Cheers!

  • @Unknown16633 · 8 months ago

    Nice analysis! How exactly did you measure the correctness of the answer?

    • @DataIndependent · 7 months ago

      I used LangChain's eval which was easy
      github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/73ffdd4dd2d190d9306d9162a2401ae9a067ddcf/LLMNeedleHaystackTester.py#L369
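The linked LangChain eval asks a grader LLM to score each answer; as a dependency-free stand-in that only illustrates the scoring idea (this keyword check is purely hypothetical and much cruder than the LLM grader the harness actually uses):

```python
def score_answer(model_answer: str, expected_keywords: list[str]) -> float:
    """Crude grader: fraction of expected keywords present, case-insensitive,
    scaled to 0-10. The real harness grades with an LLM instead."""
    answer = model_answer.lower()
    found = sum(1 for kw in expected_keywords if kw.lower() in answer)
    return 10 * found / len(expected_keywords)

# A needle-matching answer scores 10, an off-topic answer scores 0.
score_answer("Eat a sandwich and sit in Dolores Park.",
             ["sandwich", "dolores park"])
```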

  • @Shoaibkhan-oj3oe · 8 months ago

    Could you please add details on how this can be done with Azure OpenAI on your website? I tried the LangChain extraction chain with Azure OpenAI but I was not able to run it; I had to resort to functions to do that. Can you please tell me how I can effectively extract only certain data from the script? Also, how do I work with tabular data with LLMs?

    • @DataIndependent · 7 months ago

      Here's the code, you can fork it and make azure the model provider
      github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py

  • @rosendoduron4753 · 4 months ago

    Excellent test! I have enjoyed watching your videos. I would ask if you could do the same with Claude 3? It would be nice to see a comparison.

    • @DataIndependent · 4 months ago

      Nice! Thank you - I haven't done it with that one yet but it's on the to do list

  • @hermannschmidt9788 · 8 months ago

    Interesting! I'd say that large context windows always beat all the previous tricks like chunking + summarizing to shrink the data size.

    • @DataIndependent · 8 months ago

      I don’t know on that actually - I’m a big fan of investing in retrieval to get better signal:noise
      I haven’t seen a use case that requires 128K tokens of context that wouldn’t benefit from better retrieval

    • @hermannschmidt9788 · 8 months ago

      Lossless retrieval into large context windows, yes.

    • @DataIndependent · 8 months ago

      @hermannschmidt9788 sure, once it can recall & synthesize 100% accurately from long context then it'll be a way different conversation

  • @ashlynnantrobus5029 · 8 months ago

    I just saw your graph for Claude2.1. That was a lot of red. Also interesting that it was almost all 100% or 0%, with very little in between.
    For the failures, were you getting a lot of Claude claiming it couldn't do that kind of task? That's probably the most common response I get from Claude for any given task

    • @DataIndependent · 8 months ago · +1

      I'd get a lot of this type of response
      "Unfortunately, the context does not mention the most fun thing to do in San Francisco. It discusses the history and design of the Lisp programming language, web and mobile application development, and starting technology companies. There is no information provided about activities or attractions in San Francisco specifically. Without any relevant details to draw from, I cannot provide a direct answer to the question asked."

    • @ashlynnantrobus5029 · 8 months ago

      @DataIndependent so it's actually trying, but just looking for it the way my kids do (I promise you, your shoes are not in the ceiling. You can stop looking there)

  • @vaidyanathanag6463 · 8 months ago

    Can you link the paper for reference ?

  • @feralmachine · 8 months ago · +1

    PG writes about SF occasionally. Wouldn’t this test be more definitive if you had changed the city name to one that PG has never written about? Perhaps the error rate simply increased with context length because more of PG’s opinions on SF got included in the text, and not because of hallucination. That aside, thank you for doing this test and how much $ it cost. I think this kind of thing is the most interesting and valuable kind of content. 🙌

    • @DataIndependent · 8 months ago

      Nice! Thank you for that and you're totally right. Small variations in the text/question would produce different results.
      I was told after the test was done that I could have done a key:value pair UUID retrieval
      Ex: "What is the value for this key? ad1491f3-d899-495b-8fea-7f07a7c6a602?"
      But that felt boring even if it was technically more correct. My goal was to kick off the conversation rather than claim definitive results (hence a tweet write up vs a paper and peer reviewed)
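The key:value UUID variant described above could be sketched like this (purely illustrative, not the author's harness): random UUIDs guarantee the model cannot answer from prior knowledge, only from the context.

```python
import random
import uuid

# Sketch of the key:value UUID retrieval variant. Each needle is a
# "key: value" pair of random UUIDs scattered into the haystack text.

def make_uuid_needles(n: int) -> dict[str, str]:
    return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n)}

def scatter(haystack_words: list[str], needles: dict[str, str]) -> str:
    words = list(haystack_words)
    for key, value in needles.items():
        pos = random.randrange(len(words) + 1)
        words.insert(pos, f"\n{key}: {value}\n")
    return " ".join(words)

needles = make_uuid_needles(3)
context = scatter("filler text for the haystack".split() * 500, needles)
key, value = next(iter(needles.items()))
question = f"What is the value for this key? {key}"
# A correct answer must quote `value` verbatim; scoring is exact match.
```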

  • @caiyu538 · 8 months ago

    👍

  • @hanzo_process · 2 months ago

    👍👍👍

  • @moxes8237 · 5 months ago

    What's temperature in AI language models?
    Also, congratulations on Google using your Needle in a Haystack method 👍

  • @michelefruscella7373 · 8 months ago

    In practice, we've built artificial brains, and now we need neurologists (data scientists) like you to study them.

  • @sanz1996_ · 8 months ago

    Now try the same test, but with content related to the document rather than some random sentence.
    That would be a more practical test.
    I uploaded a lengthy medical benefits document and the answers were close enough to a human's response

  • @mayanksingh3366 · 8 months ago

    Hi Greg, I hope we will meet one day. One more thing: I am 20 yrs old, so given my age, what should I call you? Uncle, Bro, or anything else?

    • @DataIndependent · 8 months ago · +1

      Let's go with 'Greg'. Whenever anyone opens up with 'bro' I close the message

    • @scharlesworth93 · 8 months ago

      Call him Unclebro

  • @Divyv520 · 8 months ago

    Hey Greg, really nice video! I was wondering if I could help you enhance the editing in your videos and also make highly engaging thumbnails to help your videos reach a wider audience.

  • @micbab-vg2mu · 8 months ago

    Thank you - very useful.