I pressure tested GPT-4's 128K context retrieval

  • Published 23 Jul 2024
  • Get updates from me: mail.gregkamradt.com/
    FullStackRetrieval.com
    Tweet write up: / 1722386725635580292
    Code: github.com/gkamradt/LLMTest_N...
    Check out how GPT-4 does at retrieval with 128K tokens worth of context.
    Lost In The Middle: www-cs.stanford.edu/~nfliu/pa...
    Greg’s Info:
    - Twitter: / gregkamradt
    - Newsletter: mail.gregkamradt.com/
    - Website: gregkamradt.com/
    - LinkedIn: / gregkamradt
    - Work with me: tiny.one/TEi2HhN
    - Contact Me: Twitter DM, LinkedIn Message, or contact@dataindependent.com
  • Science & Technology
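The linked repo contains the actual test harness; as a rough, self-contained sketch of the core idea (all function names and the filler text below are illustrative, not the repo's API), the test inserts a "needle" sentence at a given depth of a long context and asks the model to retrieve it:

```python
# Minimal sketch of the "needle in a haystack" setup (illustrative only;
# the real harness lives in the linked LLMTest_NeedleInAHaystack repo).

def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Place the needle roughly depth_percent of the way into the text."""
    cut = int(len(haystack) * depth_percent / 100)
    # Back up to a sentence boundary so the needle lands between sentences.
    while cut > 0 and haystack[cut - 1] != ".":
        cut -= 1
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

def build_prompt(context: str, question: str) -> str:
    return (
        f"{context}\n\n"
        "Answer the following question using only the context above.\n"
        f"Question: {question}"
    )

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
doc = "Paul Graham essay text would go here. " * 1000  # stand-in haystack
prompt = build_prompt(insert_needle(doc, needle, 50),
                      "What is the best thing to do in San Francisco?")
```

The harness then sweeps context length and needle depth, producing the heatmap shown in the video.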

Comments • 73

  • @gardnmi · 8 months ago · +21

    OpenAI should be paying you $200 an hour for this type of in-depth analysis of their model.

    • @DataIndependent · 8 months ago · +4

      ha! I wish, but thank you. I'm sure their product analysts are doing the same (and more) as we speak

  • @SaileshB · 8 months ago

    What an amazing quality video! Thank you for this test

  • @Adhithya2003 · 8 months ago

    Superb test. Thanks for doing this.

  • @ultraprim · 2 months ago

    Brilliantly executed. That graph is incredibly intuitive and information dense.

  • @thenoblerot · 8 months ago · +8

    Sweet! Thank you for investing the $$$, time, and effort in this! It's great to have these data points. tbh I'm just a hobbyist with no actual use case for such a large context length at the moment, but I agree it's valuable to gain an intuition about how these models behave.
    I've been curious about how annotation and structure affect retrieval at such long context length. In your example the retrieved information was unrelated to the main text and placed randomly. I wonder, in a situation where context and query are related, would performance increase if the LLM was given a document formatted more like a book (or github repo), with a leading Table of Contents, and maybe even a trailing index? We know that the GPT-x models love markdown structure, would that make a difference? Anyway, there are endless experiments one could run, and I'm sure we'll be seeing more research papers soon. My instinct says vector, metadata, and even purely text-search-based retrieval will remain valuable regardless of how large context lengths get. Why wouldn't you try to increase signal to noise if you have the tool?

    • @DataIndependent · 8 months ago · +3

      I totally agree! I can't think of a use case of long context that wouldn't benefit from retrieval to increase the signal to noise ratio.
      There are so many variations of this to try, if I had $400K we could put together a pretty well researched test of tons of permutations to build up an intuition. But there are a lot of other ways to spend that kind of money which would leverage more value too....;)

  • @ShaidaMuhammad · 6 months ago

    You are doing great work brother, keep it up.

  • @andreyseas · 8 months ago · +6

    Nice! This is much needed content. Not enough people are talking about this. I'd be curious to see how Claude 2 100K context retrieval would compare.

    • @DataIndependent · 8 months ago

      Thanks Andre! It was a fun test to do

    • @scharlesworth93 · 8 months ago · +1

      Actually I heard of the 'lost in the middle' issue in the context of Claude 2 100K...so now we know it affects OpenAI too....

    • @andreyseas · 8 months ago

      Interesting! Will have to dig into that more myself. Thanks for letting me know! @scharlesworth93

  • @justjaeisfine · 8 months ago · +1

    Love the analysis Greg! It's great to see your action taken on these questions. Would love to see it on a new Claude 200k context

    • @DataIndependent · 7 months ago · +1

      Thanks Jon! Of course, here ya go: twitter.com/GregKamradt/status/1727018183608193393
      Same process, different model.
      BTW it was awesome working w/ ya in our prior lives

  • @jourdainlouis8553 · 5 months ago

    Great content and very solid reasoning! Will definitely try this approach on free-to-use models! The needle-in-the-haystack approach is great for challenging LLM retrieval abilities; however, it might be easier for the LLM to do well because the sandwich in San Francisco "semantically stands out" from the rest of the essay. It would be interesting to ask the LLM for a precise piece of information already contained in the given context (Graham's essay) to make the task closer to a user's real need.

  • @alexanderroodt5052 · 8 months ago · +1

    Love the enthusiasm. Most people I know think this stuff is boring.

    • @DataIndependent · 8 months ago · +1

      ha - totally, I can't tell if I'm brainwashed or what

  • @bvdlio · 8 months ago

    Very well made video, great research. Thanks for the investment!

  • @maof77 · 8 months ago · +1

    Great video. Would be interesting to see how the model performs if you place 2-3 "needles" in the text at different positions. Would help to know for ordering responses in RAG with large chunks.

    • @DataIndependent · 8 months ago

      Totally, that would be a solid test. There are so many variations I’d like to do but would cost a ton of $$

  • @raregear · 3 months ago

    This has become the STANDARD retrieval benchmark on every major model release

  • @PrimeMindAI · 7 months ago

    Good stuff!

  • @kai_s1985 · 8 months ago

    Thanks. Would be interesting to see the performance when the sentence is placed not as a new line, but as a continuation or inside of a paragraph. New line might be easier to detect.

    • @DataIndependent · 8 months ago

      yeah...I did sentence breaks with new lines so the results would definitely change

  • @adamgdev · 8 months ago

    Solid video. Would've loved to see some of the tweet replies discussed in this video.

    • @DataIndependent · 8 months ago

      Thanks Adam! Totally! I should have included that - I might do a recap video on it

  • @SuperYutubu · 8 months ago

    Awesome!

  • @hitalex07 · 5 months ago

    Excellent work! Do you plan to do this test with the new Gemini Pro 1.5 model?

    • @DataIndependent · 4 months ago

      Yep - totally, once it comes out and there's access

  • @jeffwads · 7 months ago

    Have you tried running this test on the great Mixtral MOE?

  • @MridulBanikcse · 3 months ago

    How are you evaluating the score?

  • @Li-rm2gj · 8 months ago

    It’s an interesting result, thanks for doing it. Any ideas why the conclusion seems to be different from the lost in the middle paper? Are the takeaways contradictory?

    • @DataIndependent · 7 months ago

      There is so much variability with these tests it is tough to pin down what it would be

  • @DannyGerst · 8 months ago · +1

    Would you like to make the script public? It would be awesome to test other models with. I am on your mailing list, but where do I need to sign to get that on a silver platter ;-)
    Oooh, I found it. Thanks for putting it in your repo. Very valuable.

    • @DataIndependent · 8 months ago · +2

      I just put the code in the description! Thanks for the call out

  • @dcrebbin · 8 months ago

    absolute mad lad

  • @hidroman1993 · 8 months ago · +1

    I guess a needle should be something the model can't make up, e.g. "The answer to the question that you are going to be asked is '9h550klz2a6'"

    • @DataIndependent · 8 months ago · +2

      When I was getting feedback on this after the test I was told that uuid key value pair retrieval is the standardized test. Makes sense.
      I went for relatable in this version

  • @Joao-pm8je · 8 months ago

    Awesome test. Really clean and concise. I would recommend getting rid of the side camera, since it adds nothing of value and looks a bit weird. Cheers!

  • @Unknown16633 · 8 months ago

    Nice analysis! How exactly did you measure the correctness of the answer?

    • @DataIndependent · 7 months ago

      I used LangChain's eval which was easy
      github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/73ffdd4dd2d190d9306d9162a2401ae9a067ddcf/LLMNeedleHaystackTester.py#L369
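The linked LangChain eval asks a grader LLM to score each answer; as a dependency-free stand-in that only illustrates the scoring idea (this keyword check is purely hypothetical and much cruder than the LLM grader the harness actually uses):

```python
def score_answer(model_answer: str, expected_keywords: list[str]) -> float:
    """Crude grader: fraction of expected keywords present, case-insensitive,
    scaled to 0-10. The real harness grades with an LLM instead."""
    answer = model_answer.lower()
    found = sum(1 for kw in expected_keywords if kw.lower() in answer)
    return 10 * found / len(expected_keywords)

# A needle-matching answer scores 10, an off-topic answer scores 0.
score_answer("Eat a sandwich and sit in Dolores Park.",
             ["sandwich", "dolores park"])
```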

  • @Shoaibkhan-oj3oe · 8 months ago

    Could you please add details on how this can be done with Azure OpenAI on your website? I tried the LangChain extraction chain with Azure OpenAI but I was not able to run it; I had to resort to functions to do that. Can you please tell me how I can effectively extract only certain data from the script? Also, how do I work with tabular data with LLMs?

    • @DataIndependent · 7 months ago

      Here's the code, you can fork it and make azure the model provider
      github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py

  • @rosendoduron4753 · 4 months ago

    Excellent test! I have enjoyed watching your videos. I would ask if you could do the same with Claude 3? It would be nice to see a comparison.

    • @DataIndependent · 4 months ago

      Nice! Thank you - I haven't done it with that one yet but it's on the to do list

  • @hermannschmidt9788 · 8 months ago

    Interesting! I'd say that large context windows always beat all the previous tricks like chunking + summarizing to shrink the data size.

    • @DataIndependent · 8 months ago

      I don’t know on that actually - I’m a big fan of investing in retrieval to get better signal:noise
      I haven’t seen a use case that requires 128K tokens of context that wouldn’t benefit from better retrieval

    • @hermannschmidt9788 · 8 months ago

      Lossless retrieval into large context windows, yes.

    • @DataIndependent · 8 months ago

      @hermannschmidt9788 sure, once it can recall & synthesize 100% accurately from long context then it'll be a way different conversation

  • @ashlynnantrobus5029 · 8 months ago

    I just saw your graph for Claude2.1. That was a lot of red. Also interesting that it was almost all 100% or 0%, with very little in between.
    For the failures, were you getting a lot of Claude claiming it couldn't do that kind of task? That's probably the most common response I get from Claude for any given task

    • @DataIndependent · 8 months ago · +1

      I'd get a lot of this type of response
      "Unfortunately, the context does not mention the most fun thing to do in San Francisco. It discusses the history and design of the Lisp programming language, web and mobile application development, and starting technology companies. There is no information provided about activities or attractions in San Francisco specifically. Without any relevant details to draw from, I cannot provide a direct answer to the question asked."

    • @ashlynnantrobus5029 · 8 months ago

      @DataIndependent so it's actually trying, but just looking for it the way my kids do (I promise you, your shoes are not in the ceiling. You can stop looking there)

  • @vaidyanathanag6463 · 8 months ago

    Can you link the paper for reference ?

  • @feralmachine · 8 months ago · +1

    PG writes about SF occasionally. Wouldn’t this test be more definitive if you had changed the city name to one that PG has never written about? Perhaps the error rate simply increased with context length because more of PG’s opinions on SF got included in the text, and not because of hallucination. That aside, thank you for doing this test and how much $ it cost. I think this kind of thing is the most interesting and valuable kind of content. 🙌

    • @DataIndependent · 8 months ago

      Nice! Thank you for that and you're totally right. Small variations in the text/question would produce different results.
      I was told after the test was done that I could have done a key:value pair UUID retrieval
      Ex: "What is the value for this key? ad1491f3-d899-495b-8fea-7f07a7c6a602?"
      But that felt boring even if it was technically more correct. My goal was to kick off the conversation rather than claim definitive results (hence a tweet write up vs a paper and peer reviewed)
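The key:value UUID variant described above could be sketched like this (purely illustrative, not the author's harness): random UUIDs guarantee the model cannot answer from prior knowledge, only from the context.

```python
import random
import uuid

# Sketch of the key:value UUID retrieval variant. Each needle is a
# "key: value" pair of random UUIDs scattered into the haystack text.

def make_uuid_needles(n: int) -> dict[str, str]:
    return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n)}

def scatter(haystack_words: list[str], needles: dict[str, str]) -> str:
    words = list(haystack_words)
    for key, value in needles.items():
        pos = random.randrange(len(words) + 1)
        words.insert(pos, f"\n{key}: {value}\n")
    return " ".join(words)

needles = make_uuid_needles(3)
context = scatter("filler text for the haystack".split() * 500, needles)
key, value = next(iter(needles.items()))
question = f"What is the value for this key? {key}"
# A correct answer must quote `value` verbatim; scoring is exact match.
```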

  • @caiyu538 · 8 months ago

    👍

  • @hanzo_process · 2 months ago

    👍👍👍

  • @moxes8237 · 5 months ago

    What's temperature in AI language models?
    Also, congratulations on Google using your Needle in a Haystack method 👍

  • @michelefruscella7373 · 8 months ago

    In practice, we've built artificial brains, and now we need neurologists (data scientists) like you to study them.

  • @sanz1996_ · 8 months ago

    Now try the same test, but with content related to the document rather than some random sentence.
    That would be a more practical test.
    I uploaded a lengthy medical benefits document and the answers were close enough to a human's response

  • @mayanksingh3366 · 8 months ago

    Hi Greg, I hope we will meet one day. One more thing: I am 20 yrs old, so given my age, what should I call you? Uncle, Bro, or anything else?

    • @DataIndependent · 8 months ago · +1

      Let's go with 'Greg'. Whenever anyone opens up with 'bro' I close the message

    • @scharlesworth93 · 8 months ago

      Call him Unclebro

  • @Divyv520 · 8 months ago

    Hey Greg, really nice video! I was wondering if I could help you enhance the editing in your videos and also make highly engaging thumbnails to help your videos reach a wider audience.

  • @micbab-vg2mu · 8 months ago

    Thank you - very useful.