ARC Prize
Testing Frontier LLMs (GPT4) on ARC-AGI
Template: www.kaggle.com/code/gregkamradt/using-frontier-models-on-arc-agi-via-langchain?scriptVersionId=184611945
arcprize.org/leaderboard
arcprize.org/arc-agi-pub
ARC Prize is a $1,000,000+ public competition to beat and open source a solution to the ARC-AGI benchmark.
Hosted by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of ARC-AGI, Keras).
--
Website: arcprize.org/
Twitter/X: arcprize
Newsletter: Signup @ arcprize.org/
Discord: discord.gg/9b77dPAmcA
Try your first ARC-AGI tasks: arcprize.org/play
Views: 2,347
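
The template above queries a frontier model on ARC-AGI tasks. For readers who want the shape of such a harness, here is a minimal sketch in Python; the openai package, the gpt-4o model name, and the task.json path are illustrative assumptions, not details taken from the template itself.

    import json
    from openai import OpenAI  # assumes the openai Python package (v1+) is installed

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # An ARC-AGI task is a JSON file with "train" demonstration pairs and
    # "test" inputs; grids are lists of lists of integers 0-9.
    # "task.json" is a placeholder path, not one used by the template.
    with open("task.json") as f:
        task = json.load(f)

    prompt = "Infer the transformation from these input/output examples:\n\n"
    for pair in task["train"]:
        prompt += f"Input: {pair['input']}\nOutput: {pair['output']}\n\n"
    prompt += ("Apply the same transformation to this input and reply with "
               f"the output grid only:\n{task['test'][0]['input']}")

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)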

Videos

Francois Chollet recommends this method to solve ARC-AGI
6K views · 1 month ago
ARC Benchmark Origins
1.9K views · 1 month ago
Implications of solving the ARC benchmark
2.4K views · 1 month ago
Explore ARC-AGI Data + Play
4.5K views · 1 month ago
Announcing ARC Prize
2.8K views · 1 month ago
Welcome To ARC Prize - Mike & Francois
1.4K views · 1 month ago
Francois Chollet On LLMs w/ Active Inference
1.6K views · 2 months ago

Comments

  • @sp3ct3rgaming46 · 3 days ago

    I might be tripping, but I think this dude cloned his own voice and then layered it into the video. You can hear the typical ElevenLabs lisp.

    • @ARCprize · 3 days ago

      @@sp3ct3rgaming46 You're tripping. I made the video and no voice dub was used.

  • @mosca204 · 3 days ago

    Why is the train/evaluation set so small?

    • @ARCprize · 3 days ago

      The tasks are handmade, which limits the scale that can be achieved. They focus on diversity rather than quantity at this stage.

  • @DistortedV12 · 4 days ago

    Mike Knoop is a very smart guy. He knows a lot about ML for being an outsider in that space with a MechE background.

  • @tankieslayer6927 · 5 days ago

    I am not entirely convinced about active inference. Training on unlabelled test data is an old trick on Kaggle to improve scores, but does it really say anything about intelligence?

  • @jackq2331 · 6 days ago

    Excellent.

  • @johnkintner · 6 days ago

    Third, since no one called it :kappa:

  • @alvaromros8127 · 7 days ago

    Does your submission count if you make use of private models like GPT-4 at some point in your algorithm?

  • @davefaulkner6302 · 9 days ago

    So he is saying that LLMs guiding hyperparameter-space search will get us to AGI? Seems a little simplistic to me ... and there are better ways to search that space.

  • @CitsVariants · 12 days ago

    Upp

  • @LimeTubeH · 13 days ago

    I'm confused... what are we supposed to attach with our API add-on secret?

    • @ARCprize · 12 days ago

      What do you mean "attach"? That's where you put your API key, which you then reference in your code.

  • @MarkoTManninen · 13 days ago

    I understand retries, but I am confused by the two attempts. Do you always need to provide two? In that case, would they have different data, with both required for a 100% correct prediction? I also missed the part in which the predictions and correct answers are matched and pronounced.

    • @ARCprize · 13 days ago

      Sorry this isn't clearer in the video! You get two tries at each task (old competitions had 3 tries), so you can give two attempts. If either is correct, you pass the task. There is more information under the scoring methodology: arcprize.org/guide#submissions
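
The scoring rule in the reply above is simple enough to state in code. A minimal sketch, assuming list-of-lists grids and a function name of our own choosing; see arcprize.org/guide#submissions for the authoritative rules.

    def task_passed(attempts, solution):
        # A task counts as solved if either of the (up to) two submitted
        # attempts exactly matches the hidden solution grid.
        return any(attempt == solution for attempt in attempts[:2])

    # Example: the first attempt is wrong, the second matches, so the task passes.
    solution = [[1, 0], [0, 1]]
    attempts = [[[0, 0], [0, 0]], [[1, 0], [0, 1]]]
    assert task_passed(attempts, solution)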

  • @anantkeepershome4327 · 14 days ago

    When you think about it, the optimal network should be like a physics simulator: every example has its own stable rules. My guess is that a recurrent network would have the best chance, though the parameter count would need to be huge, so we could perhaps make a hypernet to generate the weights from scratch.

  • @aluphshahim5808 · 14 days ago

    Second 😂

  • @michealkinney6205 · 14 days ago

    How did this change so much, so quickly? ARC-AGI-PUB now has code that gets ~42%. Looking at it now, and it's pretty awesome.

  • @conformist · 14 days ago

    first.

    • @cyb3rvoid · 14 days ago

      That was unreal!

    • @conformist · 14 days ago

      @@cyb3rvoid For my next magic trick, I will solve the AGI prize first.

    • @wwkk4964 · 14 days ago

      @@conformist Solve it backwards!

    • @filipgara3444 · 14 days ago

      Ensure diversity in your model

  • @volotat · 15 days ago

    Here is a little observation I made while playing with ARC: it is very dependent on the assumed geometry of the grid. Meaning, if we imagine that each cell is a node of a graph and we do not know how the graph is connected, we cannot build a program that describes the transformations of a particular task. So we need to assume some way it is connected, i.e. a geometry of the grid. For each task the geometry is different, and depending on how well we assumed the geometry, the program that "solves" the task might be either simple or complicated.

    Several consequences follow from this. Firstly, for each task there may be a huge number of "correct" solutions that exist on unfamiliar geometries but with very simple generating programs; such solutions would most likely not qualify as correct. Secondly, to solve ARC an AI has to be aligned with human expectations of how a proper task should look and be solved. And lastly, most ARC-like tasks generated by an algorithm that assumes a random grid geometry and a simple program would be completely unsolvable for almost any human in most cases.

    So, my main point is: ARC-like tasks exist in a much broader space than we tend to assume, and the only reason they are solvable for us at all is that the subset created by François reminds us of tasks we have already encountered, so we can reliably guess what François had in mind while creating them.
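
The geometry point above can be made concrete. In the toy sketch below (all names are ours, not ARC's), the same two diagonally touching cells form one object if you assume 8-connectivity but two objects under 4-connectivity, so what counts as an "object" depends entirely on the assumed grid geometry.

    def components(cells, neighbors):
        # Group a set of (row, col) cells into connected components
        # under the supplied adjacency function.
        remaining, groups = set(cells), []
        while remaining:
            stack, group = [remaining.pop()], set()
            while stack:
                c = stack.pop()
                group.add(c)
                for n in neighbors(c):
                    if n in remaining:
                        remaining.remove(n)
                        stack.append(n)
            groups.append(group)
        return groups

    four = lambda c: [(c[0] + dr, c[1] + dc)
                      for dr, dc in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
    eight = lambda c: [(c[0] + dr, c[1] + dc)
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0)]

    diagonal_pair = {(0, 0), (1, 1)}  # cells touching only at a corner
    print(len(components(diagonal_pair, four)))   # 2 objects
    print(len(components(diagonal_pair, eight)))  # 1 object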

  • @geospatialindex · 18 days ago

    Sorry, this isn't general intelligence. This is just reasoning. It is painful watching a whole industry trying to reinvent psychology when there is already a century of research there.

    • @ARCprize · 16 days ago

      Thanks for the comment! We'd love to hear your ideas and thoughts about how to get closer to AGI

  • @geospatialindex · 18 days ago

    So have you collaborated with any psychologists to make this test?

    • @ARCprize · 16 days ago

      Check out section 11.1 of "On the Measure of Intelligence", where François digs into the influence of human psychology.

  • @andreaspatounis5674 · 18 days ago

    What is the relation between the percentage of questions the AI gets correct and the IQ of the models?

    • @ARCprize · 16 days ago

      How do you measure the IQ of the model?

    • @andreaspatounis5674 · 16 days ago

      @@ARCprize What I meant to say was: how rare is it for a human to get 39%, and how rare is it to get 85%?

  • @JirkaKlimes_ · 18 days ago

    It can't be that hard, right?

    • @ARCprize · 16 days ago

      Try it out! We'd love to see a submission

  • @Aemond-qj4xt · 18 days ago

    I think I might have unintentionally set the basis for solving this in a project I did a couple months ago.

    • @ARCprize · 16 days ago

      We'd love to see a submission!

    • @Aemond-qj4xt · 14 days ago

      @@ARCprize Working on it. I just handed in my graduation project, so I have time to work on this now.

  • @sebastianlowe7727 · 20 days ago

    Is this related to P vs NP in an interesting way?

    • @abominablealias4514 · 18 days ago

      No

    • @thobiasknudsen1730 · 7 days ago

      If a really smart model were made, it could find a fast solution to any NP problem, turning it into a P problem, which would make the model a proof that P = NP.

  • @DistortedV12 · 20 days ago

    I have an idea now, thanks. I’ll probably check out ARC after my PhD qualifying exam. Finetuning is gonna be fun 🤩

  • @BigFatSandwitch · 20 days ago

    The question is: if someone manages to get a high score on ARC-style problems, will they really share the code for such a small amount of money, rather than raising money from venture capital firms for their startup based on it?

    • @ARCprize · 20 days ago

      We encourage an open source solution to ARC-AGI!

  • @robbielualhati1731 · 22 days ago

    There are also birds, such as Sulphur Crested Cockatoos, that have shown problem-solving skills. Hopefully it's proof enough that a basic reasoning model won't require a trillion parameters.

  • @neilashtekar1329 · 23 days ago

    IMO, solving ARC is necessary for AGI, though it may not be sufficient. It's not obvious to me whether a solution to ARC would generalize to other problems. Like François mentioned, no benchmark is perfect, and there may be ways to "cheese" ARC!

    • @JirkaKlimes_ · 21 days ago

      If people just keep throwing augmentation at it, then yeah, stupid LLM memorization works and no AGI for us.

  • @harithnawalage9355 · 23 days ago

    It's a good idea that the solution has to be made open source.

    • @ARCprize · 23 days ago

      Awesome, yes, this is to counteract the recent trend toward closed AI research.

  • @ignaciosavi7739 · 24 days ago

    Let's get to the bottom of this: how much do I get for reaching 90% accuracy with a free LLM?

    • @ARCprize · 23 days ago

      The threshold for a Kaggle score is 85%; reach that with a valid submission and you're eligible for a prize.

    • @ignaciosavi7739 · 22 days ago

      @@ARCprize Thanks

  • @ignaciosavi7739 · 24 days ago

    I advise you not to get that cocky. Otherwise I'll solve everything and demand my prize 😂

    • @ARCprize · 23 days ago

      We're excited to see you on the leaderboard!

    • @ignaciosavi7739 · 22 days ago

      @@ARCprize Sheesh, OK, I'll be there ;-)

  • @bedev1087 · 25 days ago

    Did François say he would like to see a solution which can solve these puzzles without having been trained on a lot of "ARC-like" input/output pairs? This benchmark seems to exist as a subset of the permutation group of operations on coloured grids (expanding/contracting the grid, extruding masses, rotating masses, filling in holes, adding single colours, etc.). If the "core knowledge" claim about ARC is true, then the discovery of the correct set of "core knowledge" matrix operations could be used to synthetically generate a dataset. You could then sample thousands of games from a pre-trained policy network given only the first training input, and reinforce the trajectories closest to the revealed answer. Then sample games given the 1st training input, the first training output, and the 2nd training input, and reinforce again before the test set. Or is this not allowed?

    • @ARCprize · 24 days ago

      François said he would *like* to see a solution that isn't trained on a bunch of input/output ARC tasks, but there isn't a rule that says this isn't allowed. As long as your submission only makes 2 final attempts per task, you can use it. This means you can test and iterate on the example pairs as much as you'd like.

    • @bedev1087 · 24 days ago

      @ARCprize Hey, thanks for the reply! :) You obviously have the generating operations locked away, so this method would only be able to train on "ARC-like" operations. So can I ask what his intuition is on a model being able to center on "core knowledge" priors from training on tasks orthogonal to ARC? Thanks for making such a great opportunity for the community 👍

  • @tycrenshaw6968 · 25 days ago

    I don't know if he is nervous or what, but his face is totally red and looks like it is hurting a lot.

    • @ARCprize · 23 days ago

      François is on fire, with knowledge.

  • @omarnomad · 25 days ago

    Is there a way to know all the priors you embed into the puzzles? So far I've identified:

    1. Translations - shifting objects or patterns across the grid.
    2. Rotations - rotating objects or patterns at different angles.
    3. Reflections - flipping objects or patterns across a line.
    4. Scaling - changing the size of objects or patterns.
    5. Repetition and symmetry - repeating patterns or creating symmetrical designs.
    6. Color changes - altering the color of objects or patterns.
    7. Compositions - combining multiple operations or transformations.
    8. Object addition or removal - adding or removing elements within the grid.
    9. Grid size changes - modifying the dimensions of the grid or the objects within it.

    • @ARCprize · 24 days ago

      There have been a bunch of attempts at this. Table 4 of this paper leans in that direction: arxiv.org/pdf/2403.11793. There isn't a way to know all the priors; that would essentially be giving away the answers to the test set.
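
The two threads above ask which transformation priors the puzzles embed and whether a guessed set of "core knowledge" operations could drive synthetic data generation. The sketch below composes a few such guessed primitives into a random program and emits input/output pairs; the real generating operations are private, so every transform here is an assumption, not the benchmark's own.

    import random
    import numpy as np

    # Guessed "core knowledge" grid primitives, loosely matching the list above.
    TRANSFORMS = {
        "rotate": lambda g: np.rot90(g),
        "flip": lambda g: np.fliplr(g),
        "recolor": lambda g: np.where(g == 1, 2, g),
        "scale": lambda g: np.kron(g, np.ones((2, 2), dtype=g.dtype)),
    }

    def random_task(n_pairs=3, program_len=2):
        # Sample a random program (a composition of primitives), then apply
        # it to several random grids to produce demonstration pairs.
        program = [random.choice(list(TRANSFORMS)) for _ in range(program_len)]
        pairs = []
        for _ in range(n_pairs):
            grid = np.random.randint(0, 4, size=(random.randint(3, 6),) * 2)
            out = grid
            for op in program:
                out = TRANSFORMS[op](out)
            pairs.append({"input": grid.tolist(), "output": out.tolist()})
        return {"program": program, "train": pairs}

    print(random_task()["program"])  # e.g. ['flip', 'scale']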

  • @guseynguliyev707 · 26 days ago

    The video gives me such interesting vibes! I feel like it will bring together thousands of smart minds to work on a single problem, and I can't wait to see the results! Humanity is truly amazing when we work together on something.

  • @-mwolf · 26 days ago

    These recent code diffusion models sound similar to this idea 🤔

  • @gustavnilsson6597 · 26 days ago

    Perhaps we want the model to search actively, as humans do, by manipulating its environment. I don't think this problem is going to be easy to solve.

  • @ps3301 · 26 days ago

    If you can't design an AI architecture to solve this problem, you aren't as smart as you think.

    • @denisblack9897 · 25 days ago

      Hot take :) Don't forget to design a great design to sell subscriptions 😅

  • @wwkk4964 · 26 days ago

    Couldn't this be achieved fairly easily by employing a diffusion-based LLM reasoning module with accurate image-captioning capabilities that ignore irrelevant details when feeding the LLM at inference/test time?

    • @Hohohohoho-vo1pq · 25 days ago

      Stop overusing the term LLM. GPTs are not LLMs. LLMs are GPTs.

    • @wwkk4964 · 25 days ago

      @@Hohohohoho-vo1pq Please refer to the presentation here; the author of the research calls it an LLM. czcams.com/video/kYtvqbgCxFA/video.htmlsi=nrFEIED7mmZAE_Zf

    • @RecursiveTriforce · 25 days ago

      @@Hohohohoho-vo1pq LLMs can be GPTs. GPTs can be LLMs. GPT is the architecture; LLM is the size. They neither imply nor contradict each other.

    • @Hohohohoho-vo1pq · 25 days ago

      @@RecursiveTriforce Reducing GPTs to "mere LLMs" is very misleading. People don't even understand what it means when they say that.

  • @el_chivo99 · 26 days ago

    Chollet has been my favorite voice on AI for 5+ years and I don't see that changing!

  • @bladekiller2766 · 27 days ago

    This is how Stockfish (the chess engine) works. Not sure whether it will lead to AGI.

    • @ARCprize · 27 days ago

      We'd love a submission that tries this approach to see how it goes - super interesting

  • @fayezsalka · 28 days ago

    But we, as humans, don't do tree search when solving ARC. We solve it in one shot, almost immediately, without trying/searching through different solutions, because the spatial patterns look very obvious, no? Large multimodal models with the ability to input AND output images natively will be able to solve this in one shot. Case in point: GPT-5 with native image output.

    • @stevo-dx5rr · 27 days ago

      I'm not a researcher, but the notion that "this is obviously not what humans do" seems moot, given that the same can be said about transformers.

    • @ARCprize · 27 days ago

      > We solve it in one shot
      > The spatial patterns look very obvious
      The human brain is very good at using "intuition" to prune a search space. Though it may happen quickly, there are many (maybe infinitely many) possible decisions humans could make, yet we prune them down to just a few very quickly.

    • @stevo-dx5rr · 27 days ago

      @@ARCprize What do you think of MIT's "Introduction to Program Synthesis" course as a starting point?

    • @fayezsalka · 26 days ago

      @@ARCprize This implies that multimodal models could do much better with an internal "for loop" that lets them iterate through different solutions on the hidden-space manifold before decoding into output. The only question becomes how to train such a model and what dataset and loss objective to use. Alternatively, could we imitate such a process by having the "thinking" happen in the decoded output in an autoregressive fashion? We humans have the ability to "think out loud" as one option, and having an LLM think in the decoded output space might make it easier and more familiar to train. (Basically, is chain of thought considered one crude form of discrete search?)

    • @shawnvandever3917 · 25 days ago

      We don't do it in one shot. We go through hundreds or thousands of prediction updates to answer a question, updating our mental models in real time while this is happening.
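
The exchange above contrasts one-shot intuition with explicit search. Below is a toy version of program search with early pruning, under the assumption that tasks are compositions of a few grid primitives (our choice, not the video's): a candidate program is rejected as soon as one training pair disagrees, a crude mechanical stand-in for intuition narrowing the space.

    from itertools import product
    import numpy as np

    # Assumed primitive operations; real solvers would use a richer set.
    PRIMS = {"rot": np.rot90, "flip": np.fliplr, "t": np.transpose}

    def apply_program(program, grid):
        out = np.array(grid)
        for op in program:
            out = PRIMS[op](out)
        return out

    def search(train_pairs, max_len=3):
        for length in range(1, max_len + 1):
            for program in product(PRIMS, repeat=length):
                # all() short-circuits: a candidate is discarded as soon as
                # one training pair disagrees, rather than checked in full.
                if all(np.array_equal(apply_program(program, p["input"]),
                                      p["output"])
                       for p in train_pairs):
                    return program
        return None

    pairs = [{"input": [[1, 2], [3, 4]], "output": [[2, 4], [1, 3]]}]
    print(search(pairs))  # ('rot',): np.rot90 maps [[1, 2], [3, 4]] to [[2, 4], [1, 3]]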

  • @2394098234509 · 28 days ago

    Love this

  • @MsFerroCat · 28 days ago

    I'm glad somebody came here to address the misconception: most models that exist nowadays are NOT AI, but just overly complex imitating machines, even though they can imitate a lot. Learning skills is more of a natural thing, not yet described with code, which is the goal here. It's fun to realize that this is more like the process of creating artificial life than anything else.

  • @IGobzter · 29 days ago

    Some real brainpower behind this project. I've longed for a counterexample that really shows the shortcomings of current SOTA models; it was a gut feeling, but now we can actually know with some certainty!

    • @ARCprize · 27 days ago

      Awesome! Let us know if you have any questions along the way

  • @Pomirkovany · 29 days ago

    OK thanks, I'll try to build this solution with ChatGPT

  • @user-pl4pz2xn2c · 29 days ago

    Children can solve these puzzles, but I don't think LLMs can.

    • @ARCprize · 29 days ago

      We haven't seen an LLM do this yet

    • @shure-youtube · 24 days ago

      @@ARCprize How about VLMs? I think this task requires strong spatial understanding.

    • @ignaciosavi7739 · 24 days ago

      @@ARCprize How?

  • @s.dotmedia · 29 days ago

    Let's go! I'm all in on this. I will say: don't count out the power of one-shot.

    • @ARCprize · 29 days ago

      Nice! Love it - let us know if you need anything along the way

  • @InfiniteQuest86 · 1 month ago

    Yeah, I'm excited about this, but isn't it likely that the best solution to ARC will be to treat it like a specialized game, like chess? I'm not sure you need intelligence to do well on the problem set; it's very well defined. I can see doing pretty well on the examples by treating each problem category independently: if (moving problem) do moving solution; if (filling-out problem) do filling-out pattern; if (color-change problem) do color-change solution; if (subset problem) take the subset; etc.

    • @ARCprize · 29 days ago

      The problem is that "moving problem" or "filling-out problem" are programs/techniques you see in the training data; the test set contains programs/techniques that haven't been seen before, so your model would need to adapt to those.
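
The dispatch idea in this thread, and the failure mode the reply points out, can be sketched directly. The detector and solver below are hypothetical placeholders; the point is that any task type missing from the hand-written registry cannot be solved at all, which is exactly the adaptation problem the reply raises.

    def looks_like_color_change(task):
        # Hypothetical detector: input and output grids have identical shapes,
        # so only cell values can have changed.
        return all(
            len(p["input"]) == len(p["output"])
            and len(p["input"][0]) == len(p["output"][0])
            for p in task["train"]
        )

    def solve_color_change(task, test_input):
        ...  # hypothetical solver body, elided

    SOLVERS = [
        (looks_like_color_change, solve_color_change),
        # (looks_like_moving_problem, solve_moving), ... one entry per known category
    ]

    def dispatch(task, test_input):
        for detect, solve in SOLVERS:
            if detect(task):
                return solve(task, test_input)
        # Test-set tasks built from unseen techniques land here: the registry
        # has no entry for them, which is where this approach breaks down.
        raise ValueError("unrecognized task type")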

  • @mriz · 1 month ago

    Thanks for the demonstrations. This task feels like an arbitrary single-step state transition in a cellular automaton. It also looks like fun to play 😄

    • @ARCprize · 29 days ago

      Nice! Yes, please go try it out and let us know what you think

  • @InfiniteQuest86 · 1 month ago

    Thank you! I know this has been around for a while, but I'm happy to see a legitimate attempt at testing intelligence that isn't "it passed the Turing test." LLMs sound smart because they speak our language, but are they really doing anything more than regurgitating memorized information? This test suggests most likely not.

  • @ctejada-0 · 1 month ago

    Thank you for this.

    • @ARCprize · 1 month ago

      We're excited! Thank you!