V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)

  • Published 19. 06. 2024
  • #vjepa #meta #unsupervisedlearning
    V-JEPA is a method for unsupervised representation learning from video data that uses only latent representation prediction as its objective function.
    Weights & Biases course on Structured LLM Outputs: wandb.me/course-yannic
    OUTLINE:
    0:00 - Intro
    1:45 - Predictive Feature Principle
    8:00 - Weights & Biases course on Structured LLM Outputs
    9:45 - The original JEPA architecture
    27:30 - V-JEPA Concept
    33:15 - V-JEPA Architecture
    44:30 - Experimental Results
    46:30 - Qualitative Evaluation via Decoding
    Blog: ai.meta.com/blog/v-jepa-yann-...
    Paper: ai.meta.com/research/publicat...
    Abstract:
    This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
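    To make the objective concrete, below is a minimal sketch of the kind of feature-prediction loss the abstract describes, assuming a context encoder, a target encoder, and a predictor module; the function names, shapes, and indexing scheme are illustrative placeholders, not Meta's released implementation.

      # Minimal sketch of a V-JEPA-style feature-prediction objective.
      # Module names, shapes, and hyperparameters are illustrative placeholders.
      import torch
      import torch.nn.functional as F

      def feature_prediction_loss(context_encoder, target_encoder, predictor,
                                  video_tokens, context_idx, target_idx):
          """Regress the target encoder's features at the masked (target) positions.

          video_tokens: (B, N, D) patchified video tubelets
          context_idx:  indices of visible tokens fed to the context encoder
          target_idx:   indices of masked tokens whose features are predicted
          """
          # Encode only the visible context tokens.
          ctx_feats = context_encoder(video_tokens[:, context_idx])    # (B, Nc, D)

          # Target features come from a separate target encoder (in the paper an
          # exponential-moving-average copy of the context encoder); no gradients
          # flow through this branch.
          with torch.no_grad():
              tgt_feats = target_encoder(video_tokens)[:, target_idx]  # (B, Nt, D)

          # The predictor fills in features at the masked positions, conditioned
          # on the context features and the mask locations.
          pred_feats = predictor(ctx_feats, target_idx)                # (B, Nt, D)

          # Regression happens purely in feature space; no pixels are reconstructed.
          return F.l1_loss(pred_feats, tgt_feats)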
    Authors: Adrien Bardes, Quentin Garrido, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas, Jean Ponce
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Comments • 51

  • @YannicKilcher
    @YannicKilcher  4 months ago +3

    Weights & Biases course on Structured LLM Outputs: wandb.me/course-yannic
    OUTLINE:
    0:00 - Intro
    1:45 - Predictive Feature Principle
    8:00 - Weights & Biases course on Structured LLM Outputs
    9:45 - The original JEPA architecture
    27:30 - V-JEPA Concept
    33:15 - V-JEPA Architecture
    44:30 - Experimental Results
    46:30 - Qualitative Evaluation via Decoding
    Blog: ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
    Paper: ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 4 months ago +35

    8:25
    You were given a kitten for your birthday; you love your kitten very much and it loves you. If you properly extract the JSON you will get a $100 tip; if you mess up, the kitten will die. Do not let the kitten die. Think carefully, step by step, about what you have to do to keep the kitten safe.

  • @mk677hd
    @mk677hd 4 months ago +22

    Was going through the representation learning playlist, just heard about V-JEPA the other day, and went through your JEPA video yesterday; I've basically been on a Yann LeCun binge the past few days, and now luckily this is out. Great work man, much appreciated.

  • @nanow1990
    @nanow1990 4 months ago +5

    Yannic, I can't stress enough how important your videos are to the many curious people who can't read the scientific literature themselves but can understand it when you break down unfamiliar mathematical equations and other definitions for them. Thank you!

  • @dariodemattiesreyes3788
    @dariodemattiesreyes3788 4 months ago +2

    Such clear explanations! Thanks so much Yannic.

  • @y29k15
    @y29k15 4 months ago +3

    I love videos on unsupervised learning methods, especially ones that, unlike most large language models, try to compute encodings/latents.

  • @FredPauling
    @FredPauling 4 months ago +2

    Thanks for the breakdown of this paper. It's easier to digest with a bit of dry humour!

  • @LukasSmith827
    @LukasSmith827 4 months ago +2

    Very nice, thank you for the clarifications, because this paper was kinda hard to read before.

  • @halocemagnum8351
    @halocemagnum8351 4 months ago

    I always appreciate your awesome videos! Great content as always. Frankly I’m surprised there hasn’t been more effort toward applying JEPA to RL, given that model-based extrapolation for RL was the entire point of Yann LeCun’s original paper! Now that they’ve got a video-based model, it seems like there would be nothing holding them back from actually trying it.
    Can’t wait for JEPA-M, where the M stands for Minecraft.

    • @EdFormer
      @EdFormer a month ago

      The paper is called "a path towards autonomous machine intelligence" - where did you get that the point was about model based extrapolation for RL? After all, LeCun has said that RL is just the cherry on the top of the cake, while supervised learning is the icing, and self supervised learning is the actual cake, so he hardly sees RL as the priority. That aside, what we see here is a world model predicting some states of the world from others, while LeCun's model would require also considering potential actions of the agent in this prediction, which would be much harder to gather training data for.

  • @vimukthirandika872
    @vimukthirandika872 3 months ago +1

    Excellent explanation❤

  • @mshonle
    @mshonle 4 months ago +1

    Yay! *clap* good job!

  • @CristianGarcia
    @CristianGarcia 4 months ago +2

    thanks!

  • @JammyMiddleofN
    @JammyMiddleofN 4 months ago +2

    40:40
    My latent Z was not expecting that video continuation...

  • @user_gmg8607
    @user_gmg8607 4 months ago +10

    The title is fire. Russians will get it)

  • @janrocketman9542
    @janrocketman9542 4 months ago +2

    I believe the reasoning around 26:15, when you discuss the JEPA scheme, is wrong. It's not that important to use the EMA version for Enc(y); you can actually replace it with the same parameters (e.g. SimSiam does that). It's just a trick to boost quality a bit.
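
    For readers unfamiliar with the two options contrasted in this comment, here is a hypothetical sketch of an EMA target encoder versus a shared-weight, stop-gradient target in the style of SimSiam; the function names and the momentum value are placeholders, not code from either paper.

      # Hypothetical sketch of the two target-branch choices discussed above.
      import torch

      @torch.no_grad()
      def ema_update(target_encoder, online_encoder, momentum=0.998):
          """JEPA/BYOL-style target: slowly track the online encoder's weights."""
          for p_t, p_o in zip(target_encoder.parameters(),
                              online_encoder.parameters()):
              p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

      def shared_weight_target(online_encoder, y):
          """SimSiam-style target: reuse the online encoder itself; collapse is
          avoided by the stop-gradient (and an asymmetric predictor), not by EMA."""
          with torch.no_grad():
              return online_encoder(y)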

  • @gurkirtsingh
    @gurkirtsingh 4 months ago

    Do you think it could replace the triplet loss in tracking, where you don't have labels available to train a triplet loss?

  • @hasantekin7823
    @hasantekin7823 3 months ago

    It is similar to how quantum mechanics works (in my head). JEPA models don't turn data into pixels unless necessary, like quantum objects having a wave function that collapses to a point when observed.

  • @tpty_pbhs
    @tpty_pbhs 3 months ago

    subscribed

  • @abunapha
    @abunapha 3 months ago

    Can you do a video on the Microsoft 1.58-bit LLM paper?

  • @jawadmansoor6064
    @jawadmansoor6064 4 months ago

    Latent-variable energy-based models can be used for text generation as well, right? How would they fare against current statistical models? I suppose this would be much more energy efficient and could have infinite (or very long, like the human brain) capacity to understand and generate text. Is there any research on this?

    • @jawadmansoor6064
      @jawadmansoor6064 4 months ago +1

      I learned a lot, thank you Gemini and Bing and Meta and Yannic.

  • @kimchi_taco
    @kimchi_taco 4 months ago +1

    A few complaints:
    * What is the difference from MAE? MAE has a variant that predicts EMA outputs...
    * The pixel-vs-latent comparison doesn't seem fair. The top few layers of the pixel encoder would have to be retrained, since they focus on pixel reconstruction.
    * It's a pity that z is just the mask info... z was more important than that in the original JEPA design.
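
    To make the MAE comparison raised above concrete, here is a hypothetical contrast of the two loss targets: MAE regresses raw masked pixels through a decoder, while V-JEPA regresses the target encoder's features at the masked positions, with the mask locations acting as the conditioning information z. Function and argument names are placeholders.

      # Hypothetical contrast of the two objectives discussed in the comment above.
      import torch.nn.functional as F

      def mae_style_loss(decoder, encoded_context, masked_pixel_patches):
          """MAE: decode back to pixel space and regress the raw masked patches."""
          reconstructed = decoder(encoded_context)
          return F.mse_loss(reconstructed, masked_pixel_patches)

      def vjepa_style_loss(predictor, encoded_context, target_features, mask_positions):
          """V-JEPA: regress the target encoder's features at the masked positions;
          the mask positions play the role of the conditioning variable z."""
          predicted = predictor(encoded_context, mask_positions)
          return F.l1_loss(predicted, target_features)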

  • @HUEHUEUHEPony
    @HUEHUEUHEPony 4 months ago

    Is this like inpainting but for videos?

  • @IronMechanic7110
    @IronMechanic7110 4 months ago

    JEPA is the future of AI.

  • @propeacemindfortress
    @propeacemindfortress 4 months ago

    more fish for Yann LeCat!

  • @fintech1378
    @fintech1378 4 months ago +1

    Can you do a tutorial for the GitHub implementation?

  • @blackswaneleven
    @blackswaneleven 3 months ago +2

    What a title).

  • @14types
    @14types 4 months ago +3

    Almost JOPA

    • @acatormt7096
      @acatormt7096 4 months ago

      Jepa is even funnier

    • @14types
      @14types 4 months ago

      @acatormt7096 "jopa" is a vulgar Russian word for a certain body part

  • @lukebyrne6113
    @lukebyrne6113 4 months ago +1

    Shame the V-JEPA code licence forbids commercial use.

  • @MrMIB983
    @MrMIB983 4 months ago

    Use dark mode bro

  • @teckyify
    @teckyify 4 months ago +3

    Really? "How humans do it"? As if they had undertaken any serious work to find that out.

  • @Jay-kb7if
    @Jay-kb7if 4 months ago +2

    Am I losing my mind, or is this just trying to dress up VideoMAE/ViT? Wasn't that what the original ViT was about? This just seems like they chucked something out prematurely, as the GitHub repo stinks. Sora is very similar to V-JEPA, so it makes sense why it was released now.

  • @barrettkepler7618
    @barrettkepler7618 3 months ago

    Sorry, I just can't listen to the word "jepa" repeated so many times😂

  • @ellenluminescense
    @ellenluminescense 4 months ago +1

    Every 5 minutes there is a minute-long advertisement. Can you please stop YouTube from doing this?

    • @immortalsofar7977
      @immortalsofar7977 4 months ago +4

      His channel is monetized. Let the guy supplement his income from his videos. His hard work is appreciated and you can show it by simply watching a few ads.

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 4 months ago

      What kind of fool browses the web without an ad blocker? Do you hate your eyeballs? Do you enjoy dodging on-page popups to read a block of text? Are you some sort of masochist? The web is simply not usable without a good ad blocker. What is wrong with you?

    • @YannicKilcher
      @YannicKilcher  4 months ago +6

      It was too much indeed. YT places these automatically; I've reduced them to a third manually. Thanks for letting me know.

    • @Zantorc
      @Zantorc 4 months ago +1

      YouTube adverts are optional; any decent free ad blocker will skip them.

    • @DelandaBaudLacanian
      @DelandaBaudLacanian 4 months ago

      @YannicKilcher Thanks Yannic, you rock