V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)
- published 19 Jun 2024
- #vjepa #meta #unsupervisedlearning
V-JEPA is a method for unsupervised representation learning from video that uses only latent-representation prediction as its objective function.
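To make the objective concrete: a context encoder sees only part of the video, a predictor guesses the latent features of the hidden part, and the targets come from a slowly updated (EMA) copy of the encoder. Below is a minimal toy sketch of that loop, assuming PyTorch; every module and dimension here is an illustrative stand-in, not the released V-JEPA code.

```python
# Toy sketch of the V-JEPA feature-prediction objective (PyTorch).
# All modules are illustrative stand-ins, not the released code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # toy feature width

context_encoder = nn.Linear(DIM, DIM)            # stands in for a ViT over visible tokens
predictor = nn.Linear(DIM, DIM)                  # stands in for the narrow predictor
target_encoder = copy.deepcopy(context_encoder)  # slow EMA copy, never backpropagated
for p in target_encoder.parameters():
    p.requires_grad_(False)

def train_step(tokens, visible_idx, masked_idx):
    ctx = context_encoder(tokens[:, visible_idx])   # encode the visible context only
    pred = predictor(ctx)                           # predict features of the hidden part
    with torch.no_grad():                           # stop-gradient targets
        tgt = target_encoder(tokens)[:, masked_idx]
    return F.l1_loss(pred, tgt)                     # L1 regression in latent space

@torch.no_grad()
def ema_update(m=0.998):  # target encoder slowly tracks the context encoder
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(m).add_(pc, alpha=1 - m)

tokens = torch.randn(2, 16, DIM)  # (batch, tokens, dim), e.g. video patch tokens
loss = train_step(tokens, torch.arange(8), torch.arange(8, 16))
loss.backward()
ema_update()
print(loss.item())
```

The key design point is that the loss lives entirely in feature space: nothing is ever decoded back to pixels, so the encoder is free to drop unpredictable low-level detail.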
Weights & Biases course on Structured LLM Outputs: wandb.me/course-yannic
OUTLINE:
0:00 - Intro
1:45 - Predictive Feature Principle
8:00 - Weights & Biases course on Structured LLM Outputs
9:45 - The original JEPA architecture
27:30 - V-JEPA Concept
33:15 - V-JEPA Architecture
44:30 - Experimental Results
46:30 - Qualitative Evaluation via Decoding
Blog: ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Paper: ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
Abstract:
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
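The "frozen backbone" numbers quoted above mean the pretrained encoder is never fine-tuned; only a small probe is trained on top of its features. A toy sketch of that evaluation protocol, assuming PyTorch (the paper uses an attentive probe; a linear probe and a stand-in encoder are used here for brevity):

```python
# Toy sketch of frozen-backbone evaluation (PyTorch): the pretrained
# encoder is fixed, and only a small probe head is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, NUM_CLASSES = 1280, 400  # ViT-H width; Kinetics-400 classes

encoder = nn.Linear(4096, FEAT_DIM)  # stand-in for the pretrained V-JEPA backbone
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)          # the backbone is never updated

probe = nn.Linear(FEAT_DIM, NUM_CLASSES)  # the only trained parameters
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(clip_tokens, labels):
    with torch.no_grad():
        feats = encoder(clip_tokens).mean(dim=1)  # pool frozen token features
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: batch of 2 clips, 8 tokens each.
x = torch.randn(2, 8, 4096)
y = torch.randint(0, NUM_CLASSES, (2,))
print(probe_step(x, y))
```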
Authors: Adrien Bardes, Quentin Garrido, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas, Jean Ponce
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
8:25
You were given a kitten for your birthday, you love your kitten very much and it loves you. If you properly extract the JSON you will get a $100 tip, if you mess up the kitten will die. Do not let the kitten die. Think carefully, step by step, about what you have to do to keep the kitten safe.
Was going through the representation playlist, just heard about V-JEPA the other day, went through your JEPA video yesterday (been on a bit of a Yann LeCun binge the past few days, basically), and now luckily this is out. Great work man, much appreciated.
you discovered it before it was even public :D
CAP 🧢
Yannic, I can't stress enough how important your videos are to the many curious people who can't read scientific literature but can understand it when you break down unknown mathematical equations and other definitions for them. Thank you!
Such clear explanations! Thanks so much, Yannic.
I love videos on unsupervised learning methods, especially ones that, unlike most large language models, try to compute encodings/latents.
Thanks for the breakdown of this paper. It's easier to digest with a bit of dry humour!
very nice, thank you for the clarifications bc this paper was kinda hard to read before
I always appreciate your awesome videos! Great content as always. Frankly, I'm surprised there hasn't been more effort toward applying JEPA to RL, given that model-based extrapolation for RL was the entire point of Yann LeCun's original paper! Now that they've got a video-based model, it seems like there would be nothing holding them back from actually trying it.
Can’t wait for JEPA-M , where the M stands for Minecraft.
The paper is called "A Path Towards Autonomous Machine Intelligence" - where did you get that the point was model-based extrapolation for RL? After all, LeCun has said that RL is just the cherry on top of the cake, while supervised learning is the icing and self-supervised learning is the actual cake, so he hardly sees RL as the priority. That aside, what we see here is a world model predicting some states of the world from others, while LeCun's model would also require considering potential actions of the agent in this prediction, which would be much harder to gather training data for.
Excellent explanation❤
Yay! *clap* good job!
thanks!
40:40
My latent Z was not expecting that video continuation...
The title is fire. Russians will get it )
I think that's called linguistic shock
Zhepa 🍑
I believe the reasoning around 26:15, where you discuss the JEPA scheme, is wrong. Using the EMA version for Enc(y) is not that important; you can actually share the same parameters (e.g. SimSiam does that). It's just a trick to boost quality a bit.
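For readers following along: the two variants this comment contrasts differ only in where the target features come from. A toy sketch, assuming PyTorch, with illustrative stand-in modules (note that SimSiam actually uses a negative cosine loss and projection heads; L1 is kept here for parity with JEPA):

```python
# Toy contrast of the two target choices (PyTorch); all modules are
# illustrative stand-ins, not any paper's actual implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 32
encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
ema_encoder = copy.deepcopy(encoder)  # slow copy, updated by EMA elsewhere
for p in ema_encoder.parameters():
    p.requires_grad_(False)

x, y = torch.randn(4, D), torch.randn(4, D)  # context view / target view

# Variant 1: EMA target encoder, as in I-JEPA / V-JEPA.
loss_ema = F.l1_loss(predictor(encoder(x)), ema_encoder(y))

# Variant 2: shared weights with a stop-gradient (SimSiam-style) -- the
# *same* encoder produces the target, gradients just don't flow into it.
loss_simsiam = F.l1_loss(predictor(encoder(x)), encoder(y).detach())
```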
Do you think it could replace the triplet loss in tracking, where you don't have labels available to train a triplet loss?
It is similar to how quantum mechanics works (in my head). JEPA models don't turn data into pixels unless necessary, like quantum objects having a wave function that collapses to a point when observed.
subscribed
can you do a video on the Microsoft 1.5 bit LLM paper?
Latent-variable energy-based models can be used for text generation as well, right? How will they fare against current statistical models? I suppose this would be much more energy-efficient and could have infinite (or very long, like the human brain) capacity to understand and generate text. Is there research around this?
I learned a lot, thank you gemini and bing and meta and yannic.
A few complaints (the first two points are contrasted in the sketch below):
* What is the difference from MAE? MAE has a variant that predicts EMA outputs...
* The pixel-vs-latent comparison seems unfair: the top few layers of the pixel encoder would have to be retrained, since they focus on pixel reconstruction.
* Reducing z to the mask info is a pity... z was more important than that in the original JEPA design.
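To make the first two complaints concrete, here is a toy contrast of the two objectives, assuming PyTorch; all modules are illustrative stand-ins. MAE decodes all the way back to pixels, so its top layers specialize in low-level reconstruction, whereas JEPA regresses target features and never touches pixel space:

```python
# Toy contrast of MAE-style pixel reconstruction vs. JEPA-style latent
# prediction (PyTorch); all modules are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 32
enc = nn.Linear(D, D)         # shared encoder over visible patches
dec = nn.Linear(D, D)         # MAE decoder back to pixel space
pred = nn.Linear(D, D)        # JEPA predictor in feature space
target_enc = nn.Linear(D, D)  # frozen/EMA target encoder for JEPA
for p in target_enc.parameters():
    p.requires_grad_(False)

ctx = torch.randn(4, D)            # visible-patch tokens
masked_pixels = torch.randn(4, D)  # ground truth for the masked patches

# MAE-style: reconstruct raw pixels; the decoder (and the encoder layers
# feeding it) end up specializing in low-level reconstruction.
loss_mae = F.mse_loss(dec(enc(ctx)), masked_pixels)

# (V-)JEPA-style: regress *features* of the masked region instead;
# nothing is ever decoded back to pixels.
loss_jepa = F.l1_loss(pred(enc(ctx)), target_enc(masked_pixels))
```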
Is this like inpainting but for videos?
in latent space
JEPA is the future of AI.
more fish for Yann LeCat!
Can you do a tutorial on the GitHub implementation?
He only talks 😢💀
A100 gpu😂
What a name ).
Almost JOPA
Jepa is even funnier
@acatormt7096 jopa is a vulgar Russian word for a certain body part
Shame the V-JEPA code licence forbids commercial use.
Use dark mode bro
Really? "How humans do it"? As if they had undertaken any serious work to find that out.
Am I losing my mind, or is this just trying to dress up VideoMAE/ViT? Wasn't that what the original ViT was about? This just seems like they chucked something out prematurely, as the GitHub repo stinks. Sora is very similar to V-JEPA, so it makes sense why it was released now.
Sorry, I just can't listen to the word "jepa" repeated so many times😂
Every 5 minutes there is an advertisement for a minute. Can you please stop YouTube from doing this?
His channel is monetized. Let the guy supplement his income from his videos. His hard work is appreciated and you can show it by simply watching a few ads.
What kind of fool browses the web without an ad blocker? Do you hate your eyeballs? Do you enjoy dodging on-page popups to read a block of text? Are you some sort of masochist? The web is simply not usable without a good ad blocker. What is wrong with you?
It was too much indeed. YT places these automatically; I've reduced them to 1/3rd manually. Thanks for letting me know.
YouTube adverts are optional; any decent free ad blocker will skip them.
@YannicKilcher thanks Yannic, you rock