V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)
- published 19 Jun 2024
- #vjepa #meta #unsupervisedlearning
V-JEPA is a method for unsupervised representation learning from video that uses only latent-representation prediction as its objective function.
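To make the objective concrete: a context encoder sees only part of the video, a predictor guesses the latent features of the hidden part, and the targets come from a slowly updated (EMA) copy of the encoder. Below is a minimal toy sketch of that loop, assuming PyTorch; every module and dimension here is an illustrative stand-in, not the released V-JEPA code.

```python
# Toy sketch of the V-JEPA feature-prediction objective (PyTorch).
# All modules are illustrative stand-ins, not the released code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # toy feature width

context_encoder = nn.Linear(DIM, DIM)            # stands in for a ViT over visible tokens
predictor = nn.Linear(DIM, DIM)                  # stands in for the narrow predictor
target_encoder = copy.deepcopy(context_encoder)  # slow EMA copy, never backpropagated
for p in target_encoder.parameters():
    p.requires_grad_(False)

def train_step(tokens, visible_idx, masked_idx):
    ctx = context_encoder(tokens[:, visible_idx])   # encode the visible context only
    pred = predictor(ctx)                           # predict features of the hidden part
    with torch.no_grad():                           # stop-gradient targets
        tgt = target_encoder(tokens)[:, masked_idx]
    return F.l1_loss(pred, tgt)                     # L1 regression in latent space

@torch.no_grad()
def ema_update(m=0.998):  # target encoder slowly tracks the context encoder
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(m).add_(pc, alpha=1 - m)

tokens = torch.randn(2, 16, DIM)  # (batch, tokens, dim), e.g. video patch tokens
loss = train_step(tokens, torch.arange(8), torch.arange(8, 16))
loss.backward()
ema_update()
print(loss.item())
```

The key design point is that the loss lives entirely in feature space: nothing is ever decoded back to pixels, so the encoder is free to drop unpredictable low-level detail.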
Weights & Biases course on Structured LLM Outputs: wandb.me/course-yannic
OUTLINE:
0:00 - Intro
1:45 - Predictive Feature Principle
8:00 - Weights & Biases course on Structured LLM Outputs
9:45 - The original JEPA architecture
27:30 - V-JEPA Concept
33:15 - V-JEPA Architecture
44:30 - Experimental Results
46:30 - Qualitative Evaluation via Decoding
Blog: ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Paper: ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/
Abstract:
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
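The "frozen backbone" numbers quoted above mean the pretrained encoder is never fine-tuned; only a small probe is trained on top of its features. A toy sketch of that evaluation protocol, assuming PyTorch (the paper uses an attentive probe; a linear probe and a stand-in encoder are used here for brevity):

```python
# Toy sketch of frozen-backbone evaluation (PyTorch): the pretrained
# encoder is fixed, and only a small probe head is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, NUM_CLASSES = 1280, 400  # ViT-H width; Kinetics-400 classes

encoder = nn.Linear(4096, FEAT_DIM)  # stand-in for the pretrained V-JEPA backbone
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)          # the backbone is never updated

probe = nn.Linear(FEAT_DIM, NUM_CLASSES)  # the only trained parameters
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(clip_tokens, labels):
    with torch.no_grad():
        feats = encoder(clip_tokens).mean(dim=1)  # pool frozen token features
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: batch of 2 clips, 8 tokens each.
x = torch.randn(2, 8, 4096)
y = torch.randint(0, NUM_CLASSES, (2,))
print(probe_step(x, y))
```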
Authors: Adrien Bardes, Quentin Garrido, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas, Jean Ponce
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
8:25
You were given a kitten for your birthday, you love your kitten very much and it loves you. If you properly extract the JSON you will get a $100 tip, if you mess up the kitten will die. Do not let the kitten die. Think carefully, step by step, about what you have to do to keep the kitten safe.
Was going through the representation playlist, just heard about V-JEPA the other day, went through your JEPA video yesterday (been on a bit of a Yann LeCun binge the past few days, basically), and now luckily this is out. Great work man, much appreciated.
you discovered it before it was even public :D
CAP 🧢
Yannic, I can't stress enough how important your videos are to the many curious people who can't read scientific literature but can understand it when you break down unknown mathematical equations and other definitions for them. Thank you!
Such clear explanations! Thanks so much, Yannic.
I love videos on unsupervised learning methods, especially ones that, unlike most large language models, try to compute encodings/latents.
Thanks for the breakdown of this paper. It's easier to digest with a bit of dry humour!
very nice, thank you for the clarifications bc this paper was kinda hard to read before
I always appreciate your awesome videos! Great content as always. Frankly, I'm surprised there hasn't been more effort toward applying JEPA to RL, given that model-based extrapolation for RL was the entire point of Yann LeCun's original paper! Now that they've got a video-based model, it seems like there would be nothing holding them back from actually trying it.
Can’t wait for JEPA-M , where the M stands for Minecraft.
The paper is called "A Path Towards Autonomous Machine Intelligence" - where did you get that the point was model-based extrapolation for RL? After all, LeCun has said that RL is just the cherry on top of the cake, while supervised learning is the icing and self-supervised learning is the actual cake, so he hardly sees RL as the priority. That aside, what we see here is a world model predicting some states of the world from others, while LeCun's model would also require considering potential actions of the agent in this prediction, which would be much harder to gather training data for.
Excellent explanation❤
Yay! *clap* good job!
thanks!
40:40
My latent Z was not expecting that video continuation...
The title is fire. Russians will get it )
I think that's called linguistic shock
Zhepa 🍑
I believe the reasoning around 26:15, where you discuss the JEPA scheme, is wrong. Using the EMA version for Enc(y) is not that important; you can actually share the same parameters (e.g. SimSiam does that). It's just a trick to boost quality a bit.
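For readers following along: the two variants this comment contrasts differ only in where the target features come from. A toy sketch, assuming PyTorch, with illustrative stand-in modules (note that SimSiam actually uses a negative cosine loss and projection heads; L1 is kept here for parity with JEPA):

```python
# Toy contrast of the two target choices (PyTorch); all modules are
# illustrative stand-ins, not any paper's actual implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 32
encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
ema_encoder = copy.deepcopy(encoder)  # slow copy, updated by EMA elsewhere
for p in ema_encoder.parameters():
    p.requires_grad_(False)

x, y = torch.randn(4, D), torch.randn(4, D)  # context view / target view

# Variant 1: EMA target encoder, as in I-JEPA / V-JEPA.
loss_ema = F.l1_loss(predictor(encoder(x)), ema_encoder(y))

# Variant 2: shared weights with a stop-gradient (SimSiam-style) -- the
# *same* encoder produces the target, gradients just don't flow into it.
loss_simsiam = F.l1_loss(predictor(encoder(x)), encoder(y).detach())
```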
Do you think it could replace the triplet loss in tracking, where you don't have labels available to train a triplet loss?
It is similar to how quantum mechanics works (in my head). JEPA models don't turn data into pixels unless necessary, like quantum objects having a wave function that collapses to a point when observed.
subscribed
can you do a video on the Microsoft 1.5 bit LLM paper?
Latent-variable energy-based models can be used for text generation as well, right? How will they fare against current statistical models? I suppose this would be much more energy-efficient and could have infinite (or very long, like the human brain) capacity to understand and generate text. Is there research around this?
I learned a lot, thank you gemini and bing and meta and yannic.
A few complaints (the first two points are contrasted in the sketch below):
* What is the difference from MAE? MAE has a variant that predicts EMA outputs...
* The pixel-vs-latent comparison seems unfair: the top few layers of the pixel encoder would have to be retrained, since they focus on pixel reconstruction.
* Reducing z to the mask info is a pity... z was more important than that in the original JEPA design.
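To make the first two complaints concrete, here is a toy contrast of the two objectives, assuming PyTorch; all modules are illustrative stand-ins. MAE decodes all the way back to pixels, so its top layers specialize in low-level reconstruction, whereas JEPA regresses target features and never touches pixel space:

```python
# Toy contrast of MAE-style pixel reconstruction vs. JEPA-style latent
# prediction (PyTorch); all modules are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 32
enc = nn.Linear(D, D)         # shared encoder over visible patches
dec = nn.Linear(D, D)         # MAE decoder back to pixel space
pred = nn.Linear(D, D)        # JEPA predictor in feature space
target_enc = nn.Linear(D, D)  # frozen/EMA target encoder for JEPA
for p in target_enc.parameters():
    p.requires_grad_(False)

ctx = torch.randn(4, D)            # visible-patch tokens
masked_pixels = torch.randn(4, D)  # ground truth for the masked patches

# MAE-style: reconstruct raw pixels; the decoder (and the encoder layers
# feeding it) end up specializing in low-level reconstruction.
loss_mae = F.mse_loss(dec(enc(ctx)), masked_pixels)

# (V-)JEPA-style: regress *features* of the masked region instead;
# nothing is ever decoded back to pixels.
loss_jepa = F.l1_loss(pred(enc(ctx)), target_enc(masked_pixels))
```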
Is this like inpainting but for videos?
in latent space
JEPA is the future of AI.
more fish for Yann LeCat!
Can you do a tutorial on the GitHub implementation?
He only talks 😢💀
A100 gpu😂
What a name ).
Almost JOPA
Jepa is even funnier
@acatormt7096 jopa is a vulgar Russian word for a certain body part
Shame the V-JEPA code licence forbids commercial use.
Use dark mode bro
Really? "How humans do it"? As if they had undertaken any serious work to find that out.
Am I losing my mind, or is this just trying to dress up VideoMAE/ViT? Wasn't that what the original ViT was about? This just seems like they chucked something out prematurely, as the GitHub repo stinks. Sora is very similar to V-JEPA, so it makes sense why it was released now.
Sorry, I just can't listen to the word "jepa" repeated so many times😂
Every 5 minutes there is an advertisement for a minute. Can you please stop YouTube from doing this?
His channel is monetized. Let the guy supplement his income from his videos. His hard work is appreciated and you can show it by simply watching a few ads.
What kind of fool browses the web without an ad blocker? Do you hate your eyeballs? Do you enjoy dodging on-page popups to read a block of text? Are you some sort of masochist? The web is simply not usable without a good ad blocker. What is wrong with you?
It was too much indeed. YT places these automatically; I've reduced them to 1/3rd manually. Thanks for letting me know.
YouTube adverts are optional; any decent free ad blocker will skip them.
@YannicKilcher thanks Yannic, you rock