ICLR 2020: Yann LeCun and Energy-Based Models
- Added 10 May 2020
- This week Connor Shorten, Yannic Kilcher and Tim Scarfe reacted to Yann LeCun's keynote speech at this year's ICLR conference, which just passed. ICLR is the number two ML conference and was completely open this year, with all the sessions publicly accessible via the internet. Yann spent most of his talk speaking about self-supervised learning, energy-based models (EBMs) and manifold learning. Don't worry if you haven't heard of EBMs before; neither had we!
Thanks for watching! Please Subscribe!
Paper Links:
ICLR 2020 Keynote Talk: iclr.cc/virtual_2020/speaker_...
A Tutorial on Energy-Based Learning: yann.lecun.com/exdb/publis/pdf...
Concept Learning with Energy-Based Models (Yannic's Explanation): • Concept Learning with ...
Concept Learning with Energy-Based Models (Paper): arxiv.org/pdf/1811.02486.pdf
Concept Learning with Energy-Based Models (OpenAI Blog Post): openai.com/blog/learning-conc...
#deeplearning #machinelearning #iclr #iclr2020 #yannlecun
5:35 Energy Functions, a Hitchhiker’s Guide to the Machine Learning Galaxy
11:07 Initial Reactions to LeCun's Talk
19:50 The Future is Self-Supervised, Early Concept Acquisition in Infants
24:35 The REVOLUTION WILL NOT BE SUPERVISED!
25:44 Three Challenges for Deep Learning
30:18 Self-Supervised Learning is Fill-in-the-Blanks
31:30 Inference and Multimodal Predictions
33:25 Energy-Based Models “Without Resorting to Probabilities”
37:33 Gradient-Based Inference
39:35 Unconditional Energy-Based Models, How K-Means Is an Energy-Based Model
41:26 Latent Variable EBM
(To Be Continued)
I like the music as an intro, but I think it's okay to fade it out earlier as it distracts from the explanation
I'm really sorry about that, on YouTube and on my phone it now sounds way louder than it did on my main machine. I'll be more careful next time and check the levels on another machine 😁😊👍
@@machinelearningdojowithtim2898 Generally toning down the intro in terms of fast cuts, cut effects and music and making it a bit more 'relaxed' might help. I don't know, I prefer a slightly quieter approach.
@@csr7080 fair enough, let's compromise on the next one ;)
Music is too loud and I don't think it's suitable at all.
Loved this discussion! Another way to think of "the manifold" is through D-dimensional heat-maps (or scalar fields). A continuous energy-based model defines one D-dimensional heat-map, and the true data distribution defines another. Energy-based methods hope to make the "cold" (i.e. low-valued) regions in the true map "cold" in the model map. Contrastive methods transfer "heat/coldness" patterns from the true map to the model map. Regularized methods make sure that the model doesn't just make the whole map cold! :)
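To make the heat-map picture concrete, here's a minimal numeric sketch (my own toy construction, not from the talk or the comment above): a 1-D energy model E(x) = min_k (x - c_k)² with two learnable "cold wells" c_k, which is essentially k-means viewed as an unconditional EBM. Learning pulls the wells toward the data, so the cold regions of the true map become cold in the model map.

```python
import numpy as np

# Toy 1-D energy model with two learnable "cold wells" (k-means as an EBM).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 0.3, 300), rng.normal(3, 0.3, 300)])

centers = np.array([0.5, -0.5])   # initial well locations, far from the data

def energy(x, centers):
    """Model heat-map: low (cold) near a well, high (warm) elsewhere."""
    return np.min((x[:, None] - centers[None, :]) ** 2, axis=1)

# Learning = make the cold regions of the true map cold in the model map:
# pull each well toward the data it currently claims (Lloyd's k-means updates).
for _ in range(50):
    assign = np.argmin((data[:, None] - centers[None, :]) ** 2, axis=1)
    for k in range(2):
        if np.any(assign == k):
            centers[k] = data[assign == k].mean()

grid = np.array([-3.0, 0.0, 3.0])
print(np.round(energy(grid, centers), 1))  # cold (~0) at the data modes, warm in between
```

Note that the quadratic well shape plays the regularizer's role here: two narrow wells cannot make the whole map cold.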
This is a great description, thanks
This was a wonderful discourse on EBMs. I'm glad I spent enough time to understand the concept of learning manifolds from these guys. Worth it.
The music I think is good but please lower the volume when someone is speaking
I loved the music, appropriately epic IMO..... Your content is outstanding, sincere thanks to you all.
I think a key point for casting existing methods into the energy framework is that it allows you to understand that existing methods are particular points on a broader spectrum, and therefore there are gaps between existing methods that could be equally valid and more effective. It wasn't covered here, but I'd like to hear more about how the probability framework results in less smooth surfaces, which might inhibit learning compared to energy-based methods.
But I think the idea of doing gradient descent as part of inference is a fantastically interesting idea. Combine that with the concept of LISTA, which uses a NN to predict the outcome of this "SGD on Z" process, and this becomes a bit like transitioning from System 2 to System 1. In other words, if you do this "SGD at inference" enough times, you can fit a predictive model to that process. Then, recovering the optimal Z is just another feed-forward inference exercise, which is more like System 1 intuition (along with all the opportunities for mistakes and biases).
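The "SGD on Z" idea above can be sketched in a few lines (purely illustrative; the linear "decoder" W and all names are my own assumptions, not LeCun's setup). The energy is E(y, z) = ||y - Wz||², and inference is gradient descent on the latent z rather than a feed-forward pass:

```python
import numpy as np

# Inference-time gradient descent on a latent z for a fixed linear "decoder" W.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 3))        # pretend-trained decoder
z_true = np.array([1.0, -2.0, 0.5])
y = W @ z_true                     # observation we want to explain

z = np.zeros(3)                    # start from an uninformative latent
lr = 0.01
for _ in range(2000):
    grad = -2.0 * W.T @ (y - W @ z)   # dE/dz for E(y, z) = ||y - W z||^2
    z -= lr * grad

print(np.round(z, 2))              # recovers z_true up to optimization error
```

A LISTA-style model would then be trained to map y directly to the z this loop finds, amortizing the System 2 search into a System 1 forward pass.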
Yann LeCun has shared this video on his Facebook wall. Will spend today and tomorrow slowly going through the video in phases.
Amazing that Yann shared on Facebook too! We are super excited about where we can take this channel
Super cool talk, thanks!
Music ends around 7:54
1:30:40 I think language is continuous in the temporal domain, whereas an image may not be continuous in the spatial domain, which makes denoising an image with masked pixels less effective than a masked language model.
Great talk thank you for your effort and patience!
I'm curious about the point on using energy during inference, not learning. Could this be related to a sort of 'esoteric inference' the models might do, something akin to the Concept Learning model, where the energy function is used during what they call 'execution time' to infer concepts internally from pairs of input data? I wonder if that makes sense: did LeCun have in mind internalizing the process of inference in the model as a way to learn more abstract and difficult tasks/concepts, like in that same paper, where SGD is used to create an output?
Yes, that makes sense and you're absolutely on the right track. Of course we can't speak for LeCun here, but as I imagine it, this (what you're saying) is one of the advantages that these EBMs have. Of course, the power of using something like SGD during inference comes at the cost that you have to train this somehow.
I get the feeling self supervised learning for images and video is going to take many decades to figure out.
What's the difference between an energy function and a general cost function?
Very informative! I still don't quite get what a manifold is though, can you suggest me some great sources? Also, is manifold somehow related to loss landscape? Thanks!
Hey Arka! A manifold is just all the places where certain types of data can exist. For example, a 2d plane is a manifold: you can only place points on that plane. Imagine the 3d locations of all the cities in the world; they exist on the surface of a spherical manifold. And real-world data also sits on a manifold, albeit a very complicated one which you couldn't visualise! It's a great way to reason about the inner workings of deep learning models.
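The cities example above can be checked in a few lines (purely illustrative): sample points by latitude and longitude, the manifold's intrinsic coordinates, and confirm that although each point has three ambient coordinates, they all satisfy a single constraint.

```python
import numpy as np

# Points that live in 3-D space but on a 2-D manifold (the sphere's surface):
# three coordinates each, but only two degrees of freedom.
rng = np.random.default_rng(42)
R = 6371.0                                    # Earth radius in km

# Sample "cities" by two angles -- the manifold's intrinsic coordinates.
lat = rng.uniform(-np.pi / 2, np.pi / 2, 1000)
lon = rng.uniform(-np.pi, np.pi, 1000)
cities = np.stack([R * np.cos(lat) * np.cos(lon),
                   R * np.cos(lat) * np.sin(lon),
                   R * np.sin(lat)], axis=1)

# Every point has 3 coordinates, yet all satisfy one constraint: ||p|| = R.
print(np.allclose(np.linalg.norm(cities, axis=1), R))  # True
```

Real data manifolds are like this, only with far more ambient dimensions and a far more complicated constraint surface.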
Do you mind reuploading the video without the music? I was eager to watch this, but the music made me quit after the first minute.
It's a 2 hour show, just skip forward a few minutes
36:23 Why is energy used for inference but not for learning? Also, Yannic is amazing at reducing complicated topics into understandable terms!
The introduction answers the long-standing question: what would happen if a bunch of computer scientists DJ'd a rave?
I'm definitely on board the machine learning rave train.
Bro you are going to make me write an AI to remove background music
Why is there background music? :(((((((
Old video, we are skilling up on video editing all the time! Sorry!
I disagree with Yannic's assertion that System 1 and System 2 are arbitrary distinctions. We are talking about computing systems here, and there are different kinds of computer / data pairs. System 1 corresponds to DFAs / regular languages, while System 2 is context-free or higher. Hierarchical decomposition, as in planning, amounts to grammatical parsing, which is fundamentally distinct from regex- and CNN-style pattern matching.
It is true that some tasks originally learned in System 2 can eventually be distilled and passed down to System 1, since every regular language is a subset of some context-free language. But it is also true that there are some System 2 tasks that can never be passed down to System 1. For example, multiplying 8 digit numbers, composing Petrarchan sonnets, proving theorems, remembering what your wife just said, or writing in Assembly. There is no "muscle memory" for higher level tasks like these -- they require sustained conscious attention, just as every computer higher than a DFA requires memory.
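A toy illustration of that computational distinction (my own example, not from the discussion): balanced parentheses form a context-free language, and recognizing them needs unbounded memory, here a counter standing in for a stack, which no finite-state machine has.

```python
# Balanced parentheses: context-free, not regular. A DFA with k states is
# fooled by nesting deeper than k; the unbounded counter below is not.

def is_balanced(s: str) -> bool:
    depth = 0                      # acts as an unbounded stack of '('
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # closing with nothing open
                return False
    return depth == 0

print(is_balanced('((()))'), is_balanced('(()'))  # True False
```

In the comment's framing, the pattern-matching (System 1) machine can only approximate this up to a fixed depth; the counting (System 2) machine handles any depth.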
Agree with Yannic's point about babies. It's like asking, "How do spiders learn how to spin webs so quickly?"
Multi-task learning is way more straightforward, and the tech is already working well. Unsupervised learning will never be optimal because it has no ability to discard information that is not relevant to tasks we care about.
There is a story about John von Neumann and a-bit-tricky-yet-very-simple problem: train stations A and B are 150 miles apart. Train T1 goes from station A to station B, and train T2 goes from station B to station A. Both trains travel at 75 miles/hour and they leave at the same time. A fly, initially on train T1, flies towards T2 at 50 miles / hour. When it reaches T2 it instantly turns back and flies towards T1 (and the pattern repeats). How many miles does the fly travel before it meets its inevitable fate? 😊
The story is that von Neumann instantly answered 50 miles. When asked how he did it, he said: "I summed the infinite series, of course." 😁This story is often told to talk about von Neumann's genius, but I think it should also be seen as a reminder that genius can sometimes overlook the simplest and most elegant solutions.
I don't understand most of the things discussed in this video (I'm not at that level yet ) but, from having watched Yann LeCun's past talks, I wonder if sometimes the experts are too smart to see the "dumb" solutions 😁Of course, I have no clue. I'll keep learning though. I really enjoyed watching your conversation and seeing you try figuring out the workings of a brilliant mind. It's very helpful.
That's a wonderful anecdote Bianca! Thanks for sharing!
50 miles makes no sense. You must have memorized the wrong numbers.
The answer is 50 miles, but the point of the story is that von Neumann, who solved it immediately, didn't use the "aha" method (trains meet in 1hr, ergo fly travels 50 miles since he's flying at 50mph). von Neumann instead instantly formulated and summed the infinite series in his head.
@@benbridgwater6479 Suddenly it stops when trains meet? What infinite series?
Edit: So I looked it up. The fly is supposed to be faster than the train(!) and pinball between the trains until it gets squashed(!). With that info it makes sense. Here: www.infiltec.com/j-logic.htm
@@aBigBadWolf 😁Yes and no. I got the wrong numbers but not because I memorized them wrongly. I didn't memorize them at all. It was the logic of the story that I remembered. Unfortunately I forgot the fly had to be faster than the trains for this to make sense at all. 😁
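For anyone curious, the series from the anecdote can be checked numerically once the numbers are corrected so the fly outpaces the trains, as the thread notes it must; I assume 100 miles/hour for the fly here and keep the rest of the story's numbers.

```python
# Sum the bounce-by-bounce series von Neumann computed in his head.
d, v_train, v_fly = 150.0, 75.0, 100.0   # miles, mph, mph (fly speed assumed)

total = 0.0
fly = 0.0                 # fly starts on train A (position 0)
a, b = 0.0, d             # train positions; A moves right, B moves left
toward_b = True
for _ in range(200):      # 200 bounces: the geometric series has long converged
    target = b if toward_b else a
    target_v = -v_train if toward_b else v_train
    fly_v = v_fly if toward_b else -v_fly
    t = (target - fly) / (fly_v - target_v)   # time until fly meets that train
    total += abs(fly_v) * t
    fly += fly_v * t
    a += v_train * t
    b -= v_train * t
    toward_b = not toward_b

print(round(total, 6))    # matches the shortcut: trains meet after 1 h -> 100 miles
```

The shortcut and the series agree: the trains close the 150-mile gap at a combined 150 mph, so they meet after 1 hour, and a 100 mph fly covers 100 miles in that time.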
The music makes this unwatchable
Yeah, like others have said, this sucks