ICLR 2020: Yann LeCun and Energy-Based Models

  • Added 10 May 2020
  • This week Connor Shorten, Yannic Kilcher and Tim Scarfe reacted to Yann LeCun's keynote speech at this year's ICLR conference, which has just finished. ICLR is the number-two ML conference and was completely open this year, with all the sessions publicly accessible via the internet. Yann spent most of his talk on self-supervised learning, energy-based models (EBMs) and manifold learning. Don't worry if you haven't heard of EBMs before; neither had we!
    Thanks for watching! Please Subscribe!
    Paper Links:
    ICLR 2020 Keynote Talk: iclr.cc/virtual_2020/speaker_...
    A Tutorial on Energy-Based Learning: yann.lecun.com/exdb/publis/pdf...
    Concept Learning with Energy-Based Models (Yannic's Explanation): • Concept Learning with ...
    Concept Learning with Energy-Based Models (Paper): arxiv.org/pdf/1811.02486.pdf
    Concept Learning with Energy-Based Models (OpenAI Blog Post): openai.com/blog/learning-conc...
    #deeplearning #machinelearning #iclr #iclr2020 #yannlecun

Comments • 44

  • @MachineLearningStreetTalk

    5:35 Energy Functions, a Hitchhiker’s Guide to the Machine Learning Galaxy
    11:07 Initial Reactions to LeCun's Talk
    19:50 The Future is Self-Supervised, Early Concept Acquisition in Infants
    24:35 The REVOLUTION WILL NOT BE SUPERVISED!
    25:44 Three Challenges for Deep Learning
    30:18 Self-Supervised Learning is Fill-in-the-Blanks
    31:30 Inference and Multimodal Predictions
    33:25 Energy-Based Models “Without Resorting to Probabilities”
    37:33 Gradient-Based Inference
    39:35 Unconditional version of Energy-based Models, how K-Means is an Energy-based Model
    41:26 Latent Variable EBM
    (To Be Continued)

  • @sphereron
    @sphereron 4 years ago +54

    I like the music as an intro, but I think it's okay to fade it out earlier as it distracts from the explanation

    • @machinelearningdojowithtim2898
      @machinelearningdojowithtim2898 4 years ago +7

      I'm really sorry about that; on YouTube and on my phone it now sounds way louder than it did on my main machine. I'll be more careful next time and check the levels on another machine 😁😊👍

    • @csr7080
      @csr7080 4 years ago +2

      @@machinelearningdojowithtim2898 Generally toning down the intro in terms of fast cuts, cut effects and music and making it a bit more 'relaxed' might help. I don't know, I prefer a slightly quieter approach.

    • @machinelearningdojowithtim2898
      @machinelearningdojowithtim2898 4 years ago +2

      @@csr7080 fair enough, let's compromise on the next one ;)

  • @XOPOIIIO
    @XOPOIIIO 4 years ago +40

    The music is too loud, and I don't think it's suitable at all.

  • @charlesfoster6326
    @charlesfoster6326 4 years ago +11

    Loved this discussion! Another way to think of "the manifold" is through D-dimensional heat-maps (or scalar fields). A continuous energy-based model defines one D-dimensional heat-map, and the true data distribution defines another. Energy-based methods hope to make the "cold" (i.e. low-valued) regions in the true map "cold" in the model map. Contrastive methods transfer "heat/coldness" patterns from the true map to the model map. Regularized methods make sure that the model doesn't just make the whole map cold! :)
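    To make the contrastive idea above concrete, here is a minimal illustrative sketch (not from the video; the network, margin loss and toy data are all made up) of pushing energy down on real points and up on random negatives:

        # Illustrative contrastive EBM update: make real data "cold" (low energy)
        # and random negatives "hot" (high energy), up to a margin.
        import math
        import torch
        import torch.nn as nn

        energy_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
        opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)
        margin = 1.0

        def contrastive_step(x_real, x_neg):
            e_real = energy_net(x_real).mean()            # energy of true data
            e_neg = energy_net(x_neg).mean()              # energy of negatives
            # Hinge loss: push e_real down, push e_neg up until it clears the margin.
            loss = e_real + torch.relu(margin - e_neg)
            opt.zero_grad()
            loss.backward()
            opt.step()
            return loss.item()

        # Toy data: real points on a circle, negatives scattered uniformly.
        theta = torch.rand(256, 1) * 2 * math.pi
        x_real = torch.cat([torch.cos(theta), torch.sin(theta)], dim=1)
        x_neg = torch.rand(256, 2) * 4 - 2
        contrastive_step(x_real, x_neg)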

  • @nprithvi24
    @nprithvi24 4 years ago

    This was a wonderful discourse on EBMs. I'm glad I spent enough time to understand the concept of learning manifolds from these guys. Worth it.

  • @IRWBRW964
    @IRWBRW964 4 years ago +5

    I think the music is good, but please lower the volume when someone is speaking.

  • @antonschwarz6685
    @antonschwarz6685 3 years ago +1

    I loved the music, appropriately epic IMO..... Your content is outstanding, sincere thanks to you all.

  • @welcomeaioverlords
    @welcomeaioverlords 4 years ago +2

    I think a key point for casting existing methods into the energy framework is that it allows you to understand that existing methods are particular points on a broader spectrum, and therefore there are gaps between existing methods that could be equally valid and more effective. It wasn't covered here, but I'd like to hear more about how the probability framework results in less smooth surfaces, which might inhibit learning compared to energy-based methods.
    But I think the idea of doing gradient descent as part of inference is a fantastically interesting idea. Combine that with the concept of LISTA, which uses a NN to predict the outcome of this "SGD on Z" process, and this becomes a bit like transitioning from System 2 to System 1. In other words, if you do this "SGD at inference" enough times, you can fit a predictive model to that process. Then, recovering the optimal Z is just another feed-forward inference exercise, which is more like System 1 intuition (along with all the opportunities for mistakes and biases).
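    As a rough illustration of that "SGD on Z" idea (purely a sketch; the module and dimensions are invented, and this is not LeCun's or LISTA's actual code), inference can be written as a small gradient-descent loop over the latent while the weights stay fixed:

        # Gradient-based inference: keep model weights fixed and optimise the
        # latent z to minimise the energy E(x, y, z). A LISTA-style amortiser
        # would be a feed-forward net trained to predict this z directly.
        import torch
        import torch.nn as nn

        class LatentEBM(nn.Module):
            def __init__(self, x_dim=8, y_dim=2, z_dim=4):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(x_dim + y_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

            def forward(self, x, y, z):
                return self.net(torch.cat([x, y, z], dim=-1))   # scalar energy per sample

        def infer_z(model, x, y, steps=50, lr=0.1, z_dim=4):
            z = torch.zeros(x.shape[0], z_dim, requires_grad=True)
            opt = torch.optim.SGD([z], lr=lr)
            for _ in range(steps):                 # the "System 2" optimisation loop
                opt.zero_grad()
                model(x, y, z).sum().backward()
                opt.step()
            return z.detach()

        model = LatentEBM()
        x, y = torch.randn(16, 8), torch.randn(16, 2)
        z_star = infer_z(model, x, y)              # latent that (locally) minimises the energy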

  • @rakeshmallick9161
    @rakeshmallick9161 4 years ago +1

    Yann LeCun has shared this video on his Facebook wall. I will spend today and tomorrow slowly going through the video in phases.

  • @Dougystyle11
    @Dougystyle11 3 years ago

    Super cool talk, thanks!

  • @robbiero368
    @robbiero368 4 years ago +3

    Music ends around 7:54

  • @mohammadxahid5984
    @mohammadxahid5984 4 years ago

    1:30:40 I think language is continuous in the temporal domain, whereas an image may not be continuous in the spatial domain, which makes denoising an image with masked pixels less effective than a masked language model.

  • @theodorosgalanos9663
    @theodorosgalanos9663 4 years ago +2

    Great talk, thank you for your effort and patience!
    I'm curious about the point on using energy during inference, not learning. Could this be related to a sort of 'esoteric inference' the models might do, something akin to the Concept Learning model where the energy function is used during what they call 'execution time' to infer concepts internally out of pairs of input data? I wonder if that makes sense, and whether LeCun had in mind internalising the process of inference in the model as a way to learn more abstract and difficult tasks/concepts, like in that same paper where SGD is used to create an output?

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Yes, that makes sense and you're absolutely on the right track. Of course, we can't speak for LeCun here, but as I imagine it, this (what you're saying) is one of the advantages that these EBMs have. Of course, the power of using something like SGD during inference comes with the cost that you have to train this somehow.

  • @YouLoveMrFriendly
    @YouLoveMrFriendly 4 years ago +1

    I get the feeling that self-supervised learning for images and video is going to take many decades to figure out.

  • @zxl2537
    @zxl2537 2 years ago

    What's the difference between an energy function and a general cost function?

  • @arkasaha4412
    @arkasaha4412 4 years ago +1

    Very informative! I still don't quite get what a manifold is though; can you suggest some good sources? Also, is the manifold somehow related to the loss landscape? Thanks!

    • @machinelearningdojowithtim2898
      @machinelearningdojowithtim2898 4 years ago

      Hey Arka! A manifold is just the set of places where a certain type of data can exist. For example, a 2D plane is a manifold: you can only place points on that plane. Imagine the 3D locations of all the cities in the world; they sit on the surface of a spherical manifold. Real-world data also sits on a manifold, albeit a very complicated one that you couldn't visualise! It's a great way to reason about the inner workings of deep learning models.
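      To make the cities-on-a-sphere picture concrete, here is a tiny illustrative snippet (the numbers are made up): each point is stored with three coordinates, yet only two numbers are actually needed to describe it, so the data lives on a 2-D manifold inside 3-D space.

          # Points on the unit sphere: 3 ambient coordinates, 2 intrinsic ones.
          import numpy as np

          lat = np.random.uniform(-np.pi / 2, np.pi / 2, size=1000)   # intrinsic coord 1
          lon = np.random.uniform(-np.pi, np.pi, size=1000)           # intrinsic coord 2
          xyz = np.stack([np.cos(lat) * np.cos(lon),
                          np.cos(lat) * np.sin(lon),
                          np.sin(lat)], axis=1)                       # shape (1000, 3)

          print(xyz.shape)                                            # (1000, 3)
          print(np.allclose(np.linalg.norm(xyz, axis=1), 1.0))        # True: all on the sphere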

  • @rytiskazimierasjonynas4561

    Do you mind re-uploading the video without the music? I was eager to watch this, but the music made me quit after the first minute.

  • @andres_pq
    @andres_pq 3 years ago

    36:23 Why is energy used for inference but not for learning? Also, Yannic is amazing at reducing complicated topics into understandable terms!

  • @vinca43
    @vinca43 3 years ago

    The introduction answers the long-standing question: what would happen if a bunch of computer scientists DJ'd a rave?
    I'm definitely on board the machine learning rave train.

  • @Lupobass1
    @Lupobass1 4 years ago +6

    Bro you are going to make me write an AI to remove background music

  • @Georgesbarsukov
    @Georgesbarsukov 2 years ago

    Why is there background music? :(((((((

  • @snippletrap
    @snippletrap 4 years ago

    I disagree with Yannic's assertion that System 1 and System 2 are arbitrary distinctions. We are talking about computing systems here, and there are different kinds of computer / data pairs. System 1 corresponds to DFAs / regular languages, while System 2 is context-free or higher. Hierarchical decomposition, as in planning, amounts to grammatical parsing, which is fundamentally distinct from regex- and CNN-style pattern matching.
    It is true that some tasks originally learned in System 2 can eventually be distilled and passed down to System 1, since every regular language is also a context-free language. But it is also true that some System 2 tasks can never be passed down to System 1. For example, multiplying 8-digit numbers, composing Petrarchan sonnets, proving theorems, remembering what your wife just said, or writing in Assembly. There is no "muscle memory" for higher-level tasks like these -- they require sustained conscious attention, just as every computer more powerful than a DFA requires memory.
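    A standard toy example of the gap being described (a sketch, not from the comment): checking balanced parentheses is context-free but not regular, because it needs an unbounded counter that no finite automaton, regex or fixed-depth pattern matcher provides.

        # Balanced parentheses need "memory": a single counter acting as a stack depth.
        def balanced(s: str) -> bool:
            depth = 0
            for ch in s:
                if ch == '(':
                    depth += 1
                elif ch == ')':
                    depth -= 1
                    if depth < 0:            # closed a parenthesis we never opened
                        return False
            return depth == 0

        assert balanced("(()())")
        assert not balanced("(()")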

  • @snippletrap
    @snippletrap 4 years ago

    Agree with Yannic's point about babies. It's like asking, "How do spiders learn how to spin webs so quickly?"

  • @josephgardi7522
    @josephgardi7522 2 years ago

    Multi-task learning is way more straightforward, and the tech is already working well. Unsupervised learning will never be optimal because it has no ability to discard information that is not relevant to the tasks we care about.

  • @BiancaAguglia
    @BiancaAguglia 4 years ago +3

    There is a story about John von Neumann and a-bit-tricky-yet-very-simple problem: train stations A and B are 150 miles apart. Train T1 goes from station A to station B, and train T2 goes from station B to station A. Both trains travel at 75 miles/hour and they leave at the same time. A fly, initially on train T1, flies towards T2 at 50 miles / hour. When it reaches T2 it instantly turns back and flies towards T1 (and the pattern repeats). How many miles does the fly travel before it meets its inevitable fate? 😊
    The story is that von Neumann instantly answered 50 miles. When asked how he did it, he said: "I summed the infinite series, of course." 😁 This story is often told to illustrate von Neumann's genius, but I think it should also be seen as a reminder that genius can sometimes overlook the simplest and most elegant solutions.
    I don't understand most of the things discussed in this video (I'm not at that level yet) but, from having watched Yann LeCun's past talks, I wonder if sometimes the experts are too smart to see the "dumb" solutions 😁 Of course, I have no clue. I'll keep learning though. I really enjoyed watching your conversation and seeing you try to figure out the workings of a brilliant mind. It's very helpful.

    • @MachineLearningStreetTalk
      @MachineLearningStreetTalk 4 years ago +1

      That's a wonderful anecdote Bianca! Thanks for sharing!

    • @aBigBadWolf
      @aBigBadWolf 4 years ago +1

      50 miles makes no sense. You must have memorized the wrong numbers.

    • @benbridgwater6479
      @benbridgwater6479 4 years ago +2

      The answer is 50 miles, but the point of the story is that von Neumann, who solved it immediately, didn't use the "aha" method (the trains meet in 1 hour, ergo the fly travels 50 miles since it's flying at 50 mph). Von Neumann instead instantly formulated and summed the infinite series in his head.

    • @aBigBadWolf
      @aBigBadWolf 4 years ago +2

      ​@@benbridgwater6479 Suddenly it stops when trains meet? What infinite series?
      Edit: So I looked it up. The fly is supposed to be faster than the train(!) and pinball between the trains until it gets squashed(!). With that info it makes sense. Here: www.infiltec.com/j-logic.htm
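      For what it's worth, here is a quick numerical check (illustrative only; the figures on the linked page may differ) with one standard set of numbers where the fly is faster than the trains: the bounce-by-bounce infinite series and the one-line shortcut give the same answer.

          # Trains 150 miles apart, each at 75 mph; fly at 100 mph.
          # Shortcut: the trains meet after 1 hour, so the fly covers 100 miles.
          gap, train_v, fly_v = 150.0, 75.0, 100.0   # gap = current distance between trains

          total = 0.0
          for _ in range(60):                        # enough bounces to converge
              dt = gap / (fly_v + train_v)           # time until the fly meets the oncoming train
              total += fly_v * dt                    # distance flown on this leg
              gap -= 2 * train_v * dt                # trains close in on each other meanwhile
          print(total)                               # ~100.0 miles, matching the shortcut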

    • @BiancaAguglia
      @BiancaAguglia 4 years ago +1

      @@aBigBadWolf 😁Yes and no. I got the wrong numbers but not because I memorized them wrongly. I didn't memorize them at all. It was the logic of the story that I remembered. Unfortunately I forgot the fly had to be faster than the trains for this to make sense at all. 😁

  • @Hannah-cb7wr
    @Hannah-cb7wr 4 years ago +16

    The music makes this unwatchable

  • @konataizumi5829
    @konataizumi5829 4 years ago +1

    Yeah, like others have said, this sucks