25. Stochastic Gradient Descent

  • Date added: 24. 07. 2024
  • MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018
    Instructor: Suvrit Sra
    View the complete course: ocw.mit.edu/18-065S18
    YouTube Playlist: • MIT 18.065 Matrix Meth...
    Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems.
    License: Creative Commons BY-NC-SA
    More information at ocw.mit.edu/terms
    More courses at ocw.mit.edu

Comments • 72

  • @elyepes19
    @elyepes19 3 years ago +19

    For those of us who are newcomers to ML, it's most enlightening to learn that unlike "pure optimization," which aims to find the most exact minimum possible, ML aims instead to get "close enough" to the minimum in order to train the ML model; if you get too close to the minimum, you might overfit your training data. Thank you so much for the clarification.
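
    A minimal sketch of the "close enough" point above, assuming a least-squares model, synthetic data, and a held-out validation set (all illustrative, not from the lecture): SGD is stopped once the validation loss stops improving, rather than run to the exact training minimum.

        import numpy as np

        rng = np.random.default_rng(0)

        # Illustrative data: y = 2x + noise, split into train and validation sets.
        x = rng.normal(size=200)
        y = 2.0 * x + 0.5 * rng.normal(size=200)
        x_train, y_train = x[:150], y[:150]
        x_val, y_val = x[150:], y[150:]

        w = 0.0                       # scalar weight to learn
        step = 0.01                   # fixed step size
        best_val, best_w, patience = np.inf, w, 5

        for epoch in range(100):
            for i in rng.permutation(len(x_train)):   # SGD: one random sample per step
                grad_i = 2.0 * (w * x_train[i] - y_train[i]) * x_train[i]
                w -= step * grad_i
            val_loss = np.mean((w * x_val - y_val) ** 2)
            if val_loss < best_val - 1e-6:            # still improving on held-out data
                best_val, best_w, patience = val_loss, w, 5
            else:
                patience -= 1
                if patience == 0:                     # stop before reaching the exact training minimum
                    break

        print(best_w, best_val)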

  • @rogiervdw
    @rogiervdw 4 years ago +33

    This is truly remarkable teaching. It greatly helps understanding and intuition of what SGD actually does. Prof. Sra's proof of SGD convergence for non-convex optimization is in Prof. Strang's excellent book "Linear Algebra & Learning From Data", p. 365.

  • @rembautimes8808
    @rembautimes8808 2 years ago +4

    Amazing for MIT to make such high-quality lectures available worldwide. Well worth the time investment to go through these lectures. Thanks Prof. Strang, Prof. Suvrit, and MIT.

  • @trevandrea8909
    @trevandrea8909 24 days ago

    I love the way the professor teaches in this lecture and video. Thank you so much!

  • @Vikram-wx4hg
    @Vikram-wx4hg 3 years ago +4

    What a beautiful beautiful lecture!
    Thank you Prof. Suvrit!

  • @BananthahallyVijay
    @BananthahallyVijay 2 years ago

    Wow! That was one great talk. Prof. Suvrit Sra did a great job of giving examples just light enough to drive home the key ideas of SGD.

  • @schobihh2703
    @schobihh2703 9 months ago +1

    MIT is simply the best teaching around. Really deep insights again. Thank you.

  • @sukhjinderkumar2723
    @sukhjinderkumar2723 2 years ago +2

    Hands down one of the most interesting lectures. The way the Professor showed research ideas here and there and almost everywhere just blows me away. It was very, very interesting, and the best part is that it is approachable for non-math people too (though this is coming from a maths guy; the math part was kept light, and it leaned more toward the intuitive side of SGD).

  • @cobrasetup703
    @cobrasetup703 2 years ago +1

    Amazing lecture, I am delighted by the smooth explanation of this complex topic! Thanks.

  • @JatinThakur-dv7mt
    @JatinThakur-dv7mt a year ago +8

    Sir, you were a student at Lalpani School, Shimla. You were the topper in +2. I am very happy for you. You have reached a level where you truly belong. I wish you more and more success.

    • @ASHISHDHIMAN1610
      @ASHISHDHIMAN1610 a year ago

      I am from Nahan, and I’m watching this from Ga Tech :)

  • @minimumlikelihood6552

    That was the kind of lecture that deserved applause!

  • @rababmaroc3354
    @rababmaroc3354 4 years ago

    Well explained, thank you very much, professor.

  • @jfjfcjcjchcjcjcj9947
    @jfjfcjcjchcjcjcj9947 4 years ago +1

    Very clear and nice, to the point.

  • @scorpio19771111
    @scorpio19771111 2 years ago

    Good lecture. Intuitive explanations with specific illustrations

  • @tmusic99
    @tmusic99 2 years ago

    Thank you for an excellent lecture! It gives me a clear track for development.

  • @RAJIBLOCHANDAS
    @RAJIBLOCHANDAS 2 years ago +1

    Really extraordinary lecture. Very lucid and highly interesting. My research is on adaptive signal processing; however, I enjoyed this lecture most. Thank you.

  • @NinjaNJH
    @NinjaNJH 4 years ago +2

    Very helpful, thanks! ✌️

  • @holographicsol2747
    @holographicsol2747 2 years ago

    Thank you, you are an excellent teacher and I learned a lot. Thank you.

  • @nayanvats3424
    @nayanvats3424 4 years ago +1

    couldn't have been better....great lecture.... :)

  • @georgesadler7830
    @georgesadler7830 2 years ago +1

    Professor Suvrit Sra, thank you for a beautiful lecture on Stochastic Gradient Descent and its impact on machine learning. This powerful lecture helped me understand something about machine learning and its overall impact on large companies.

  • @taasgiova8190
    @taasgiova8190 2 years ago

    Fantastic, excellent lecture thank you.

  • @KumarHemjeet
    @KumarHemjeet 3 years ago

    What an amazing lecture !!

  • @josemariagarcia9322
    @josemariagarcia9322 4 years ago

    Simply brilliant

  • @benjaminw.2838
    @benjaminw.2838 8 months ago

    Amazing class! Not only for ML researchers but also for ML practitioners.

  • @anadianBaconator
    @anadianBaconator 3 years ago +1

    this guy is fantastic!

  • @hj-core
    @hj-core 9 months ago

    An amazing lecture!

  • @vinayreddy8683
    @vinayreddy8683 4 years ago

    The Prof assumed all the variables are scalars, so while moving the loss downhill toward a local minimum, how is the loss function guided to the minimum without any direction (scalars have no direction)?

  • @BorrWick
    @BorrWick 4 years ago +2

    I think there is a very small mistake in the graph of (a_i*x - b_i)^2: the bound of the confusion area is not a_i/b_i but b_i/a_i.
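
    A quick numerical check of that correction, with illustrative a_i and b_i (not the lecture's numbers): each term (a_i*x - b_i)^2 has its minimum at x = b_i/a_i, so the region of confusion runs from min(b_i/a_i) to max(b_i/a_i).

        import numpy as np

        a = np.array([1.0, 2.0, 4.0])       # illustrative coefficients a_i
        b = np.array([3.0, 2.0, 8.0])       # illustrative targets b_i

        minima = b / a                      # each (a_i*x - b_i)^2 is minimized at b_i/a_i
        print(minima)                       # [3. 1. 2.]
        print(minima.min(), minima.max())   # region of confusion: [1.0, 3.0]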

  • @gwonchanyoon7748
    @gwonchanyoon7748 2 months ago

    Beautiful classroom!

  • @TrinhPham-um6tl
    @TrinhPham-um6tl 3 years ago

    Just a little typo that I came across throughout this perfect lecture, in the "confusion region": min(a_i/b_i) and max(a_i/b_i) should be min(b_i/a_i) and max(b_i/a_i).
    Generally speaking, this lecture is the best explanation of SGD I have ever seen. Again, thank you Prof. Sra and thank you MIT OpenCourseWare so so much 👍👏
    P.S.: All the other resources I've read explain SGD so complicatedly 😔

  • @pbawa2003
    @pbawa2003 2 years ago

    This is a great lecture, though it took me a little time to prove that the full gradient lies within the region of confusion, with the min and max being individual sample gradients.

  • @cevic2191
    @cevic2191 2 years ago

    Many thanks Great!!!

  • @fatmaharman3842
    @fatmaharman3842 4 years ago

    excellent

  • @haru-1788
    @haru-1788 2 years ago

    Marvellous!!!

  • @notgabby604
    @notgabby604 a year ago

    Very nice lecture. I will seemingly go off topic here and say that an electrical switch is one-to-one when on and zero out when off. When on, 1 volt in gives 1 volt out, 2 volts in gives 2 volts out, etc.
    ReLU is one-to-one when its input x is >= 0 and zero out otherwise.
    To convert a switch to a ReLU you just need an attached switching decision x >= 0.
    Then a ReLU neural network is composed of weighted sums that are connected to and disconnected from each other by the switch decisions. Once the switch states are known, you can simplify the weighted-sum composites using simple linear algebra: each neuron output anywhere in the net is some simple weighted sum of the input vector.
    AI462 blog.
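
    A small sketch of that last point, with illustrative weights and a single hidden layer: once the on/off switch states of the ReLUs are fixed for a given input, the whole network collapses to one plain linear map of the input vector.

        import numpy as np

        rng = np.random.default_rng(1)
        W1 = rng.normal(size=(5, 3))             # hidden-layer weights (illustrative)
        W2 = rng.normal(size=(1, 5))             # output-layer weights (illustrative)
        x = rng.normal(size=3)

        pre = W1 @ x
        hidden = np.maximum(pre, 0.0)            # ReLU: pass-through when pre >= 0, zero otherwise
        y = W2 @ hidden                          # network output

        D = np.diag((pre >= 0).astype(float))    # the switch states for this particular x
        y_collapsed = (W2 @ D @ W1) @ x          # one simple weighted sum of the input

        print(np.allclose(y, y_collapsed))       # True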

  • @3g1991
    @3g1991 4 years ago +7

    Does anyone have the proof he didn't have time for, regarding stochastic gradient descent in the non-convex case?

  • @xiangyx
    @xiangyx 3 years ago

    fantastic

  • @fishermen708
    @fishermen708 5 years ago +1

    Great.

  • @grjesus9979
    @grjesus9979 a year ago

    So, when using TensorFlow or Keras, when you set batch size = 1 there are as many iterations as samples in the entire training dataset. So my question is: where does the randomness in "stochastic" gradient descent come from?
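
    If it helps: with batch size = 1 you do get one update per sample per epoch, and the randomness is in the order the samples are visited. Keras's model.fit shuffles the training data every epoch by default (shuffle=True). A bare-bones NumPy sketch of that loop, with illustrative data (not the Keras internals):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 1))
        y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

        w, step = 0.0, 0.05
        for epoch in range(10):
            order = rng.permutation(len(X))      # the "stochastic" part: a fresh random order each epoch
            for i in order:                      # batch size = 1: one sample per gradient step
                grad_i = 2.0 * (w * X[i, 0] - y[i]) * X[i, 0]
                w -= step * grad_i

        print(w)                                 # close to 3.0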

  • @MohanLal-of8io
    @MohanLal-of8io 4 years ago +3

    What GUI software is Professor Suvrit using to change the step size instantly?

    • @brendawilliams8062
      @brendawilliams8062 2 years ago

      I don’t know but it would have to transpose numbers of a certain limit it seems to me.

  • @JTFOREVER26
    @JTFOREVER26 3 years ago

    Can anyone here explain how, in the one-dimensional example, choosing a point outside R guarantees that the stochastic gradient and the full gradient have the same sign? (corresponding to 30:30 - 31:00 ish in the video) Thanks in advance!

    • @ashrithjacob4701
      @ashrithjacob4701 a year ago

      f(x) can be thought of as a sum of quadratic functions (each corresponding to one data point), each with its minimum at b_i/a_i. When we are outside the region R, the minima of all the functions lie on the same side of where we are, and as a result all their gradients have the same sign.
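
      A quick numerical check of this answer, with illustrative a_i and b_i: outside R = [min(b_i/a_i), max(b_i/a_i)], every per-sample gradient 2*a_i*(a_i*x - b_i) has the same sign as the full gradient.

          import numpy as np

          a = np.array([1.0, 2.0, 4.0])             # illustrative coefficients a_i
          b = np.array([3.0, 2.0, 8.0])             # illustrative targets b_i
          lo, hi = (b / a).min(), (b / a).max()     # region of confusion R

          for x in [lo - 1.0, hi + 1.0]:            # two points outside R
              per_sample = 2.0 * a * (a * x - b)    # the individual (stochastic) gradients
              full = per_sample.sum()               # the full gradient
              print(np.all(np.sign(per_sample) == np.sign(full)))   # True, True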

  • @kethanchauhan9418
    @kethanchauhan9418 4 years ago +1

    What is the best book or resource to learn the whole mathematics behind stochastic gradient descent?

    • @mitocw
      @mitocw  4 years ago +4

      The textbook listed in the course is: Strang, Gilbert. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019. ISBN: 9780692196380. See the course on MIT OpenCourseWare for more information at: ocw.mit.edu/18-065S18.

    • @brendawilliams8062
      @brendawilliams8062 2 years ago

      Does this view and leg of math believe there is an unanswered Riemann hypothesis?

  • @watcharakietewongcharoenbh6963

    How can we find his 5-line proof of why SGD works? It is fascinating.

  • @neoneo1503
    @neoneo1503 2 years ago

    "Shuffle" in practice vs. "random pick" in theory, at 42:00.

  • @Tevas25
    @Tevas25 4 years ago

    A link to the MATLAB simulation Prof. Suvrit shows would be great.

    • @techdo6563
      @techdo6563 4 years ago +14

      fa.bianp.net/teaching/2018/COMP-652/
      found it

    • @SaikSaketh
      @SaikSaketh 4 years ago

      @@techdo6563 Awesome

    • @medad5413
      @medad5413 3 years ago

      @@techdo6563 thank you

  • @shivamsharma8874
    @shivamsharma8874 4 years ago

    Please share the slides of this lecture.

    • @mitocw
      @mitocw  4 years ago +2

      It doesn't look like there are slides available. I see a syllabus, instructor insights, problem sets, readings, and a final project. Visit the course on MIT OpenCourseWare to see what materials we have at: ocw.mit.edu/18-065S18.

    • @vinayreddy8683
      @vinayreddy8683 4 years ago +3

      Take screenshots and prepare them yourself!!!

  • @sadeghadelkhah6310
    @sadeghadelkhah6310 2 years ago

    10:31 the [INAUDIBLE] thing is "Weight".

    • @mitocw
      @mitocw  2 years ago

      Thanks for the feedback! The caption has been updated.

  • @tuongnguyen9391
    @tuongnguyen9391 a year ago

    Where can I obtain Professor Sra's slides?

    • @mitocw
      @mitocw  a year ago +1

      The course does not have slides of the presentations. The materials that we do have (problem sets, readings) are available on MIT OpenCourseWare at: ocw.mit.edu/18-065S18. Best wishes on your studies!

    • @tuongnguyen9391
      @tuongnguyen9391 a year ago +1

      @@mitocw Thank you, I guess I just noted everything down.

  • @akilarasan3288
    @akilarasan3288 9 months ago

    I would use MCMC to estimate the sum over n terms, to answer the question at 14:00.

  • @robmarks6800
    @robmarks6800 2 years ago

    Leaving the proof as a cliffhanger, almost worse than Fermat…

    • @papalau6931
      @papalau6931 a year ago

      You can find the proof by Prof. Suvrit Sra in Prof. Gilbert Strang's book titled "Linear Algebra and Learning from Data".

  • @brendawilliams8062
    @brendawilliams8062 2 years ago

    It appears that from engineering math view that there’s the problem.

  • @SHASHANKRUSTAGII
    @SHASHANKRUSTAGII 3 years ago

    Andrew Ng didn't explain it in this detail.
    That is why MIT is MIT.
    Thanks, professor.

  • @ac2italy
    @ac2italy 3 years ago +1

    He cited images as an example of a large feature set: nobody uses standard ML for images, we use convolutions.

    • @elyepes19
      @elyepes19 3 years ago +1

      I understand he is referring to Convolutional Neural Networks as a tool for image analysis as a generalized example

  • @jasonandrewismail2029
    @jasonandrewismail2029 11 months ago

    DISAPPOINTING LECTURE. BRING BACK THE PROFESSOR