Applied ML 2020 - 10 - Calibration, Imbalanced data

  • Added 6. 09. 2024

Comments • 41

  • @jeandersonbc
    @jeandersonbc 4 years ago +9

    I was just discussing this topic with my advisor. This is what I call perfect timing :D Thank you very much for sharing high-quality content on the internet! +1 subscribed

    • @AndreasMueller
      @AndreasMueller  4 years ago

      I'm glad if it helps! This lecture still needs a bit of polish, though I hope it has some good pointers.

    • @walkingdad1806
      @walkingdad1806 4 years ago +1

      @@AndreasMueller , could you please explain the difference between underconfident and overconfident classifiers in terms of predicting classes 0 and 1?

    • @JoaoVitorBRgomes
      @JoaoVitorBRgomes 3 years ago

      @@walkingdad1806 At around 11:10 he says the data point on the x axis is the bin center, so I think there is an error in the slide.

  • @Users_291w
    @Users_291w 4 years ago +1

    I was working on imbalanced data. The video is great. Thanks for making the content publicly available.

  • @AndreasMueller
    @AndreasMueller  4 years ago +2

    As I mentioned, there was a bug in the balanced bagging classifier and the results are better than undersampling. The updated results are at amueller.github.io/COMS4995-s20/slides/aml-10-calibration-imbalanced-data/#45
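
    A minimal sketch of the balanced bagging idea referenced above, assuming the imbalanced-learn package (which provides BalancedBaggingClassifier) is installed; the dataset and parameters are illustrative, not the ones from the lecture.

    ```python
    # Balanced bagging: each estimator in the ensemble is trained on a
    # resampled bootstrap sample with balanced classes (imbalanced-learn).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from imblearn.ensemble import BalancedBaggingClassifier

    # Synthetic imbalanced problem with roughly 5% positives (illustrative).
    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

    clf = BalancedBaggingClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, scoring="average_precision", cv=5)
    print(scores.mean())
    ```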

  • @mabk1196
    @mabk1196 3 years ago +5

    @Andreas Mueller at the very beginning: where do those numbers 0.16, 0.5, 0.84 come from? If they are averaged probabilities, they should be 0.26, 0.5 and 0.85...

  • @elvisdias5094
    @elvisdias5094 4 years ago

    Didn't get much of the multiclass calibration, but the balancing with that extra library was what I needed!! Thank you so much for these recorded lectures!

  • @offchan
    @offchan 2 years ago

    32:40 I've been trying to get my head around this fitting and I have the exact same question about these points that are stuck at the top and bottom of the plot. Thanks for mentioning that.

  • @majusumanto9016
    @majusumanto9016 4 years ago +3

    Hi sir, can you explain how to calculate the numbers inside the parentheses? (0.16, 0.5, 0.84)

  • @marianazari8301
    @marianazari8301 2 years ago

    Really great explanation, I loved the video, thank you so much for this!

  • @AkshayKumar-xo2sk
    @AkshayKumar-xo2sk 2 years ago

    @Andreas Mueller - In the topmost bin, should the frequency of 1's be two? There are two 1's.

  • @Han-ve8uh
    @Han-ve8uh 3 years ago +1

    Why do we need to calibrate? I can't find any sources explaining its practical use. Since calibration is a monotonic transformation that doesn't change the ranks of the results, I would expect it does not affect decision making at all? (I'm assuming people make decisions simply based on ranked choices.) What are some real-life scenarios where getting the exact probability right is so important? Or is it something of a "making the stats fit some theory better" kind of thing?

    • @AndreasMueller
      @AndreasMueller  3 years ago +2

      There are two very common practical use-cases: one is communicating predictions. Imagine going to a hospital and the diagnosis is "of the 100 people we looked at today, you ranked 89th in likelihood to have cancer". That seems basically useless as far as information goes. Similarly practically important is making cost-based decisions (where cost could be dollars or hours worked or lives saved). Imagine knowing the cost of making a false negative or a false positive - or the win you get from making a true positive or true negative. It's actually quite common to have at least approximate knowledge of these costs. In this case, you need probabilities to translate the costs into a decision rule. Hth!
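
      For concreteness, a rough sketch of the cost-based decision rule described above; the cost figures are invented for illustration.

      ```python
      # Turning (assumed) misclassification costs into a probability threshold.
      cost_false_positive = 10.0   # e.g. cost of an unnecessary follow-up
      cost_false_negative = 200.0  # e.g. cost of a missed case

      # Expected cost of flagging a case:     (1 - p) * cost_false_positive.
      # Expected cost of not flagging a case:  p * cost_false_negative.
      # Flagging is cheaper whenever p exceeds this threshold:
      threshold = cost_false_positive / (cost_false_positive + cost_false_negative)
      print(threshold)  # ~0.048, so well-calibrated low probabilities still matter

      def decide(p):
          """Flag a case if its calibrated probability exceeds the threshold."""
          return p > threshold
      ```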

    • @Han-ve8uh
      @Han-ve8uh 3 years ago

      @@AndreasMueller Thinking about this again, I have some ideas. Maybe one reason for calibrating is when the same person is presented with two different probabilities from two different classifiers, and he needs to resolve this inconsistency to know which number to trust. Another reason is people may have a personal threshold for taking action, maybe 70%, and if calibrating moved a prediction from 65 to 75 or vice versa, that may change whether they act.
      Great point, I forgot about incorporating costs. Can I see accurate probabilities as important for calculating the expected value of a single customer (represented by a single input vector to be predicted), like EV = Prob(response) x profit + (1 - Prob(response)) x cost? In this case, over/under-estimating probabilities could lead to worse decisions compared to calculating EV from probabilities provided by calibrated models?
      What do you think of the above two paragraphs?
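
      A small numeric sketch of the expected-value calculation in this comment; the profit and cost figures are made up.

      ```python
      # EV = p * profit + (1 - p) * (-cost), as written in the comment above.
      profit_if_response = 50.0  # profit when a contacted customer responds
      cost_of_contact = 2.0      # cost of contacting a customer

      def expected_value(p_response):
          return p_response * profit_if_response - (1 - p_response) * cost_of_contact

      # An over- or under-estimated probability can flip the decision:
      print(expected_value(0.10))  # positive: contacting is worthwhile
      print(expected_value(0.02))  # negative: contacting is not worthwhile
      ```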

    • @AndreasMueller
      @AndreasMueller  3 years ago

      @@Han-ve8uh yes that's it.
      I was a bit abstract with the point about costs but you got it exactly right!

    • @Corpsecreate
      @Corpsecreate 9 months ago

      It's not needed. The idea of calibration comes from a very pervasive misunderstanding of the basics of classification modelling.

  • @shnibbydwhale
    @shnibbydwhale 2 years ago

    Great lecture. One thing I am struggling with is the part at the beginning where you said that you can have a model with very well calibrated probabilities, but that the model can also be bad at making predictions or have a low accuracy/recall etc. If the probabilities are well calibrated and representative of the true probabilities, how can the model be bad at correctly classifying the data?

    • @AndreasMueller
      @AndreasMueller  a year ago

      Not sure why I missed this question. Basically, if you have two balanced classes and a classifier predicts a probability of 0.5 for class 1 for every data point, the classifier is perfectly calibrated: for every point it says it's 50% certain that it's class 1, and it's correct in 50% of cases, so it perfectly reflects its own uncertainty. Yet it cannot separate the two classes at all, so as a classifier it is no better than guessing.
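
      A tiny sketch of this point with synthetic balanced labels: a constant 0.5 prediction is perfectly calibrated yet has no ranking ability at all.

      ```python
      import numpy as np
      from sklearn.metrics import roc_auc_score

      rng = np.random.default_rng(0)
      y = rng.integers(0, 2, size=10_000)  # balanced binary labels
      p = np.full(y.shape, 0.5)            # constant predicted probability

      print(p.mean(), y.mean())   # 0.5 vs ~0.5: the predictions are calibrated
      print(roc_auc_score(y, p))  # 0.5: no ability to separate the classes
      ```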

  • @danielbaena4691
    @danielbaena4691 2 years ago

    Thanks for the video!

  • @yussy552
    @yussy552 3 years ago

    Thank you so much for making these lectures public. Great lecture! If I am training my model with stratified cross-validation, doesn't that deal with the imbalance? How are these more elaborate techniques different? Thanks

    • @AndreasMueller
      @AndreasMueller  3 years ago +1

      That depends a lot on what you mean by "dealing with". In scikit-learn, stratified cross validation actually does not do any undersampling or oversampling but instead ensures that the class proportions are stable across the folds. That means that if the data is imbalanced, then each split will be imbalanced in the same way. The goal of that is to provide a more stable and reliable estimate of generalization performance given the imbalance.
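
      A minimal sketch of this behaviour on synthetic data: StratifiedKFold does not resample, it only keeps each fold's class ratio close to the overall one.

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import StratifiedKFold

      # Synthetic data with roughly 10% positives (illustrative).
      X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

      for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
          # Every fold reproduces roughly the same ~10% positive rate.
          print(y[test_idx].mean())
      ```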

  • @shubhamtalks9718
    @shubhamtalks9718 3 years ago +1

    6:57 I did not understand how the expected positive fraction for bin 0 is 0.16, bin 1 is 0.5 and bin 2 is 0.84?

    • @AndreasMueller
      @AndreasMueller  3 years ago

      It's the midpoints of the bins (which is the same as their average value), for the bins [0, 1/3], [1/3, 2/3], [2/3, 1].

    • @shubhamtalks9718
      @shubhamtalks9718 3 years ago

      @@AndreasMueller Are the bins created at equal intervals, or does each bin contain the same number of data points?

    • @AndreasMueller
      @AndreasMueller  3 years ago

      @@shubhamtalks9718 Equal intervals, they are just uniformly spaced. And in an actual application you would usually use at least 10 bins, but I simplified to 3 here for illustration purposes.
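
      A small numeric check of where the quoted values come from, assuming three equal-width bins on [0, 1]:

      ```python
      import numpy as np

      edges = np.linspace(0, 1, 4)            # [0, 1/3, 2/3, 1]
      centers = (edges[:-1] + edges[1:]) / 2  # midpoint of each bin
      print(centers.round(2))                 # [0.17 0.5  0.83], quoted on the
                                              # slide as 0.16, 0.5 and 0.84
      ```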

    • @shubhamtalks9718
      @shubhamtalks9718 3 years ago

      @@AndreasMueller Got it. Thanks for the wonderful lecture.

  • @AkshayKumar-xo2sk
    @AkshayKumar-xo2sk 2 years ago

    How did you get the 16, 50 and 84% values? I mean, for each bin you have a different percentage value. How did you get that?

    • @AnujKatiyal
      @AnujKatiyal 2 years ago +1

      Three equal buckets: 0-33, 33-67, 67-100. The means of these buckets are 16, 50 and 84.

  • @chiragsharma9430
    @chiragsharma9430 2 years ago

    Can we use calibrated classifiers for multi-class classification problems?
    If yes, can you please provide a Jupyter notebook demonstrating that?
    And thanks for uploading these videos.

    • @AndreasMueller
      @AndreasMueller  2 years ago

      There's an example in the scikit-learn documentation: scikit-learn.org/stable/auto_examples/calibration/plot_calibration_multiclass.html you can download it as a notebook at the bottom.
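
      For reference, a minimal sketch of multi-class calibration with scikit-learn's CalibratedClassifierCV; the dataset and base estimator here are arbitrary choices, and the linked example is more complete.

      ```python
      from sklearn.calibration import CalibratedClassifierCV
      from sklearn.datasets import load_iris
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      X, y = load_iris(return_X_y=True)  # three classes
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      # Each class is calibrated one-vs-rest; probabilities are renormalized.
      clf = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                   method="sigmoid", cv=5)
      clf.fit(X_train, y_train)
      print(clf.predict_proba(X_test)[:3])
      ```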

  • @mdichathuranga1
    @mdichathuranga1 2 years ago

    So if you had 10 data points which the model predicted as True, and the mean predicted probability of those 10 data points is 0.95, but when we manually check them we find that only 8 of them are actually true (a fraction of 0.8), then we can conclude that for the data points in the 0.8 - 1 bin the model was over-confident... Am I right?
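
    A small numerical version of the check described in this comment; the predictions and labels are made up to reproduce the 0.95-vs-0.8 situation.

    ```python
    import numpy as np

    # Ten points the model placed in the top bin, with made-up probabilities.
    p = np.array([0.97, 0.96, 0.95, 0.95, 0.95, 0.95, 0.94, 0.94, 0.95, 0.94])
    y = np.array([1,    1,    1,    1,    1,    1,    1,    1,    0,    0])

    print(p.mean())  # 0.95: mean predicted probability in the 0.8-1.0 bin
    print(y.mean())  # 0.80: observed fraction of positives in that bin
    # Predicted exceeds observed, so the model is over-confident in this bin.
    ```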

  • @Corpsecreate
    @Corpsecreate 9 months ago

    You don't need calibration ever. This is so silly haha

  • @tahirullah4786
    @tahirullah4786 3 years ago

    That's great, but where do we get the code for this video?

    • @AndreasMueller
      @AndreasMueller  3 years ago +1

      Link to the material is in the description. This lecture is at github.com/amueller/COMS4995-s20/tree/master/slides/aml-10-calibration-imbalanced-data

  • @teetanrobotics5363
    @teetanrobotics5363 4 years ago

    Sir, could you please upload the theoretical machine learning course counterpart?

    • @yuhuang8447
      @yuhuang8447 4 years ago +2

      Hi, I think he only gives lectures on the applied part, and the theoretical part is given by other professors.

  •  4 years ago

    Awesome! Keep it up! Would you like to be YouTube friends? :]

  • @Corpsecreate
    @Corpsecreate 9 months ago

    47:80 I can help you with that. These methods NEVER help.