
Undersampling for Handling Imbalanced Datasets | Python | Machine Learning

  • Added 4 May 2019
  • Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, as we have enough examples for all the different cases. For example, when learning a spam filter, we should have a good amount of data corresponding to both spam and non-spam emails.
    This even distribution is not always possible. I'll discuss one of the techniques, known as Undersampling, that helps us tackle this issue.
    Undersampling is one of the techniques used for handling class imbalance. In this technique, we undersample the majority class to match the minority class.
    If you have any questions about what we covered in this video, feel free to ask in the comment section below and I'll do my best to answer them.
    If you enjoy these tutorials and would like to support them, the easiest way is to simply like the video and give it a thumbs up; it's also a huge help to share these videos with anyone who you think would find them useful.
    Please consider clicking the SUBSCRIBE button to be notified of future videos, and thank you all for watching.
    You can find me on:
    GitHub - github.com/bha...
    Medium - / bhattbhavesh91
    #ClassImbalance #Undersampling #machinelearning #python #deeplearning #datascience #youtube
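The random undersampling described above can be sketched in a few lines of pandas (the synthetic data and column names below are illustrative, not the dataset used in the video):

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced dataset: 900 majority (0) vs 100 minority (1) rows
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})

minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

# Randomly undersample the majority class down to the minority class size
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

print(balanced["target"].value_counts())  # 100 rows of each class
```

The trade-off is that the dropped majority rows are simply thrown away, which is why undersampling is usually reserved for datasets where the majority class is large enough to spare them.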

Comments • 50

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago +2

    This is awesome. Please market more. The likes and comments don't justify the kind of work you're doing. It might happen that you stop making frequent videos for obvious reasons, but I'd like to tell you that I personally really liked your videos, and your teaching style is straightforward and lucid. Thanks

  • @angelsandemons
    @angelsandemons 2 years ago

    Amazing video, great teaching style. I struggled for hours and finally found this gem of a video. Thank you so much!!

  • @joseluismanzanares3662

    Great. Very useful. I'm just facing this issue with a target variable in a classification model for lung cancer. THANK YOU

  • @ruhinehri5607
    @ruhinehri5607 2 years ago

    Awesome explanation... I was really struggling to balance a dataset... This video made my day...

  • @abhijeetpatil6634
    @abhijeetpatil6634 5 years ago

    Thanks Bhavesh, never stop making such videos

  • @powellmenezes584
    @powellmenezes584 5 years ago +2

    simple and easy - i appreciate you bro :) Subscribed and liked :P

  • @shreyachandra5175
    @shreyachandra5175 4 years ago +1

    Thank you! This was an excellent video and extremely helpful :)

  • @manaralassf2896
    @manaralassf2896 4 years ago +4

    Please, could you tell us why you applied undersampling to the whole dataset? I think we should implement this technique on the training set only, as we do with SMOTE.

  • @luismagana6347
    @luismagana6347 4 years ago

    Thanks, it's clear to me now. Good video.

  • @debatradas1597
    @debatradas1597 2 years ago

    Thank you so much Sir

  • @Lion9781
    @Lion9781 3 years ago

    Great video. Undersampling the entire data set, i.e. both train and test data, is a mistake though. Generally it should only be applied to the training set, otherwise the great performance will be misleading. Nonetheless, the code itself is nice.
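Several commenters raise this point; the usual remedy is to split first and resample only the training portion. A minimal sketch under that assumption (synthetic data and names of my own, not from the video):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=1000)})
y = pd.Series([0] * 900 + [1] * 100, name="target")

# Split BEFORE resampling so the test set keeps the real-world imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Undersample the majority class within the training set only
train = X_train.assign(target=y_train)
minority = train[train["target"] == 1]
majority = train[train["target"] == 0].sample(n=len(minority), random_state=0)
train_balanced = pd.concat([majority, minority])

print(train_balanced["target"].value_counts())  # balanced: 80 of each class
print(y_test.value_counts())                    # still imbalanced, as in production
```

Evaluating on the untouched, imbalanced test set is what keeps the reported metrics honest.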

  • @TejaDuggirala
    @TejaDuggirala 5 years ago

    Great work bro.. helped me a lot ! Thank you so much! Liked and subscribed :)

  • @hemantsah8567
    @hemantsah8567 3 years ago +1

    How will you perform sampling when you have a target feature with more than 2 categories?
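One common answer (a sketch of my own, not from the video) is to downsample every class to the size of the rarest one:

```python
import pandas as pd

# Tiny multi-class example: 4 of A, 3 of B, 2 of C
df = pd.DataFrame({
    "feature": range(9),
    "target": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
})

# Downsample each class to the size of the rarest class
min_count = df["target"].value_counts().min()
balanced = df.groupby("target").sample(n=min_count, random_state=0)

print(balanced["target"].value_counts())  # 2 of each class
```

The same idea generalizes to any number of classes, at the cost of discarding more data as the rarest class shrinks.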

  • @deutschvalley3574
    @deutschvalley3574 2 years ago

    Great explanation, sir. Kindly make videos on all the performance metrics and how we can get the best information from our model and data.

  • @sidgupta1957
    @sidgupta1957 3 years ago

    Thanks for the explanation. When undersampling, the output scores that we get would be inflated/deflated depending upon the majority class (what I mean is that if the dependent variable takes values 1 and 0 and the majority class is 0, then we will get inflated scores after the model is built). So how do we factor that in?

  • @muza6322
    @muza6322 2 years ago

    Thank you

  • @yasserothman4023
    @yasserothman4023 3 years ago

    Why apply the undersampling on the whole dataset not the training set only ?

  • @agnibhohomchowdhury
    @agnibhohomchowdhury 5 years ago

    Your videos are very simple and easy to understand... Love your work. Can you provide the code?

  • @sobinbabu984
    @sobinbabu984 4 years ago

    How can we apply SMOTE to a dataset containing categorical variables? Or should we apply one-hot encoding before SMOTE?

  • @user-dt8ei1wj2x
    @user-dt8ei1wj2x a year ago

    Hi there, I'm a little confused at 4:44. You have the imbalanced data and split it without `stratify`, but the model still fits well. When I apply this to my imbalanced data, where class 0 has 582689 samples and class 1 has 1296, it raises an error saying my X_train only got 1 class instead of 2. What can I do to solve this problem? I used `stratify` but it is still not working. Really appreciate it.
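For reference, `stratify` is a parameter of scikit-learn's `train_test_split`, not a separate method; a stratified split keeps the class ratio in both splits, so each split contains every class as long as each class has at least two samples. A small sketch with made-up counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Extreme imbalance similar to the commenter's: many 0s, very few 1s
y = np.array([0] * 5000 + [1] * 13)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 0/1 ratio in both train and test,
# guaranteeing that neither split ends up with a single class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_train), np.bincount(y_test))
```

If the error persists even with `stratify=y`, it is worth checking that `y` really contains both labels at that point in the pipeline (e.g. that an earlier filter has not dropped the minority rows).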

  • @jagritisehgal3867
    @jagritisehgal3867 3 years ago

    Thanks, nice work :)

  • @santanusarangi
    @santanusarangi 3 years ago

    Hello,
    Once we get the optimum threshold value, how do we reset the threshold?

  • @kinglovesudelhi
    @kinglovesudelhi 3 years ago

    Why can't we use Firth logistic regression, which penalizes the maximum likelihood?

  • @pratapdutta4
    @pratapdutta4 3 years ago

    So here we are splitting the data into test and train after undersampling?

  • @niranjanbehera4591
    @niranjanbehera4591 5 years ago

    good one

  • @vaibhavmishra2283
    @vaibhavmishra2283 2 years ago

    I think there is a mistake in this. The metric values came out that good because the test data was also balanced (as you performed undersampling on the entire dataset). This would lead us to misleading results, as we have never tested the imbalanced scenario, which unfortunately is the real case. We should perform under- or oversampling only on the training set and validate on the imbalanced dataset to make sure we get correct results.

  • @taskynrakhym1542
    @taskynrakhym1542 5 years ago

    Thanks Bro!!!!

  • @mr.techwhiz4407
    @mr.techwhiz4407 4 years ago

    Great video. Is this the same case if you use a Random Forest model?

  • @akashm103
    @akashm103 3 years ago

    Dude, I have a doubt: what about the training accuracy, does it go down? I'm training a model where, after oversampling, the testing accuracy went up but the training accuracy went down.

  • @radcyrus
    @radcyrus 5 years ago

    Thank you :-)

  • @halilibrahimozkan9799
    @halilibrahimozkan9799 3 years ago

    I have a question. I created a model. My data has 1s and 0s, with more 1s than 0s. I tried undersampling and oversampling. Undersampling gives lower accuracy than oversampling. Why is that?

  • @joseluismanzanares3662

    Hi Bhavesh Bhatt, just a question. I wonder if undersampling may be appropriate for my data set. The minority class is 8.4% of the data, with 6976 observations for the minority and 83687 for the majority. Any comments on this issue? Thanks

  • @karthicradha4834
    @karthicradha4834 4 years ago

    Very interesting, easy to understand, and easy to follow all the steps. By the way, I am facing issues with the code: while executing "generate_auc_roc_curve", it shows that the name auc is not defined.
    "plt.plot(fpr, tpr, label='AUC ROC Curve with area under the curve = ' + str(auc))"
    Could you please explain this line of code to me? Thanks

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago +1

      If you have followed the process as shown in the video, it shouldn't give you an error! If it's giving you an error, then you are a Google search away from the final solution!

  • @AbdullahQamer
    @AbdullahQamer 4 years ago

    Can anyone answer this question, please?
    A dataset with the following numbers of instances for three classes A, B, and C shall be balanced:
    A: 3100
    B: 3200
    C: 3600
    a) How many instances does the dataset have in total after balancing with undersampling?
    b) How many instances does the dataset have in total after balancing with oversampling?
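For what it's worth, under the simplest balancing scheme (plain random under-/oversampling, my own reasoning rather than anything from the video), every class is reduced to the smallest class size or grown to the largest:

```python
counts = {"A": 3100, "B": 3200, "C": 3600}

# a) undersampling: every class reduced to the smallest class size
under_total = min(counts.values()) * len(counts)
# b) oversampling: every class grown to the largest class size
over_total = max(counts.values()) * len(counts)

print(under_total, over_total)  # 9300 10800
```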

  • @deepikadusane9051
    @deepikadusane9051 4 years ago

    Hi, I have seen all your videos on imbalanced datasets, but which one should we prefer the most: oversampling, undersampling, or class weights?

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago

      It depends on your problem statement! Is your business OK with trusting synthetic data? Are you OK with losing data in the case of undersampling? So I can't give you a single answer!

  • @poojarani9860
    @poojarani9860 5 years ago

    Hi Bhavesh, I liked your video. I have a large amount of text data about some violations. I need to apply ML techniques to find the major key areas which are causing the violations. Can you guide me on how to proceed? The data I have is in Excel, and we can apply supervised machine learning. I have also manually created categories, for which I tried to apply supervised machine learning algorithms to predict the target variable. But my motive is not to find the target variable; my motive is to find the major key areas because of which the violations exist. When I created the categories, I found that around 90% of the data belongs to one category, which is causing class imbalance.

  • @Neerajpl7
    @Neerajpl7 5 years ago

    Good One 👌

  • @NaviVlogs76
    @NaviVlogs76 4 years ago

    Sir, what about 3 classes? How to handle them? It was really helpful.