
Undersampling for Handling Imbalanced Datasets | Python | Machine Learning

  • Added 4 May 2019
  • Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, as we have enough examples for all the different cases. For example, when learning a spam filter, we should have a good amount of data corresponding to both spam and non-spam emails.
    This even distribution is not always possible. I'll discuss one of the techniques, known as Undersampling, that helps us tackle this issue.
    Undersampling is one of the techniques used for handling class imbalance. In this technique, we undersample the majority class to match the minority class.
    If you have any questions about what we covered in this video, feel free to ask in the comment section below and I'll do my best to answer them.
    If you enjoy these tutorials and would like to support them, the easiest way is to simply like the video and give it a thumbs up; it's also a huge help to share these videos with anyone who you think would find them useful.
    Please consider clicking the SUBSCRIBE button to be notified of future videos, and thank you all for watching.
    You can find me on:
    GitHub - github.com/bha...
    Medium - / bhattbhavesh91
    #ClassImbalance #Undersampling #machinelearning #python #deeplearning #datascience #youtube
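The random undersampling described above can be sketched in a few lines of pandas (the synthetic data and column names below are illustrative, not the dataset used in the video):

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced dataset: 900 majority (0) vs 100 minority (1) rows
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})

minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

# Randomly undersample the majority class down to the minority class size
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

print(balanced["target"].value_counts())  # 100 rows of each class
```

The trade-off is that the dropped majority rows are simply thrown away, which is why undersampling is usually reserved for datasets where the majority class is large enough to spare them.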

Comments • 50

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago +2

    This is awesome. Please market more. The likes and comments don't justify the kind of work you're doing. It might happen that you stop making frequent videos for obvious reasons, but I'd like to tell you that I personally really liked your videos, and your teaching style is straightforward and lucid. Thanks

  • @angelsandemons
    @angelsandemons 2 years ago

    Amazing video, great teaching style. I struggled for hours and finally found this gem of a video. Thank you so much!!

  • @joseluismanzanares3662

    Great. Very useful. I'm just facing this issue with a target variable in a classification model for lung cancer. THANK YOU

  • @ruhinehri5607
    @ruhinehri5607 2 years ago

    Awesome explanation... I was really struggling to balance a dataset... This video made my day...

  • @abhijeetpatil6634
    @abhijeetpatil6634 5 years ago

    Thanks Bhavesh, never stop making such videos

  • @powellmenezes584
    @powellmenezes584 5 years ago +2

    simple and easy - i appreciate you bro :) Subscribed and liked :P

  • @shreyachandra5175
    @shreyachandra5175 4 years ago +1

    Thank you! This was an excellent video and extremely helpful :)

  • @manaralassf2896
    @manaralassf2896 4 years ago +4

    Please, could you tell us why you applied undersampling to the whole dataset? I think we should implement this technique on the training set only, as we do with SMOTE.

  • @luismagana6347
    @luismagana6347 4 years ago

    Thanks, it's clear to me now. Good video.

  • @debatradas1597
    @debatradas1597 2 years ago

    Thank you so much Sir

  • @Lion9781
    @Lion9781 3 years ago

    Great video. Undersampling the entire data set, i.e. both train and test data, is a mistake though. Generally it should only be applied to the training set, otherwise the great performance will be misleading. Nonetheless, the code itself is nice.
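Several commenters raise this point; the usual remedy is to split first and resample only the training portion. A minimal sketch under that assumption (synthetic data and names of my own, not from the video):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=1000)})
y = pd.Series([0] * 900 + [1] * 100, name="target")

# Split BEFORE resampling so the test set keeps the real-world imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Undersample the majority class within the training set only
train = X_train.assign(target=y_train)
minority = train[train["target"] == 1]
majority = train[train["target"] == 0].sample(n=len(minority), random_state=0)
train_balanced = pd.concat([majority, minority])

print(train_balanced["target"].value_counts())  # balanced: 80 of each class
print(y_test.value_counts())                    # still imbalanced, as in production
```

Evaluating on the untouched, imbalanced test set is what keeps the reported metrics honest.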

  • @TejaDuggirala
    @TejaDuggirala 5 years ago

    Great work bro.. helped me a lot ! Thank you so much! Liked and subscribed :)

  • @hemantsah8567
    @hemantsah8567 3 years ago +1

    How will you perform sampling when you have a target feature with more than 2 categories?
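One common answer (a sketch of my own, not from the video) is to downsample every class to the size of the rarest one:

```python
import pandas as pd

# Tiny multi-class example: 4 of A, 3 of B, 2 of C
df = pd.DataFrame({
    "feature": range(9),
    "target": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
})

# Downsample each class to the size of the rarest class
min_count = df["target"].value_counts().min()
balanced = df.groupby("target").sample(n=min_count, random_state=0)

print(balanced["target"].value_counts())  # 2 of each class
```

The same idea generalizes to any number of classes, at the cost of discarding more data as the rarest class shrinks.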

  • @deutschvalley3574
    @deutschvalley3574 2 years ago

    Great explanation, sir. Kindly make videos on all the performance metrics and how we can get the best information from our model and data.

  • @sidgupta1957
    @sidgupta1957 3 years ago

    Thanks for the explanation. When undersampling, the output scores that we get would be inflated/deflated depending upon the majority class (what I mean is that if the dependent variable takes values 1 and 0 and the majority class is 0, then we will get inflated scores after the model is built). So how do we factor that in?

  • @muza6322
    @muza6322 2 years ago

    Thank you

  • @yasserothman4023
    @yasserothman4023 3 years ago

    Why apply the undersampling on the whole dataset not the training set only ?

  • @agnibhohomchowdhury
    @agnibhohomchowdhury 5 years ago

    Your videos are very simple and easy to understand... Love your work. Can you provide the code?

  • @sobinbabu984
    @sobinbabu984 4 years ago

    How can we apply SMOTE to a dataset containing categorical variables? Or should we apply one-hot encoding before SMOTE?

  • @user-dt8ei1wj2x
    @user-dt8ei1wj2x a year ago

    Hi there, I'm a little confused at 4:44. You have the imbalanced data and split it without `stratify`, but the model still fits well. When I apply this to my imbalanced data, where class 0 has 582689 samples and class 1 has 1296, it raises an error saying my X_train only got 1 class instead of 2. What can I do to solve this problem? I used `stratify` but it is still not working. Really appreciate it.
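For reference, `stratify` is a parameter of scikit-learn's `train_test_split`, not a separate method; a stratified split keeps the class ratio in both splits, so each split contains every class as long as each class has at least two samples. A small sketch with made-up counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Extreme imbalance similar to the commenter's: many 0s, very few 1s
y = np.array([0] * 5000 + [1] * 13)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 0/1 ratio in both train and test,
# guaranteeing that neither split ends up with a single class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_train), np.bincount(y_test))
```

If the error persists even with `stratify=y`, it is worth checking that `y` really contains both labels at that point in the pipeline (e.g. that an earlier filter has not dropped the minority rows).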

  • @jagritisehgal3867
    @jagritisehgal3867 3 years ago

    Thanks, nice work :)

  • @santanusarangi
    @santanusarangi 3 years ago

    Hello,
    Once we get the optimum threshold value, how do we reset the threshold?

  • @kinglovesudelhi
    @kinglovesudelhi 3 years ago

    Why can't we use Firth logistic regression, which penalizes the maximum likelihood?

  • @pratapdutta4
    @pratapdutta4 3 years ago

    So here we are splitting the data into test and train after undersampling?

  • @niranjanbehera4591
    @niranjanbehera4591 5 years ago

    good one

  • @vaibhavmishra2283
    @vaibhavmishra2283 2 years ago

    I think there is a mistake in this. The metric values came out that good because the test data was also balanced (as you performed undersampling on the entire dataset). This would lead us to misleading results, as we have never tested the imbalanced scenario, which unfortunately is the real case. We should perform under- or oversampling only on the training set and validate on the imbalanced dataset to make sure we get correct results.

  • @taskynrakhym1542
    @taskynrakhym1542 5 years ago

    Thanks Bro!!!!

  • @mr.techwhiz4407
    @mr.techwhiz4407 4 years ago

    Great video. Is this the same case if you use a Random Forest model?

  • @akashm103
    @akashm103 3 years ago

    Dude, I have a doubt: what about the training accuracy, does it go down? I'm training a model where, after oversampling, the testing accuracy went up but the training accuracy went down.

  • @radcyrus
    @radcyrus 5 years ago

    Thank you :-)

  • @halilibrahimozkan9799
    @halilibrahimozkan9799 3 years ago

    I have a question. I created a model. My data has 1s and 0s, with more 1s than 0s. I tried undersampling and oversampling. Undersampling gives lower accuracy than oversampling. Why is that?

  • @joseluismanzanares3662

    Hi Bhavesh Bhatt, just a question. I wonder if undersampling may be appropriate for my data set. The minority class is 8.4% of the data, with 6976 observations for the minority and 83687 for the majority. Any comments on this issue? Thanks

  • @karthicradha4834
    @karthicradha4834 4 years ago

    Very interesting, easy to understand, and easy to follow all the steps. By the way, I am facing issues with the code: while executing "generate_auc_roc_curve", it shows that the name auc is not defined.
    "plt.plot(fpr, tpr, label='AUC ROC Curve with area under the curve = ' + str(auc))"
    Could you please explain this line of code to me? Thanks

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago +1

      If you have followed the process as shown in the video, it shouldn't give you an error! If it's giving you an error, then you are a Google search away from the final solution!

  • @AbdullahQamer
    @AbdullahQamer 4 years ago

    Can anyone answer this question, please?
    A dataset with the following numbers of instances for three classes A, B, and C shall be balanced:
    A: 3100
    B: 3200
    C: 3600
    a) How many instances does the dataset have in total after balancing with undersampling?
    b) How many instances does the dataset have in total after balancing with oversampling?
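For what it's worth, under the simplest balancing scheme (plain random under-/oversampling, my own reasoning rather than anything from the video), every class is reduced to the smallest class size or grown to the largest:

```python
counts = {"A": 3100, "B": 3200, "C": 3600}

# a) undersampling: every class reduced to the smallest class size
under_total = min(counts.values()) * len(counts)
# b) oversampling: every class grown to the largest class size
over_total = max(counts.values()) * len(counts)

print(under_total, over_total)  # 9300 10800
```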

  • @deepikadusane9051
    @deepikadusane9051 4 years ago

    Hi, I have seen all your videos on imbalanced datasets, but which one should we prefer the most: oversampling, undersampling, or class weights?

    • @bhattbhavesh91
      @bhattbhavesh91 4 years ago

      It depends on your problem statement! Is your business OK with trusting synthetic data? Are you OK with losing data in the case of undersampling? So I can't give you a single answer!

  • @poojarani9860
    @poojarani9860 5 years ago

    Hi Bhavesh, I liked your video. I have a large amount of text data about some violations. I need to apply ML techniques to find the major key areas which are causing the violations. Can you guide me on how to proceed? The data I have is in Excel, and we can apply supervised machine learning. I have also manually created categories, for which I tried to apply supervised machine learning algorithms to predict the target variable. But my motive is not to find the target variable; my motive is to find the major key areas because of which the violations exist. When I created the categories, I found that around 90% of the data belongs to one category, which is causing class imbalance.

  • @Neerajpl7
    @Neerajpl7 5 years ago

    Good One 👌

  • @NaviVlogs76
    @NaviVlogs76 4 years ago

    Sir, what about 3 classes? How to handle them? It was really helpful.