Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)

codebasics

174,893 views

Published 23 Sep 2020
Credit card fraud detection, cancer prediction, and customer churn prediction are some examples where you might get an imbalanced dataset. Training a model on an imbalanced dataset requires certain adjustments; otherwise the model will not perform as you expect. In this video I discuss various techniques for handling imbalanced datasets in machine learning. I also have Python code that demonstrates these techniques. At the end there is an exercise for you to solve, along with a solution link.
Code: github.com/codebasics/deep-le...
Path for csv file: github.com/codebasics/deep-le...
Exercise: github.com/codebasics/deep-le...
Focal loss article: medium.com/analytics-vidhya/h....
#imbalanceddataset #imbalanceddatasetinmachinelearning #smotetechnique #deeplearning #imbalanceddatamachinelearning
Topics
00:00 Overview
00:01 Handle imbalance using undersampling
02:05 Oversampling (blind copy)
02:35 Oversampling (SMOTE)
03:00 Ensemble
03:39 Focal loss
04:47 Python coding starts
07:56 Code - undersampling
14:31 Code - oversampling (blind copy)
19:47 Code - oversampling (SMOTE)
24:26 Code - Ensemble
35:48 Exercise
Do you want to learn technology from me? Check codebasics.io/?... for my affordable video courses.
Previous video: • Dropout Regularization...
Deep learning playlist: • Deep Learning With Ten...
Machine learning playlist : • Machine Learning Tutor...
🌎 My Website For Video Courses: codebasics.io/?...
Need help building software or data analytics and AI solutions? My company www.atliq.com/ can help. Click on the Contact button on that website.
#️⃣ Social Media #️⃣
🔗 Discord: / discord
📸 Dhaval's Personal Instagram: / dhavalsays
📸 Instagram: / codebasicshub
🔊 Facebook: / codebasicshub
📝 Linkedin (Personal): / dhavalsays
📝 Linkedin (Codebasics): / codebasics
📱 Twitter: / codebasicshub
🔗 Patreon: www.patreon.com/codebasics?fa...
DISCLAIMER: All opinions expressed in this video are my own and not those of my employers.

Comments • 223

@codebasics 2 years ago ⁺⁴
Do you want to learn technology from me? Check codebasics.io/ for my affordable video courses.
@tugrulpinar16 3 years ago ⁺⁷⁵
For those watching recently: the SMOTE function is "fit_resample" now. Also, if you can't import imbalanced_learn properly, try restarting the kernel.
@krishnaprabeesh2415 2 years ago
Thank you
@sonalganvir8334 2 years ago
Will this work for a categorical response too?
@sanjaydubey8036 1 year ago ⁺¹
@Ma Aleem it means n_jobs = -1, i.e. use all your cores for processing
@bbom9197 1 year ago
Thank you
@iaconst4.0 3 months ago
Thanks, friend!!
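For readers wondering what a SMOTE-style oversampler does when "fit_resample" is called, here is a minimal pure-Python sketch of the core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not imbalanced-learn's implementation; the function name smote_sketch, its parameters, and the toy data are made up for this example.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features per sample (made-up numbers).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic minority samples")
```

Because every synthetic point is an interpolation, it always stays inside the region already covered by the minority class.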
@magdalenawielobob9464 3 years ago ⁺²²⁹
Hi. You should perform under/oversampling (including SMOTE) only on the training data, and measure F1 on the original data distribution (the test data). Moreover, if you split oversampled data with train_test_split, you have no control over how the duplicated items are distributed between test and train. That means the same observation can end up in both test and train, so you are partially testing on the training set; that's why the results increase. So first split into train/test, then perform the operations only on the training set, and leave the test set unchanged.
Still, it's a very good tutorial, it's nice that you share your knowledge!!
@charithaweerasooriya5941 2 years ago ⁺²
Yes, that's true.
@vineetkumarmishra2989 2 years ago ⁺⁵
Yeah, we should never touch the test set.
@MMSakho 2 years ago ⁺²
True... it might overfit, right?
@Stenkyedits 2 years ago
sad but true
@nithinmanjunath3909 2 years ago
@@MMSakho Yes you are right
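The order of operations suggested in this thread (split first, then oversample the training part only) can be sketched in plain Python as follows. It is a simplified illustration using blind-copy oversampling on row indices; the function name and the 90/10 toy labels are assumptions for this example and are not taken from the video's notebook.

```python
import random
from collections import Counter

def split_then_oversample(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx); only train_idx contains duplicates."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())

    # Duplicate random rows of each smaller class until it matches the largest.
    balanced_train = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced_train.extend(members + extra)
    return balanced_train, test_idx

labels = [0] * 90 + [1] * 10  # toy 90/10 imbalance
train_idx, test_idx = split_then_oversample(labels)
train_counts = Counter(labels[i] for i in train_idx)
print(train_counts, "test size:", len(test_idx))
```

The test indices are set aside before any duplication happens, so no copied row can appear on both sides of the split, and the test set keeps the original class distribution.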
@stanleypiche4705 2 years ago ⁺⁴
Thank you so much for sharing this interesting information about data transformation. I was training a neural network that gave an AUC of 0.85; after balancing the classes with SMOTE it reached 0.93 AUC. Obviously, the F1-score and accuracy also improved. Thanks!
@Rajdeepsharma1987 3 years ago
Thanks for providing us the path. Please keep doing the good work and don't get upset by fewer views; you are a true inspiration for all of us.
@odaithalji9603 10 months ago
The way you present the information is excellent. Thanks for sharing your knowledge; I'm happy to watch your videos.
@venkatesanr9455 3 years ago ⁺¹
Thanks a lot, codebasics, for all of your valuable and knowledgeable content.
@tjbwhitehea1 3 years ago ⁺⁵⁵
Hey codebasics, love this video series! I think there's a pretty big mistake in the oversampling though. You upsample, then do the train-test split. This means there will be overlapping samples in both the training and testing data, so the model will already have seen some of the data you are testing it on. I think you need to do your train-test split first, then do the upsampling on the training data only.
@shivi_was_never_here 2 months ago ⁺¹
Yup, that's true. My professor said you should always oversample after splitting the data, and undersample before. If you oversample before splitting the data, your model will be in danger of overfitting.
Yay, go me, commenting on a 3-year-old comment!
@muhammadariowinaya6753 1 month ago
I already applied SMOTE to the train set only, but the results of the neural network model are still bad. What should I do?
@muhammadariowinaya6753 1 month ago
@@shivi_was_never_here I already applied SMOTE to the train set only, but the results of the neural network model are still bad. What should I do?
@tchintchie 3 years ago ⁺⁵
I always learn something new watching your videos. Thank you 🙏🏻
@codebasics 3 years ago ⁺³
I'm so glad!
@manansharma4268 3 years ago ⁺⁷
Thank you very much for this video. This actually helps in solving real world scenarios.
@codebasics 3 years ago ⁺¹
:)
@CarolynPlican 2 years ago
Thank you. Very clear instruction and linked to ANN too, as I've only used this with supervised ML.
@ybbetter9948 2 years ago
Great presentation! I think I just needed SMOTE for my assignment but I liked how you explained every method.
@honeyBadger582 3 years ago
I was actually doing the churn modeling project and this video popped up! Thanks a lot :)
@codebasics 3 years ago
Glad I could help!
@gurkanyesilyurt4461 11 months ago
Thank you again Dhaval. I really appreciate your efforts!!
@shylashreedev2685 2 years ago
Hats off to you Dhaval. Loved your way of teaching and clearing my concepts, thank you so much.
@muhammadhollandi2586 3 years ago
Very helpful, your video makes everything easier, thousand thumbs up for you 👍👍
@codebasics 3 years ago
Glad it helped!
@farhodkalonov9370 3 years ago
Thank you so much, I appreciate your work.
@GuilhermeOliveira-se1th 3 years ago
You answered my question in only 4 minutes. Great! Thank you!
@codebasics 3 years ago
Happy to help!
@RoyalRealReview 2 years ago
@@codebasics if we have a class ratio of 54% to 46%, do we need balancing?
@twinkazz 2 years ago
Great content, thanks. Nice and entertaining at times.
@behrozjurayev5702 2 years ago
🤩 love your tutorials brother
@yogeshbharadwaj6200 3 years ago
Only in this video it looks like your patience was out of your control, sir... huhaaaa... but still quality content delivery and great explanation... Thanks a lot, sir...
@sandiproy330 1 year ago
Wonderful video. Great effort. Thank you.
@codebasics 1 year ago
Glad you enjoyed it!
@sararamadan1907 3 years ago
Great explanation
@AlgoTradeArchitect 1 year ago
Thank you for sharing.
@fahadreda3060 3 years ago
Great video as usual, sir. Wish you more success.
@codebasics 3 years ago ⁺¹
So nice of you. I hope you are doing well, my friend Fahad.
@paulowiz 3 months ago
So fun, the laugh at 22:31, hehe. Really cool video!
@johnmasalu8703 3 years ago
Very useful and fruitful, big up
@codebasics 3 years ago
Glad it was helpful!
@vanajagokul5937 1 year ago
Thank you so much. It was very informative.
@codebasics 1 year ago
Glad it was helpful!
@josebordon46 3 years ago ⁺³
Thanks for the great content. For the ensemble method, could we use a random sample of the majority class (n = minority class length)? Then we could create more models for the vote.
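The idea in this comment can be sketched as follows: draw several random, minority-sized samples of the majority class, pair each with the full minority class to form one balanced training batch per model, and combine the models' predictions by majority vote. The helper names and toy data below are hypothetical, and model training itself is left out; only the batching and voting logic is shown.

```python
import random
from collections import Counter

def random_balanced_batches(majority, minority, n_models, seed=0):
    """Build one balanced batch per model: a random majority sample of
    minority size plus all minority rows (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + list(minority)
            for _ in range(n_models)]

def majority_vote(predictions_per_model):
    """Combine equal-length 0/1 prediction lists, one list per model.
    Use an odd number of models so binary votes cannot tie."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

# Toy rows: 50 majority ("stay") and 10 minority ("churn") customers.
majority = [("stay", i) for i in range(50)]
minority = [("churn", i) for i in range(10)]
batches = random_balanced_batches(majority, minority, n_models=5)

# Made-up predictions of three models on the same three test rows.
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
print(len(batches), "batches; voted labels:", votes)
```

Unlike splitting the majority class into fixed consecutive chunks, random sampling allows any number of models, at the cost of some majority rows being reused across batches and others never being used. All models must predict on the same test rows for the vote to be meaningful.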
@JACKBLACK-jt8nw 1 year ago
Excellent approach, very helpful.
@muhammadbasilkhan1829 6 months ago
Thanks for these good videos, they are very helpful for me.
@spicytuna08 2 years ago
awesome. cannot thank you enough
@aditya_01 2 years ago
The video is really helpful. Thanks for sharing.
@codebasics 2 years ago ⁺¹
Glad it was helpful!
@riazrahman7147 2 years ago
Thank you so much.
@raj-nq8ke 2 years ago
Perfect explanation
@codebasics 2 years ago
Glad you think so!
@emmanouilmorfiadakis118 1 year ago
JUST THE BEST
@mitalikatoch9404 3 years ago ⁺¹
Hey, great video.
Can you also make a video on how to handle class overlap (in imbalanced binary classification, too)?
Thank you
@siddharthkulkarni409 2 years ago ⁺³
I think we should first apply the train-test split and then over/undersample the train data.
@anuppudasaini6302 2 years ago
Good experiments with different methods! How about autoencoder methods? You encode and decode all the good data (customers staying, per your example) within a DNN and calculate its reconstruction error. Now you run the customer-leaving data through your model. If the error on the customer-leaving data is not within the reconstruction error (from your staying data), then you have detected an anomaly. What do you think?
@turalkerimov4022 3 years ago
Best Teacher!!!!!
@codebasics 3 years ago
👍😃
@asiastoriesmedia519 23 days ago
Thanks!
@harshalbhoir8986 1 year ago
Thank you sir
@user-gv8fb8xi2l 2 years ago
Great video!
I'll thank you with a subscription.
@vinodkinoni4863 3 years ago
You are an awesome teacher, please stay with us, long live.
@codebasics 3 years ago
Thanks for your kind wishes, Vinod.
@halafazel2745 1 year ago
awesome
@raom2127 2 years ago
Nice tutorial on this topic, excellent teaching... Could you please post topics on supervised learning and unsupervised learning separately, so we can learn them in sequence?
@mprasad3661 3 years ago
Great explanation bro
@codebasics 3 years ago
Glad you liked it
@avisimkin1719 1 year ago ⁺²
Nice video, pretty clear. I think there are 2 things missing though:
1) Doing the under/oversampling only on the training data
2) You could also have chosen a different operating point (taking a different threshold instead of np.round(y_pred)), or just used the AUC measure without rounding at all, which could have been more indicative
PS: SMOTE doesn't actually give any lift in the AUC measure. You could just as well have adjusted the threshold to y_pred>0.35 or something like that and gotten better F1 scores.
@maor940 1 year ago
True. Good points!
@MaximityL 1 year ago ⁺¹
My thoughts exactly. Nice!
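The threshold point raised in this thread can be illustrated with a small pure-Python sketch. Rounding predicted probabilities amounts to a cut-off of about 0.5; the code below computes the minority-class F1 score at several cut-offs instead. The probabilities and labels are made-up toy values, not results from the video, and in practice the threshold would be chosen on a validation set rather than on the test set.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score of the positive class when predicting 1 for prob > threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: the model is under-confident on the rare positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.15, 0.25, 0.40, 0.45, 0.80, 0.60]

thresholds = [0.2, 0.35, 0.5, 0.65]
for t in thresholds:
    print(t, round(f1_at_threshold(y_true, y_prob, t), 3))
best = max(thresholds, key=lambda t: f1_at_threshold(y_true, y_prob, t))
print("best threshold on this toy data:", best)
```

On this toy data a cut-off of 0.35 recovers the two positives that a 0.5 cut-off misses, which is the effect the comment describes; whether the same holds for a real model has to be checked on its own validation data.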
@amins6695 3 years ago
Amazing video. One question: what if I use under/oversampling and accuracy or precision decreases?
Single or combined under/oversampling methods let us use the features for further methods, for example training multiple weak learners and then using ensemble methods. Is that also possible with ensemble resampling methods?
@fattahmuhammadtahabi945 1 year ago
Really helpful. Could you please tell whether the oversampling strategy is okay if we do cross-validation instead of a train-test split?
@ubannadan-ekeh7781 3 years ago ⁺¹
This is very insightful... thank you.
Could you please do a video on click-through rate prediction?
@codebasics 3 years ago
sure
@flaviobrienza6081 1 year ago
In my opinion the SMOTE part is not wrong, but it is tricky. Using SMOTE on the entire dataset will certainly make performance on X_test much better, since the model is predicting values it has already seen. If you instead split your data before applying SMOTE, you can see that performance improves, but not by much; it will not reach 0.8 if it was 0.47 without SMOTE. The X_test in the video could probably be interpreted as X_validation, and the testing data should be imported from other sources, or the dataset should be divided into training and test at the beginning, like on Kaggle.
@emanal-harbi2004 2 years ago
Thanks, amazing illustration. Do these methods work with multi-class labels (meaning the label column may contain over 10 labels)?
@Nikki-jf5ep 3 years ago
Thank you..
@ashishdewangan485 2 years ago ⁺²
Hi @codebasics. I find your tutorial series very informative and interesting. I am learning a lot from your videos.
I have a doubt about the ensemble technique. While voting you are taking votes from three different predictions. But those predictions are not for the same data set. Is a voting ensemble valid in such cases?
@afeezlawal5167 2 years ago
Same thought.
Voting isn't ideal
@sksahungpindia 3 years ago ⁺⁵
Sir, is there any better method than SMOTE for class imbalance? If yes, please guide me... I am a research scholar (doing a Ph.D.) at a top-30 NIRF-ranked institute. My area of research is classification problems in machine learning, including dealing with imbalanced datasets. Thank you
@sandiproy330 1 year ago
Tremendous respect sir, I love your tutorial. I sincerely follow your tutorials and practice all the exercises that you provide. However, I went through some comments on this video lecture and found that people are suggesting to oversample/SMOTE the training sample only, and not to disturb the test sample (which I too believe is quite apparent, as this avoids duplicate or redundant entries across the training and test datasets). Hence, I separated the train and test datasets first, then applied the oversampling/SMOTE technique to the training dataset only. Unfortunately, the precision, recall, and F1-score are not increasing for the minority class. This is quite logical though. What I understood is that duplicate entries of the same sample in both the train and test datasets were the reason for that huge increase in minority-class precision, recall, and F1-score in your case.
@sandiproy330 1 year ago
This happened when I tried the second exercise, the bank customer churn prediction problem. Oversampling/SMOTE on the train data gives around 0.51, 0.63, and 0.56 for precision, recall, and F1-score. When I follow your method for the bank customer churn problem, the figures are 0.77, 0.90, and 0.83 respectively.
@annperera6352 3 years ago
Hello Sir. I was looking everywhere for the class imbalance problem. Thanks a lot for this video. Do you have any videos on implementing rule-based classification?
@roshanpeter9904 3 years ago
In the ensemble method code, is it okay to split the data into batches first and then apply the train_split and train it for each, and then take the majority?
@dakshbhatnagar 1 year ago
Hey Dhaval. Great video, however I have a question. Will using the class_weight parameter in TensorFlow and assigning the values based on the occurrence of the classes create any sort of bias towards some classes? Can class_weight be helpful for handling the imbalance without doing sampling of any kind?
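As background for this question: Keras' Model.fit accepts a class_weight dictionary that maps class indices to loss weights, so errors on the rare class can be made to count more without resampling. A common way to derive the weights is the inverse-frequency ("balanced") heuristic, n_samples / (n_classes * class_count), which is sketched below in plain Python. The helper name and the 900/100 toy labels are assumptions for illustration, and the sketch does not settle whether weighting beats resampling on any particular dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Toy labels: 900 majority (0) and 100 minority (1) samples.
weights = balanced_class_weights([0] * 900 + [1] * 100)
print(weights)
# A dictionary like this is what would be passed as class_weight to fit().
```

With these weights each class contributes the same total weight to the loss (900 × 0.556 ≈ 100 × 5.0), which is the sense in which the weighting "balances" the classes.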
@towhidsarwar1915 3 years ago ⁺¹
Sir, I am following your deep learning playlist. Please make a video on cross-validation with Keras for neural networks.
@codebasics 3 years ago ⁺²
sure
@abhaygodbole9194 2 years ago
Hello Dhaval,
Very nice explanation. Does SMOTE work for highly imbalanced data? I have a dataset where one class has less than 1% representation in the distribution.
Please clarify
@tallandenglish 1 year ago
Great stuff, but there's an error, I believe. At 31:07, in the ensemble method, you've used the function 'get_train_batch' to get X_train and y_train, but you're not redefining X_test and y_test.
@anshi6205 3 years ago
Thank you so much🌈🌈
@codebasics 3 years ago
You’re welcome 😊
@sergiochavezlazo5362 1 year ago
Hi! Why not directly use train_test_split with the stratify argument? Thank you!
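For context on this question: scikit-learn's train_test_split has a stratify argument that keeps the class proportions the same in the train and test parts. A pure-Python sketch of that behaviour is shown below; the function name and the 80/20 toy labels are made up for this example. Note that stratifying preserves the imbalance on both sides of the split rather than removing it, so it complements resampling of the training set instead of replacing it.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

labels = [0] * 80 + [1] * 20  # toy data with 20% minority class
train_idx, test_idx = stratified_split(labels)
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```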
@peterjohngerero150 10 months ago
It's a great tutorial! But I have a comment on the evaluation part. You applied resampling first, before splitting the data, so it's possible that data leaks from the training set into the test set, right? That's why it has an equal prediction score. It's good practice to split the dataset first and then resample only the training set. Hope this helps. Thanks
@pandharpurkar_ 3 years ago
Do we need to balance the original X, y datasets or only the training sets X_train, y_train?
@yogeshwarshendye4857 3 years ago ⁺¹
Sir, can I use the methods used in this tutorial for training my image classification model or should I use augmentation for that purpose?
@fonyuyborislami8034 3 years ago ⁺¹
I think for image classification, augmentation is a better approach.
@rohanpatnaik7348 6 months ago
00:00 Overview
00:01 Handle imbalance using undersampling
02:05 Oversampling (blind copy)
02:35 Oversampling (SMOTE)
03:00 Ensemble
03:39 Focal loss
04:47 Python coding starts
07:56 Code - undersampling
14:31 Code - oversampling (blind copy)
19:47 Code - oversampling (SMOTE)
24:26 Code - Ensemble
35:48 Exercise
@nurulfadillah1248 18 days ago
Undersampling 7:34
Oversampling 15:04
@abhishekbhardwaj7783 2 years ago
Hi Sir, can you please tell me how to augment data (not image data) for regression problems?
@sagarhm2237 3 years ago
Sir, after training in deep learning (on images, for example), is the data still required once training is done?
@sakalagamingyt3563 18 days ago ⁺¹
31:40 the ANN function is using the same old X_test and y_test. I think that's why the accuracy is so bad.
@devarakondahimaja8423 2 years ago
Sir, can you please also cover the ADASYN sampling technique and other sampling techniques, and the differences between SMOTE and ADASYN?
@harperjmusic 3 years ago ⁺²⁵
Don't you want to apply SMOTE just to the training data, and leave the test data untouched?
@lorizhuka6938 3 years ago ⁺³
True. SMOTE must be applied after the train-test split.
@MrMadmaggot 1 year ago
@@lorizhuka6938 What about the others? Oversampling, for instance.
@soumyadev100 2 years ago
It seems we should not calculate accuracy on the train sample; for oversampling it is pretty obvious that precision and recall will improve. We need to test the accuracy on a test sample where we have not artificially increased or decreased the number of samples.
@saikatroy3818 3 years ago
I think a get_test_batch() method is also required, in the same way.
@dipankarrahuldey6249 2 years ago ⁺²
I think there's also a risk of overfitting the model when using SMOTE, as the synthetic data points might look like the (unseen) test data points.
@MMSakho 2 years ago
That's true. Especially if the data is text.
@MrMadmaggot 1 year ago
@@MMSakho Did anyone manage to find out if that's true?
@soheilsaffary3671 2 years ago
Are there similar methods for balancing datasets with continuous targets?
@mayankseth1235 3 years ago
Do you have a video on imbalanced data for text classification problems? Please suggest.
@subhamsekharpradhan297 3 years ago
Sir when I ran the code I got this error:
AttributeError: module 'matplotlib' has no attribute 'get_data_path'
What can be done about it?
@omeryalcn5797 3 years ago
Thanks for sharing, but I think there is a problem with the test metric. You use processed data for training (oversampling etc., which is okay), but you cannot use the same preprocessed data for testing, because in a real setting you do not know the test data's targets, so you cannot apply imbalance techniques to it. You should first separate the data, apply the imbalance handling only to the training data, and test on unprocessed test data.
@rohitkulkarni9038 4 months ago
Which is the best method: sampling before splitting the dataset or after splitting the dataset?
@ariouathanane 3 years ago
Is it possible to use SMOTE for multi-class text classification, please?
@sksahungpindia 3 years ago ⁺¹
Sir, please clear my doubt. In method 2, i.e. oversampling, when we use the train_test_split method, the precision, recall and F1-score values do not look realistic, because my test data is not unique (training data is already in the test data because of oversampling). Please clarify? Thanks
@piyushdandagawhal8843 3 years ago
True, when you oversample there is a good chance that there will be data leakage. It would be helpful to split the data first and then oversample the train data, to avoid any influence on the result.
@sksahungpindia 3 years ago
@@piyushdandagawhal8843 Thank you Piyush. Please suggest some research directions on handling imbalanced datasets in machine learning and deep learning. I am a full-time research scholar, so your suggestions mean a lot to me.
Thank you
@tirthpatel3491 2 years ago
Thanks for sharing it. I am wondering how we can treat imbalanced time-series datasets? Can all the techniques mentioned in the video be applied to time-series data?
@naveenkumarmangal9653 2 years ago
In general, it depends on the type of data. Most imbalanced time-series datasets can be handled using the SMOTE approach or a combination of SMOTE with ENN/Tomek.
@prachi6160 1 year ago
Can we use a variational autoencoder for synthetic data generation for the minority class?
@bheemeshg4823 2 years ago
After balancing the dataset, may I know what values can be placed in that place?
@Nick-tt9lh 2 years ago
Do we need to check for imbalance in an unsupervised learning or clustering problem? If yes, why and how?
@aniljhurani8289 11 months ago
Very interesting, amazing video... At 22:34, when using the SMOTE method, smote.fit_sample(X,y) is now smote.fit_resample(X,y).
@taiconley3195 3 years ago ⁺¹
These videos are great, thank you very much! I have a follow-up question, which is not discussed here. You would expect precision, recall, and F1 scores to improve with these methods; however, it is somewhat artificial because we are applying your methods without withholding a validation dataset that hasn't been resampled (only test and train). To state it another way, how would we expect these 'improved models' to work in a production environment, where new data isn't oversampled?
@akshitmiglani5419 2 years ago
+1
@AmarPalSingh-tn3sh 1 year ago
Does this approach work for more than 2 categories in the target variable?
@siddharthsingh2369 2 years ago ⁺²
Could someone elaborate a little bit on how exactly the data is overlapping? I see many people saying to first split the data and then sample it. Will it work, given that in this video we separate class 0 and 1 well in advance and then combine the data? I am going through many comments on this issue and having a hard time figuring this out.
@MrMadmaggot 1 year ago
Did you manage to figure it out?
@DJ-jf4qg 1 year ago ⁺²
In oversampling the minority class by duplication:
if we duplicate the minority class, then both classes will have equal samples.
After that we use train_test_split, which randomly selects samples.
The problem is that those duplicate samples will be present in the training samples as well as the testing samples, thus increasing precision, F1 score and all of those.
That is the overlapping.
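The overlap described in this explanation can be made visible with a small simulation. The sketch below oversamples a toy minority class by duplication, splits afterwards, and counts how many test rows also have a copy in the training set. Rows are represented only by their original row id; the function name and the 90/10 sizes are assumptions for this illustration, not the video's data.

```python
import random

def oversample_then_split(n_majority=90, n_minority=10,
                          test_fraction=0.2, seed=0):
    """Oversample by duplication first, split second; count leaked test rows."""
    rng = random.Random(seed)
    # Each row is identified by its original id; duplicates share the id.
    majority_ids = list(range(n_majority))
    minority_ids = list(range(n_majority, n_majority + n_minority))
    upsampled_minority = [rng.choice(minority_ids) for _ in range(n_majority)]

    all_rows = majority_ids + upsampled_minority
    rng.shuffle(all_rows)
    n_test = int(len(all_rows) * test_fraction)
    test_rows, train_rows = all_rows[:n_test], all_rows[n_test:]

    # A test row has leaked if an identical copy sits in the training set.
    train_set = set(train_rows)
    leaked = sum(1 for row in test_rows if row in train_set)
    return leaked, len(test_rows)

leaked, n_test = oversample_then_split()
print(f"{leaked} of {n_test} test rows also appear in the training set")
```

Only minority rows can leak here, because the majority rows are unique. That is why the inflated scores show up mainly in the minority-class precision, recall and F1.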
@NguyenNhan-yg4cb 3 years ago ⁺¹
You look so sleepy bro, just make sure you stay alert to deal with any troubles, just kidding man lol. Best wishes for your country.
@shrinidhi4643 3 years ago
Please make a video on abstractive dialogue summarization!! The same problem of imbalanced datasets occurs there...
@codebasics 3 years ago
sure
@ariouathanane 3 years ago ⁺¹
Please, can someone explain to me why in this example (in the video) the accuracy and loss change so frequently? Is this overfitting?
@otsogileonalepelo9610 2 years ago ⁺¹
I also had a similar observation in all videos in this series
@karangadgil9847 2 years ago
Can we use SMOTE while working with an audio dataset?
@aomo5293 9 months ago
Is it the same process in multi-label classification?
@sagarhm2237 3 years ago
After training is finished, is the data still required?
@patelajay1010 3 years ago
I have one doubt. What if the data contains NaN values and you want to do undersampling? If you impute NaN values with the mean, there will be information leakage, because we impute the data before splitting it into train and test datasets. Could you please tell me what a possible solution would be in this case?
@randomdude79404 3 years ago
There are a multitude of different ways to impute NaN values; you don't always have to impute with the mean. You can even disregard them if they comprise a small portion of the dataset. Also, I think what can be taken from this video is that SMOTE is the go-to technique when we have an imbalanced dataset.
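One common way to avoid the leakage raised in the question is to split first, compute the fill value from the training part only, and reuse that same value for the test part. The sketch below shows this for a single numeric column in plain Python; missing values are represented as None for simplicity (an assumption of this example; with pandas they would be NaN), and the helper names and numbers are made up.

```python
def fit_mean(values):
    """Mean of the observed (non-missing) values in the training column."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def impute(values, fill):
    """Replace missing entries with a fill value computed elsewhere."""
    return [fill if v is None else v for v in values]

# The column is already split into train and test before any imputation.
train_col = [10.0, None, 30.0, 20.0]
test_col = [None, 100.0]

train_mean = fit_mean(train_col)              # uses training rows only
train_filled = impute(train_col, train_mean)
test_filled = impute(test_col, train_mean)    # test reuses the train statistic
print(train_mean, train_filled, test_filled)
```

The test value 100.0 never influences the fill value, so nothing about the test set flows into training. Any resampling would then be applied to the imputed training rows afterwards.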
