Handling Imbalanced Datasets SMOTE Technique

DataMites

49,230 views

Added on 9 Feb 2020
CODE: github.com/ashokveda/youtube_...
DATA : github.com/ashokveda/youtube_...
/ ashokveda

Comments • 232

@donaloleary5514 3 years ago ⁺¹
Thank you, Ashok! This is an outstanding explanation of a complex subject. You make it all feel very intuitive. Awesome stuff - I will look for more DataMites videos in the future!
@DataMites 3 years ago
Hi Donal O'Leary, thanks for your comment. Keep visiting our channel for more updated content.
@pandharpurkar_ 3 years ago ⁺⁴
best teacher i have ever seen! Explaining in very proper way! in short time explaining exact things!!!
@DataMites 3 years ago ⁺¹
Thank you!
@akshiwakoti7851 4 years ago ⁺¹
A real pro! Subbed this channel after watching first 3 minutes. Glad to have found it.
@DataMites 3 years ago
Thank you so much.
@bhagwatchate7511 4 years ago ⁺²
Amazing in depth explanation! I was exactly searching for this type of explanation.. Thanks for sharing
@DataMites 3 years ago
Glad it was helpful!
@alisalariyan6676 3 years ago
The best smote tutorial I've seen. Thanks
@DataMites 3 years ago
Glad it was helpful!
@JainmiahSk 4 years ago ⁺⁷
Data Mites is a hidden gem now but soon they will be a Brand for Data Science. Keep my note for Future.
@DataMites 4 years ago ⁺¹
Thank you 😊
@SurajSingh-wn4wu 4 years ago ⁺¹
Great Ashok.!! Genuinely liked your way of explanation in depth and the solution... Glad i landed on your page...
Thank You..!
@DataMites 3 years ago
Thanks and welcome
@8sharkey8 3 years ago
Excellent content, brilliantly presented. Thank you. Subscribed.
@DataMites 3 years ago
Thanks and welcome
@siddhantagarwal274 4 years ago ⁺¹
Nicely explained. Thanks!
@DataMites 3 years ago
You're welcome!
@user-dn8uc5sc8l 6 months ago
Wow sir liked u r session .please continue posting such videos
@bhanukiran4317 3 years ago
Great content sir !! Keep on spreading knowledge
@DataMites 3 years ago
Thank you, Keep watching
@mohamedoutghratine8538 4 years ago
Amazing in depth explanation
@DataMites 3 years ago
Thank you!
@user-km4hl8lx8x 1 year ago ⁺¹
This is really helpful and thank you again!
@DataMites 1 year ago
Glad it was helpful! Keep Watching!
@inspiritlashi9994 2 years ago
Thank you so much for the great tutorial.. As someone who does not have even the basic knowledge of python, I could learn many things from you, sir.
@DataMites 2 years ago
Glad it was helpful!
@riorizki4211 3 years ago
Great video and explanation! Thanks.
@DataMites 3 years ago
You're welcome!
@osamaamir9311 1 year ago ⁺¹
Such an amazing topic
@DataMites 1 year ago
Thank You
@MLA263 1 year ago
Thanks Ashok, very clear and simple explanation.
@DataMites 1 year ago ⁺¹
Thank You
@binoypaul9772 3 years ago
Nice and informative. Please keep up the good work.
@DataMites 3 years ago
Thank you.
@tanvipataskar4597 4 years ago
Amazing Explanation!!! Thankyou.
@DataMites 3 years ago
You are welcome!
@ChrisHalden007 1 year ago ⁺¹
Great video. Thanks
@DataMites 1 year ago
Glad you like it! Keep Supporting
@defres15 2 years ago
Great video. Great explanation. Thank you
@DataMites 2 years ago
You are welcome!
@Cobra-bo1fy 2 years ago
excellent explanation!
@DataMites 2 years ago
Thank you.
@babukoshy 3 years ago
This was a great lesson. Thanks a lot
@DataMites 3 years ago
You're very welcome!
@manishbolbanda9872 3 years ago
wonderfully explained.thank you.
@DataMites 3 years ago
You are welcome!
@alishahsaber3795 2 years ago
Thank you so much!!! Really helpful. thanks
@DataMites 2 years ago
Glad it helped!
@milliekim5072 2 years ago ⁺¹
Thank you so much, sir! I hope I see more videos
@DataMites 2 years ago
Keep watching.
@jagannadhareddykalagotla624 2 years ago
DataMites is like hidden pattern in unsupervised learning thank you so much ashok❤️❤️
@DataMites 2 years ago
Thank you!
@AsiaMSaeed 2 years ago
Amazing. Thanks a lot.
@DataMites 2 years ago ⁺¹
You are welcome!
@lalithapriya9484 3 years ago
extreme clarification really superb teaching skills along with good communications
@DataMites 3 years ago
Hi lalitha priya, thank you for your comment.
@canancetin7897 3 years ago
Great video! Thanks a lot!!!
@DataMites 3 years ago
Glad you liked it!
@samhugh9891 3 years ago
great video, thank you!
@DataMites 3 years ago
You are welcome!
@dikshitlenka 3 years ago
Very clear explanation. Thanks
@DataMites 3 years ago
You are welcome!
@sabbirahmmed7161 2 years ago
Thanks, nice explanation
@DataMites 2 years ago
You are welcome
@ffckode 4 years ago
Thanks for sharing. Very helpful
@DataMites 3 years ago
Glad it was helpful!
@adeyinkasotunde6870 4 years ago ⁺¹
wow...... i am very well impressed. well explained. thanks
@DataMites 3 years ago
You are most welcome
@nehaurade4917 3 years ago
Perfect video..thank you
@DataMites 3 years ago
You are welcome!
@parsayadpa5446 2 years ago
thanks alot for this good tutorial.
@DataMites 2 years ago
You are welcome!
@MrMehshankhan 3 years ago
thank you so much man. great thumbs up...
@DataMites 3 years ago
You're welcome!
@b1k1m1 4 years ago ⁺²
Hello Sir, Thanks for explaining this very clearly.. keep it up....
@DataMites 3 years ago
You're most welcome
@aftabnaseem 3 years ago
Great job ....made it look very easy
@DataMites 3 years ago
Thank you 👍
@michaelpanashemudimbu7405 3 years ago
Awesome video
@DataMites 3 years ago
Glad you enjoyed it
@wenshanpan8726 3 years ago
Excellent!
@DataMites 3 years ago
Thank You!
@nasreenbanu2245 2 years ago
Hai sir! thanks a lot for very simple and clear explanation.keep going we expect more videos from you...
@DataMites 2 years ago
Keep watching
@dewipurnamasari5814 1 year ago ⁺²
Thank you very much
@DataMites 1 year ago
Most welcome! Keep Watching
@mozaffarhussain5496 4 years ago
Best Explanation sir ..............!
@DataMites 3 years ago
Keep watching
@ombb3576 2 years ago
Thank you for your sincere lecture sir
@DataMites 2 years ago
You are most welcome
@heenagirdher6443 3 years ago
Great tutorial. Very good explanation sir.
@DataMites 3 years ago
Glad you liked it
@jongcheulkim7284 2 years ago
Thank you so much. ^^
@DataMites 2 years ago
You're welcome 😊
@patrickbormann8103 3 years ago
Amazing!
@DataMites 3 years ago
Thanks!
@AMITSHARMA-fy4wv 3 years ago
Really appreciate sir..Lot off.🙏🏼🙏🏼🙏🏼🙏🏼🤗🤗🤗👌👌👌👌😊😊😊😊
@DataMites 3 years ago
Thank you!
@zakariaabderrahmanesadelao3048 4 years ago
what a crystal clear explanation. thank you.
@DataMites 3 years ago
You're very welcome!
@akshayjadhav2213 3 years ago
very nicely explained sir ..thank you
@DataMites 3 years ago ⁺¹
You are most welcome
@svitirur1665 3 years ago
very good explanation
@DataMites 3 years ago
Keep watching
@sasidharansathiyamoorthy6918 3 years ago
Thank you for the informative video! In this video, you have used SMOTE to rectify imbalance in target label. What methods can we use to deal with class imbalance in categorical features (input) in order to make the model more robust?
@DataMites 3 years ago ⁺¹
Hi Sasidharan Sathiyamoorthy, that imbalance is a property of the input, so balancing an input feature may change the distribution of the target variable. Build two models, one with and one without balancing, and compare their performance.
@RoyalRealReview 2 years ago
@@DataMites sir if we have 54% persons cancer patients and 46% non-cancer patients then do we need balancing? If yes then which balancing technique should be selected?
@sandeshbapu1567 4 years ago
Nicely explained
@DataMites 3 years ago
Thank you so much 🙂
@abhijitkamune3976 4 years ago
Nice explanation .. Looking for more NLP related video
@DataMites 3 years ago
Sure
@athilakshmir8589 3 years ago
nice explanation
@DataMites 3 years ago
Thank You!
@shivki23 4 years ago
subscribed for ur content
@DataMites 3 years ago
Thank you
@lavanyanayak8707 3 years ago
Thank you very much for this video. I have a precipitation dataset containing 4 columns and 8000 rows, each of them has a lot of zeros and only a few continuous values. I would like to know if I can use smote in this case?
@DataMites 3 years ago ⁺¹
Hi Lavanya Nayak, the GitHub link is provided in the description. Please check it out.
@mohan250s 1 year ago
ur awesome
@DataMites 1 year ago
Thank you.
@perusona_desu5534 1 year ago
in oversampling do you have to make the minority class instances equals the majority class instances ?
for example:
can it be 900 nc
and 800 c
@DataMites 1 year ago
Oversampling increases the number of minority-class samples, typically until they match the majority class. Undersampling reduces the number of majority-class samples, typically until they match the minority class.
@anaghadamame196 3 years ago
Thank you sir...👍
@anaghadamame196 3 years ago
Can you explain which algorithm should be selected for regression problem....it will help me alot
@DataMites 3 years ago
All the best
@insidiousmaximus 3 years ago
great video thank you. I am trying to figure out how to use this with a generator flowing from directory?
@DataMites 3 years ago
Hi insidiousmaximus, thanks for reaching us with your query. Can you please put your query more precisely so that we can help you?
@cliffordtarimo1511 3 years ago
Great video on SMOTE. Do you have a video on undersampling? Can someone perform both undersampling and oversampling in one line of code??? THANKS.
@DataMites 3 years ago
Another flavor of SMOTE is SMOTETomek, which combines SMOTE oversampling of the minority class with undersampling that removes Tomek links.
@chinedumjoseph9875 3 years ago
Oh! I got it. Don't worry. Thanks
@DataMites 3 years ago
You're welcome
@inspiritlashi9994 2 years ago
Hi, can I know how did you correct it? i got the same error message
@muhammedalisahan9661 1 year ago
Firstly, Thank you for sharing. I wanna ask something about time series. I have lots of data. But datas are different frequency. I wonder how deal with all datas. And assume that datas edited to same frequency. By the way datas are not fitted normal distribution so imbalanced that's why i am asking. If datas be same frequency, Smote can be appliable for time series? If not how to resample my time series?
@seeutube8860 1 year ago
Nice video.
After applying smote, balanced data was obtained. But balanced data (X_smote,y_smote) was not split (80:20) in to train n test data sets before reapplying classification model?
Is it necessary or not to split the data again? Or orginal dataset itself was considered as test dataset.
@DataMites 1 year ago
We split the data first and then balanced only the training set, so there is no need to split again. The test set keeps its original, imbalanced distribution.
@niswandi6122 1 year ago ⁺¹
Thank you ashok, clear explanation, but howto handle the imbalanced datasets if we have 4 classes?
@DataMites 1 year ago
For multiclass problems, the same technique is applied as for two classes.
@OriginalBernieBro 4 years ago
Running into a problem with sklearn 'support' column still looking unbalanced after smoting on print(classification_report(y_test, y_pred)) what gives?
@DataMites 3 years ago
The support is the number of samples of the true response that lie in that class. The report is computed on y_test, which is not resampled, so the support stays imbalanced.
@kurniawandk5078 2 years ago
Very informative, i have a question sir, it is possible to set how many synthetic data created by smote ? in example i want to set n_sample increase to 200% so, how to put this parameters in Python code ?
@DataMites 2 years ago
Your question is not clear. Can you elaborate plz?
@swastiknayak5173 3 years ago
At 8.15 you have said it is taking the average of centroids which is completely wrong. SMOTE is calculated over the feature space...it goes like this
1. we take the feature vector of the minority class point.
2. we calculate the distance between the neighbours (neighbours=5).
3. we multiply the distance between the neighbours with a random number that is created between 0 &1.
4. Then we create the synthesized point.
hope you got it 😀
@DataMites 3 years ago
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbours for that example are found (typically k=5). A randomly selected neighbour is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.
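The interpolation step described above can be sketched in plain NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_sketch` and the sample points are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating minority samples
    toward one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))       # random minority example
        neigh = rng.choice(neighbours[base])  # one of its k neighbours
        gap = rng.random()                    # random number in [0, 1)
        # New point on the line segment between the two examples.
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]])
new_points = smote_sketch(X_min, n_new=4)
print(new_points)
```

Every synthetic point lies on a segment between two real minority samples, so no averaging of centroids is involved.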
@dkandasamypandian719 3 years ago
Good
@DataMites 3 years ago
Thank You!
@ShubhamKumar-id6pf 3 years ago
Sir, I went on as per the recommended procedures but my jupyter environment is giving an AttributeError that SMOTE object has no attribute '_validate_data'.
Can you please help me with this.
@DataMites 3 years ago ⁺¹
You need to upgrade scikit-learn to version 0.23.1.
@kunalgoyal8529 4 years ago
While dividing training and test data shouldn't you be doing "stratify=y" ? To ensure test data and training data set have equal proportion of outcome variable?
@mr.techwhiz4407 3 years ago
that would be undersampling
@DataMites 3 years ago
The aim of a machine learning model is to generalize from the training set so that performance on unseen data is good. We don't control what the test data consists of; instead, we try to give the algorithm a more general pattern to learn.
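For reference, `stratify=y` in scikit-learn's `train_test_split` does not remove or add any samples, so it is not undersampling. It only keeps the class proportions the same in both splits. A small sketch with made-up labels (90 of class 0, 10 of class 1):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 90:10 proportion in both train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Without `stratify`, a small test set could by chance contain very few minority samples, which makes the minority-class metrics noisy.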
@younesgasmi8518 4 months ago
Thanks so much bro..i have shown some data scientists used undersampling and oversampling before Splitting the dataset into training and testing..in my research paper we have used NEARMISS technique to balance the dataset..i have got a good results with using cross validation Splitting and Extra tree classifier as model and also the same model to select the best importance features where my results are : (ACC 0.97 , F1 0.97 and AUC 0.99) are there results may be accepted for publishing?
@DataMites 4 months ago ⁺¹
You achieved good results. However, whether your results are acceptable for publishing depends on several other factors too.
@tahanics901 2 years ago
Very good explanation Thanks. but this code, is applicable with text data (tweets) or not?
@DataMites 2 years ago
Yes, after converting the text to numerical vectors. Then use fit_resample().
@rengarajramanujam6499 3 years ago
Good....
@DataMites 3 years ago
Thank You!
@hendripriyambowo1427 3 years ago
hi sir i have question how did we implement those resampling technique in neural network, let say if we implement embedding layer and work with multiple kind of data
is that resampling technique make our data losing such information?
@DataMites 3 years ago
You can use mini-batch SGD optimizer to handle imbalance dataset.
@ringgaershaikhwani3478 1 year ago ⁺¹
hello sir, the material that you explain is very easy to understand. I want to ask about my project. I have imbalanced data, then I do smote and I model it with KNN, but why after smote does the accuracy go down? 79% to 78%, is there something wrong with my data? Can you help explain this? I am very grateful if you respond to my comment.
@DataMites 1 year ago ⁺¹
With SMOTE, your model will start detecting more cases of the minority class, which increases recall but can decrease precision. That is because SMOTE gives more weight to the small class and biases the model toward it: the model now predicts the small class more accurately, but the overall accuracy may decrease. Accuracy is not a good measure of performance on imbalanced classes.
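The trade-off can be seen with a small worked example using scikit-learn's metric functions. Both sets of predictions below are hypothetical and hand-written; they are not the output of any model from the video.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test labels: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10

# Model A: always predicts the majority class.
pred_a = [0] * 100

# Model B: finds 8 of the 10 positives but raises 15 false alarms.
pred_b = [1] * 15 + [0] * 75 + [1] * 8 + [0] * 2

for name, pred in [("A (majority only)", pred_a), ("B (after balancing)", pred_b)]:
    acc = accuracy_score(y_true, pred)
    prec = precision_score(y_true, pred, zero_division=0)
    rec = recall_score(y_true, pred)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Model A has the higher accuracy (0.90 against 0.83) yet never finds a positive case, while model B reaches a recall of 0.80. This is why a small drop in accuracy after SMOTE is not necessarily a problem.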
@amruthakommu4695 2 years ago ⁺¹
Great Ashok. That was a well explained video. I tried the same thing on my data set but my accuracy came down from 94 to 86. What could be the cause?
@DataMites 2 years ago
Hi, we cannot comment until we look at your data and all the approaches that you have taken. One possibility is that your earlier model was overfitted.
@petersq5532 2 years ago
how split stratify solves the problem?
@chinedumjoseph9875 3 years ago
Thank you for this nice explanation. I was making progress with the codes but when I tried to fit using the command X_train_smote, y_train_smote = smote.fit_sample(X_train.astype('float'),y_train), I got error saying AttributeError: 'SMOTE' object has no attribute 'fit_sample'. I need urgent help please. Thank you
@DataMites 3 years ago
Hi Chinedum Joseph, can you please list the version of python and scikit learn in your system?
@ObaidoGeorge 1 year ago
Use smote.fit_resample instead of smote.fit_sample.
@AbdulLatif-fu9jz 1 year ago
@@ObaidoGeorge Tqvm for your help
@ishan7491 2 years ago
Can you please explain this part of the code in the label encoder section:
@DataMites 2 years ago
Hi Ishan, please reframe your query.
@faisalshehzad9504 4 years ago
thanks.
@DataMites 3 years ago
Welcome!
@rukaiyaa191 2 years ago
which module is used for alternative module of imblearn in python sir(for handling imbalance dataset)
@DataMites 2 years ago
For balancing the dataset by resampling, we use the imblearn module. But there are other ways to deal with an imbalanced dataset.
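One such other way, which needs only scikit-learn, is class weighting. The sketch below assumes the `class_weight="balanced"` option of `LogisticRegression`; the reply does not name a specific alternative, so this choice and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative assumption): about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Same model twice: without and with class weights.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"minority recall, plain:    {recall_plain:.2f}")
print(f"minority recall, weighted: {recall_weighted:.2f}")
```

Class weighting changes the loss instead of the data, so no synthetic samples are created.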
@abhimynampati2929 2 years ago
Hey Ashok, can u make a video on dsste algorithm for removing class imbalance?
@DataMites 2 years ago ⁺¹
Will do in future session.
@abhimynampati2929 2 years ago
@@DataMites awesome! Will be waiting.
@inspiritlashi9994 2 years ago
Sir,
Can I know how to run a logistic regression on the oversampled dataset?
@DataMites 2 years ago
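Hi Inspirit Lashi, logistic regression is a classifier, so you can fit it directly on the SMOTE-resampled training data. If you meant a regression problem with a continuous target, you can use SMOGN to preprocess your dataset. For more information: proceedings.mlr.press/v74/branco17a/branco17a.pdf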
@terryterry3733 3 years ago
Hi sir what is the data type for outcome ? i think it is in object . Did u convert that into float or int?
@DataMites 3 years ago
Hi Terry, thanks for reaching out to us with your query. The Outcome datatype is string, and we label-encoded it to an integer.
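Label encoding a string outcome can be sketched with scikit-learn's `LabelEncoder`. The "Yes"/"No" values below are hypothetical; the actual labels in the video's dataset are not shown in this thread.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string outcome column.
outcome = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
y = encoder.fit_transform(outcome)  # classes are sorted, then numbered from 0

print(list(encoder.classes_))                        # learned class order
print(y.tolist())                                    # integer codes
print(encoder.inverse_transform([1, 0]).tolist())    # back to the strings
```

Keeping the fitted encoder around makes it possible to map predictions back to the original labels.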
@JainmiahSk 4 years ago ⁺¹
you haven't encoded the target variable?
@DataMites 4 years ago
The target variable doesn't require encoding.
@patelajay1010 3 years ago
I have one doubt. What if data contains Nan values and you want to do under_sampling? If you impute Nan values with Mean() then there will be information leakage because we impute data before splitting it into train and test dataset. Could you please tell me what should be the possible solution in this case?
@DataMites 3 years ago ⁺¹
Hi Ajay Patel, if you have a large dataset, you can certainly drop the NaN values.
@patelajay1010 3 years ago
@@DataMites Sir I have continuous data coming from sensors. Dropping few rows will lead to break a pattern.
@DataMites 3 years ago
@@patelajay1010 In that case without knowing the source and significance of your nan value, we cannot comment on anything.
@patelajay1010 3 years ago
@@DataMites ok sir. Thank you for your response.
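On the leakage concern raised in this thread, one common pattern (not something the replies propose) is to split first and learn the imputation statistics from the training rows only. A minimal sketch with scikit-learn's `SimpleImputer` and made-up sensor readings:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up sensor readings with missing values.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0],
              [np.nan], [7.0], [8.0], [9.0], [10.0]])

# Split before imputing so test rows cannot influence the fill value.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # mean learned from training rows only
X_test_imp = imputer.transform(X_test)        # the same training mean is reused

print("fill value:", imputer.statistics_[0])
```

Any resampling would then be applied to `X_train_imp` only. For time-ordered sensor data a chronological split may be more appropriate than a random one; that choice is outside this sketch.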
@sushmithajanapati7785 2 years ago
Does Smote algorithm support Multi output classification?
@DataMites 2 years ago
Yes, you can use SMOTE.
@snehasamadder3790 1 year ago
after I resample an imbalance dataset how can I download the resampled dataset from colab?
@DataMites 1 year ago
Combine the resampled x and y and create a new dataframe, then convert that dataframe to a csv file using to_csv()
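A minimal sketch of those steps with pandas. The small frames below stand in for the output of `fit_resample`; the column names, values and the file name `resampled.csv` are hypothetical.

```python
import pandas as pd

# Stand-ins for the resampled features and target (hypothetical values).
X_smote = pd.DataFrame({"age": [25, 40, 33, 51],
                        "income": [30.0, 52.5, 41.0, 60.2]})
y_smote = pd.Series([0, 1, 0, 1], name="outcome")

# Combine features and target column-wise, then write to CSV.
resampled = pd.concat([X_smote, y_smote], axis=1)
resampled.to_csv("resampled.csv", index=False)

# In Google Colab the file can then be downloaded with:
# from google.colab import files
# files.download("resampled.csv")
```

If `fit_resample` was given NumPy arrays, wrap the results in `pd.DataFrame` and `pd.Series` with your own column names before concatenating.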
@HarishKumar-qj9pp 3 years ago
getting attribute error: 'SMOTE' object has no attribute 'fit_sample' but I have all the packages requirement satisfied still showing the error
@DataMites 3 years ago
Hi, please check imbalanced-learn.org/stable/over_sampling.html for any update in the imbalanced-learn package.
@sunnyarora4916 3 years ago
Any video where we use SMOTE for regression??
@DataMites 3 years ago ⁺¹
Hi Sunny Arora, you can use SMOGN for it. For more information: proceedings.mlr.press/v74/branco17a/branco17a.pdf
@sunnyarora4916 3 years ago
@@DataMites Thank you, is it less likely to use SMOGN?
@sanyajain2127 3 years ago
Getting an error: ValueError: Unknown label type: 'continuous-multioutput'
@DataMites 3 years ago
This usually means a classifier was given a target that is continuous and, here, has more than one column ('continuous-multioutput'). Check that y holds class labels in a single column.
@wajeehanaz9115 2 years ago
Hello Sir!
can you please tell me how to generate images using smote technique ???
Thanks in advance...
@DataMites 2 years ago
For image generation we have a different method called data augmentation, which creates new synthetic data from existing data.
@vivekuk4329 3 years ago
hi sir need to join in ur classes how to approach you
@DataMites 3 years ago
Hi Vivek uk, please share your email id and contact number. Our educational counselor will share the details. You can contact our counselor directly at 18003133434. For more info datamites.com/
@wajeehanaz9115 2 years ago
Thank you for informative video! I used your coding but got error "
ValueError: could not convert string to float: '5more'"...plz tell me how can I resolve this error...Thanks in advance:)
@DataMites 2 years ago
We have to look into your code. But please check if you have converted all the categorical values to numerical values in your dataset.
@RoyalRealReview 2 years ago
@@DataMites sir I am predicting heart disease and out of my sample 54% people have heart disease and rest 46% don't have so which method I should use for balancing?
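For the `'5more'` error earlier in this thread, the advice to convert categorical values to numbers can be sketched with `pandas.get_dummies`. The column names and values are hypothetical, and one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Hypothetical data with string categories such as '5more'.
df = pd.DataFrame({
    "doors": ["2", "4", "5more", "3"],
    "persons": ["2", "more", "4", "4"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# One-hot encode the categorical columns so every column is numeric.
encoded = pd.get_dummies(df, columns=["doors", "persons"], dtype=float)
print(encoded.columns.tolist())

# The whole frame now converts to a float array without errors.
X = encoded.to_numpy(dtype=float)
print(X.shape)
```

Plain SMOTE interpolates between rows, so one-hot columns can receive fractional values in the synthetic samples. imbalanced-learn also provides `SMOTENC` for datasets that mix categorical and continuous features.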
@oumaimasouid5229 3 years ago
i find this error >> plz help !
@DataMites 3 years ago
Hi, please use fit_resample
@datascientist2958 3 years ago
Sir how can we adjust ratio and what's behind it
@DataMites 3 years ago
If you are asking which class ratios of the target variable count as imbalanced, it can be 90:10, 80:20, or 70:30.
