Handling Class Imbalance Problem in R: Improving Predictive Model Performance | Unbalanced Dataset

  • Added 3 Jul 2024
  • Provides steps for handling the class imbalance problem (unbalanced datasets) when developing classification and prediction models
    R file: github.com/bkrai/R-files-from...
    data: binary.csv available from github link above
    Timestamps:
    00:00 Introduction
    00:05 Admit Data
    01:37 What is the Class Imbalance Problem?
    03:26 Data Partition
    04:17 Data for Predictive Model
    05:28 Prediction Model - Random Forest
    06:16 Model Evaluation with Test Data, confusion matrix
    12:24 Oversampling for Better Sensitivity
    16:13 Undersampling
    18:19 Both Oversampling and Undersampling
    20:08 Synthetic sampling using Random Over-Sampling Examples (ROSE)
    Predictive models are important machine learning and statistical tools for analyzing big data and for working in the data science field.
    R is a free software environment for statistical computing and graphics, and is widely used by both academia and industry. R runs on both Windows and macOS. It was ranked no. 1 in a KDnuggets poll on top languages for analytics, data mining, and data science. RStudio is a user-friendly environment for R that has become popular.
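
    A minimal R sketch of the workflow covered above, assuming the binary.csv layout used in the video (admit, gre, gpa, rank); treat it as an outline rather than a copy of the video's exact code:

    library(ROSE)            # ovun.sample() and ROSE()
    library(randomForest)
    library(caret)           # confusionMatrix()

    data <- read.csv("binary.csv")
    data$admit <- factor(data$admit)

    # Data partition (70/30)
    set.seed(123)
    ind   <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
    train <- data[ind == 1, ]
    test  <- data[ind == 2, ]

    # Rebalance the training data only; test data stays untouched
    n0 <- as.numeric(table(train$admit)["0"])   # majority class count
    n1 <- as.numeric(table(train$admit)["1"])   # minority class count
    over  <- ovun.sample(admit ~ ., data = train, method = "over",  N = 2 * n0)$data
    under <- ovun.sample(admit ~ ., data = train, method = "under", N = 2 * n1)$data
    both  <- ovun.sample(admit ~ ., data = train, method = "both",  p = 0.5, N = nrow(train))$data
    rose  <- ROSE(admit ~ ., data = train, N = 500)$data            # synthetic data

    # Fit on a rebalanced set and evaluate on the original test data
    rf <- randomForest(admit ~ ., data = over)
    confusionMatrix(predict(rf, test), test$admit, positive = "1")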

Comments • 231

  • @subterrain5293
    @subterrain5293 Před 6 lety +4

    I like the way your lectures are so crisp. It gives a first-hand experience to those looking to learn these techniques by doing hands-on.

    • @bkrai
      @bkrai  Před 6 lety

      Thanks for the feedback!

  • @abhishek894
    @abhishek894 Před 2 lety +1

    Thank you Dr. Rai for sharing this video.

    • @bkrai
      @bkrai  Před 2 lety

      You are welcome!

  • @flamboyantperson5936
    @flamboyantperson5936 Před 6 lety +7

    Sir you have so many great videos it is increasing my knowledge everyday. Thank you so much. You are the best Professor of Statistics I have ever come across.

  • @ritwikbasu9837
    @ritwikbasu9837 Před rokem +2

    Too Good a Lecture. Thank You Dr. Rai

    • @bkrai
      @bkrai  Před rokem

      You're most welcome!

  • @manjunathjangama
    @manjunathjangama Před 6 lety +4

    Great explanation, sir. Explaining it down to the minute details with very simple explanations is an awesome feature you have. I'd appreciate it if you could continue this journey with more important topics.

    • @bkrai
      @bkrai  Před 6 lety +1

      Thanks for your feedback! You can find some useful playlists on the channel. Here is one example:
      czcams.com/play/PL34t5iLfZddu8M0jd7pjSVUjvjBOBdYZ1.html

  • @asthamalhotra2345
    @asthamalhotra2345 Před 4 lety +1

    Thank you 1 million times...learnt a lot in 30 mins

    • @bkrai
      @bkrai  Před 4 lety

      Thanks for comments!

  • @Guavarosa
    @Guavarosa Před 4 lety +1

    Quality videos! I appreciated! Very educative!

    • @bkrai
      @bkrai  Před 4 lety

      Thanks for comments!

  • @gabrielmurarideandrade5755

    Thanks a lot! You helped me in econometrics class. From Brasil.

    • @bkrai
      @bkrai  Před 2 lety

      You're welcome 😊

  • @statisticalworld1133
    @statisticalworld1133 Před 3 lety +1

    Thanks, Sir, for your valuable lectures. You indeed teach with practical examples in R. May you always be happy and live long.

    • @bkrai
      @bkrai  Před 3 lety +1

      You are most welcome!

  • @parasrai145
    @parasrai145 Před 6 lety +2

    This is very useful and very well explained 👍

    • @bkrai
      @bkrai  Před 6 lety

      Thanks for comments!

  • @sargamgupta7194
    @sargamgupta7194 Před 6 lety +1

    Towards the end.. I paused to see where the music is coming from... it started way too early..
    This video was the answer to all my questions!! explained so well.. Thank you

    • @bkrai
      @bkrai  Před 6 lety

      +Sargam Gupta 🙂

  • @asterIcaro
    @asterIcaro Před 5 lety +1

    Thank you very much for your amazing videos!!!

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for comments!

  • @adityapatnaik7078
    @adityapatnaik7078 Před 6 lety +3

    Excellent explanations!!!! plz make more videos on machine learning

    • @bkrai
      @bkrai  Před 6 lety

      Classification and Prediction with R - you can find some machine learning lecture videos from this link:
      Statistical & Machine Learning Methodologies: czcams.com/play/PL34t5iLfZddu8M0jd7pjSVUjvjBOBdYZ1.html

  • @abcdef-zb7qs
    @abcdef-zb7qs Před 5 lety +1

    Amazing Video!!!! Thanks sir really! It helped me a loooootttt

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for feedback!

  • @thejll
    @thejll Před rokem +1

    Thanks for the large font!

    • @bkrai
      @bkrai  Před rokem

      You are welcome!

  • @JackDaniels-ei1ds
    @JackDaniels-ei1ds Před 5 lety +1

    Your videos deserve tens of thousands of likes. Kudos for excellent material and popularizing my favorite programming language, R.

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for comments!

  • @Raja-tt4ll
    @Raja-tt4ll Před 5 lety +1

    Thanks, it is very helpful.

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for comments!

  • @artbyrhiamie8962
    @artbyrhiamie8962 Před 3 lety +1

    THANK YOU!

    • @bkrai
      @bkrai  Před 3 lety

      You're welcome!

  • @ashishsangwan5925
    @ashishsangwan5925 Před 5 lety +1

    Awesome Explanation

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for comments!

    • @ashishsangwan5925
      @ashishsangwan5925 Před 5 lety

      @@bkrai I do have one question. You apply the sampling technique (over, under, both, rose) only on the training data, build the model, and validate on the test data. Why do you not apply the sampling technique on the test data? Is there no need to balance the test data as well before validating the model on it?

    • @bkrai
      @bkrai  Před 5 lety

      Test data is like any new data that will be used for prediction. New data points are not likely to come balanced.

    • @ashishsangwan5925
      @ashishsangwan5925 Před 5 lety

      @@bkrai Thanks for your reply. I have read that SMOTE is also used to handle imbalanced data.
      I have the questions below; I would be thankful if you would reply.
      1. Do ROSE and SMOTE work similarly (I mean the internal calculation)? If not, which one is better?
      2. Which one of ROSE and SMOTE would you prefer?
      3. Do you have any video on SMOTE?

  • @gnavdeep1
    @gnavdeep1 Před 7 lety +4

    Excellent video Sir... keep them coming... can you do videos with examples of various functions in Caret especially with large datasets and prediction with xgboost, e1071 packages...thanks

    • @bkrai
      @bkrai  Před 6 lety

      Thanks for the suggestions!

  • @SaptarsiGoswami
    @SaptarsiGoswami Před 6 lety +4

    Thank you so much, Professor. Very lucidly explained, and you have kept the data and code available, which is so very useful. Wanted to know if there are other ways of handling imbalanced classes, such as cost-sensitive classifiers.

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @shaahin6818
    @shaahin6818 Před 6 lety +13

    I love your videos. Thanks. Just one point: ROSE and over/under sampling are two different approaches. The former is based on bootstrapping; the latter are more traditional. You used traditional approaches to the problem. Besides, a 30% success rate is not a "rare event". It would be better to use a dataset with a 5% or lower success rate.

    • @bkrai
      @bkrai  Před 6 lety +1

      Thanks for the feedback!

    • @ishimwejeanpaul490
      @ishimwejeanpaul490 Před 4 lety

      Can we say that we have an imbalanced data when success event is of 5% or lower rate?

  • @mearitutun
    @mearitutun Před 5 lety +1

    Awesome

  • @Viewfrommassada
    @Viewfrommassada Před 6 lety +2

    Prof Rai, your videos have been the best! Could you please do a video on XGBoost?

    • @bkrai
      @bkrai  Před 6 lety

      You can access it from this link:
      czcams.com/play/PL34t5iLfZddu8M0jd7pjSVUjvjBOBdYZ1.html

  • @kavyashree228
    @kavyashree228 Před 5 lety +1

    Best Video

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for comments!

  • @user-hi7ee5wu9z
    @user-hi7ee5wu9z Před 2 lety +1

    Hello sir, is the oversampling method in this video using the smote algorithm? If not, what is the difference between the two?

  • @Zukit3
    @Zukit3 Před 4 lety +1

    Hello Dr. Bharatendra. First, thank you for the explanation; your English is easy to understand. In this case, I don't know why the ROSE function doesn't work for me: when I run the line, for example, to oversample the training data, the variable 'over' is NULL (empty), but I can solve this with the caret package.

    • @bkrai
      @bkrai  Před 4 lety

      If caret works, that's fine too.

  • @dr.bheemsainik4316
    @dr.bheemsainik4316 Před 2 lety

    Thank you sir for one more valuable lecture. Sir, can we do, Random over-sampling (1:2), randomly selecting minority samples with replacement and adding them into the training data set with bootstrap?

  • @razorbrahman2133
    @razorbrahman2133 Před 6 lety +3

    Awesome video!
    If you could let me know how to implement the same when prediction model is a neural network, that would be great. Thank u.

    • @bkrai
      @bkrai  Před 6 lety

      For neural networks, you can use this link:
      czcams.com/video/-Vs9Vae2KI0/video.html

  • @niv2419
    @niv2419 Před 6 lety +2

    Thank you for another great video sir!
    Also, you mentioned under synthetic data that we can use ‘attributes’ to make sure that we don’t go outside the range (in GPA & Rank). Can you please touch upon these attributes?
    Looking forward to hearing from you!
    Thank you!

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @Adityasharma-zb7no
    @Adityasharma-zb7no Před 6 lety +6

    Hello Sir, thank you so much for such a nice video. JUst wanted to know, the step you used for synthetic data, that process is SMOTE only right?

    • @bkrai
      @bkrai  Před 6 lety +1

      ROSE and SMOTE work slightly differently. But both help to address class imbalance problem.

    • @Adityasharma-zb7no
      @Adityasharma-zb7no Před 6 lety +1

      Thanks Sir for your prompt action.

  • @bobdylan021911
    @bobdylan021911 Před 6 lety +5

    Thank-you for this video! I've watched a number of your videos and they make things so straightforward and easy to pick up. Is there any way to tweak this method for dealing with a factor with more than two levels - I'm looking at 9 different levels and keep on getting errors with the function shown in this video.

    • @bkrai
      @bkrai  Před 6 lety

      You can take subsets with 2 levels at a time where class imbalance is present and apply this method. And finally you can combine your data.
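
      For reference, a rough sketch of this "two levels at a time" idea (my own illustration, assuming a hypothetical 3-level response y with rare class "C"; not code from the video):

      library(ROSE)

      balance_pair <- function(df, keep_levels) {
        sub <- droplevels(df[df$y %in% keep_levels, ])
        ovun.sample(y ~ ., data = sub, method = "over",
                    N = 2 * max(table(sub$y)))$data
      }

      part_ac  <- balance_pair(train, c("A", "C"))              # rebalance C against A
      part_bc  <- balance_pair(train, c("B", "C"))              # rebalance C against B
      balanced <- rbind(part_ac, part_bc[part_bc$y == "B", ])   # recombine without duplicating A/C rows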

    • @VinayKumar-jf7pr
      @VinayKumar-jf7pr Před 6 lety

      could you please share the code if you sorted out this problem? I am looking at 5 different levels and I stuck to continue the project.
      avinaykumar03@gmail.com
      Thanks in Advance!

    • @biswajitdash3855
      @biswajitdash3855 Před 5 lety

      @@VinayKumar-jf7pr did you get an ans for this?

  • @debasishmishra1638
    @debasishmishra1638 Před 3 lety +1

    Again a beautiful explanation. Sir, I wish to ask: what if there is class imbalance in the validation data set but not the training set? For example, suppose we want to evaluate a model which we have developed using local responses to see whether it performs well globally or in another province, but we find the dataset distribution to be highly skewed, giving rise to a class imbalance problem, and when we apply our model it gives lower kappa values. What is the best way out? I was reading the caret package details, where it is advised not to use upsampling/downsampling on the validation data set (but we have to show the models work).

    • @bkrai
      @bkrai  Před 3 lety

      It is only used for training data. Validation data represents unseen data that the model has to deal with. So validation data should be kept as it is.

  • @sillytechy
    @sillytechy Před 7 lety +1

    So we increase the model's accuracy in predicting "1", as that was the question of interest. What about the predictors that have a larger influence on "admit"? How do we know which predictors are significant? Should we use logit regression for that?

    • @bkrai
      @bkrai  Před 7 lety

      That's correct, for statistical significance of predictors you can rely on the logit regression model.

  • @vishnukowndinya
    @vishnukowndinya Před 7 lety +2

    Thanks for the nice video, sir.
    Can we use this for a logit model as well, or only for random forest?
    One small doubt: in this dataset we have 70% of the data as "0" and 30% as "1", so we did ovun.sample and gained an increase in accuracy. What is the best proportion of 0s to 1s in order to get high accuracy? Is there any benchmark, like 50-50 or 40-60, for 0s/1s or yes/no?
    Is it OK if overall accuracy is reduced in order to increase sensitivity? I used N=400 instead of 376, which made 0=188 and 1=212, so sensitivity=0.63. Can I do it like this?

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @claudinaskate
    @claudinaskate Před 4 lety +1

    Hi, i want to ask you a question: can i use these methods only for constructing regression models and evaluating all the explanatory variables? thanks

  • @Adityasharma-zb7no
    @Adityasharma-zb7no Před 6 lety +3

    Hello Sir, very well explained, but i just wanted to know, in all the sampling method we got accuracy not more than 60%, will there not be any problem with our Model if we apply the same model using future data of the same dataset?

    • @bkrai
      @bkrai  Před 6 lety

      It can only help in improving overall accuracy to some extent. Doing oversampling or undersampling when there is significant class imbalance does not guarantee very high accuracy, because it totally depends on what data you are using.

  • @MrPraveen2305
    @MrPraveen2305 Před 4 lety +1

    Hi Sir,
    Wanted to check: if in a logistic regression problem both the dependent variable and an independent variable are dichotomous in nature and there is imbalanced data in both cases, then what is the best way to treat the imbalanced data present in both the IV and the DV?

    • @bkrai
      @bkrai  Před 4 lety

      DV may have imbalance because of IV. If you focus on IV, that should be enough.

  • @sudanmac4918
    @sudanmac4918 Před 4 lety +2

    Sir, nicely explained. I have a doubt: if we have more than two classes, as in multinomial regression, the ROSE algorithm does not work. How do we rectify that error?

    • @bkrai
      @bkrai  Před 4 lety +1

      With more than 2 classes, choose 2 of them that need improvement and apply the method.

  • @inspiritlashi9994
    @inspiritlashi9994 Před 2 lety

    Thank you so much for this tutorial.
    I followed these steps in my dataset.
    but at the end I got the same confusion matrix for train, under, over and both data, and my accuracy, kappa, sensitivity, specificity etc. are 1, while the McNemar's Test P-Value is NA.
    Sir, could you please help me to correct this?

  • @nothing8919
    @nothing8919 Před 3 lety

    Well thank you for the explanation
    It's the first time I've used this package, and I don't know what the difference is between using rose() to balance the data
    and using ovun.sample().
    What I'm looking for is to balance my data using ROSE from Menardi and Torelli.

  • @VinayKumar-jf7pr
    @VinayKumar-jf7pr Před 6 lety +2

    what if I have more than two categorical variables?
    I am getting this error when performing undersampling
    "Error in (function (formula, data, method, subset, na.action, N, p = 0.5, :
    The response variable must have 2 levels" please help me out. TIA

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @SandeepKumar-me6qr
    @SandeepKumar-me6qr Před 5 lety +1

    Thank you for the explanation sir.. In this data we have factors as 0 and 1. How to handle the imbalance if we have more than 2 factors in the data set?

    • @bkrai
      @bkrai  Před 5 lety

      You can do it two at a time and repeat.

  • @akashprabhakar6353
    @akashprabhakar6353 Před 4 lety +2

    Thanks for this awesome video sir. I have few doubts:
    1. How can we set some attributes to keep rank and gpa within the possible range in the synthetic data? How do we write that condition?
    2. What's the difference between both over- and undersampling together and the synthetic data we prepared at the end?
    3. Why have we used positive = '1' in the rf formula? In your previous videos I haven't seen such a thing.

    • @bkrai
      @bkrai  Před 3 lety

      1. If it is not in the algorithm, you can manually do so before developing a model.
      2. Together it does oversampling where the number of cases is smaller and undersampling where the number of cases is higher.
      3. That indicates what level of response we are more interested in.
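
      A small illustration (my own addition, not the video's code) of point 1, manually keeping ROSE's synthetic gpa and rank values inside their valid ranges before modeling:

      rose <- ROSE(admit ~ ., data = train, N = 500)$data
      rose$gpa  <- pmin(pmax(rose$gpa, 0), 4)            # keep gpa in [0, 4]
      rose$rank <- round(pmin(pmax(rose$rank, 1), 4))    # keep rank an integer in 1..4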

  • @dr.bhavinapatel5271
    @dr.bhavinapatel5271 Před 4 lety +1

    Great video sir, Is there any video regarding the multi-label classification with a validation dataset based on the training dataset model and apply only test dataset. thank you, sir.

    • @bkrai
      @bkrai  Před 4 lety +1

      Try this:
      czcams.com/play/PL34t5iLfZddvv-L5iFFpd_P1jy_7ElWMG.html

  • @lorenzwagner9288
    @lorenzwagner9288 Před 4 lety

    Thanks for the great video. How do I calculate F1 from those results?
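
    One way to get F1 from these results (a hedged sketch using caret and reusing the rf model and test data from the partition above; not something shown in the video):

    library(caret)
    cm <- confusionMatrix(predict(rf, test), test$admit,
                          positive = "1", mode = "everything")
    cm$byClass["F1"]                    # reported directly with mode = "everything"

    # or compute it from precision and recall
    prec <- cm$byClass["Precision"]
    rec  <- cm$byClass["Recall"]
    2 * prec * rec / (prec + rec)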

  • @netmarketer77
    @netmarketer77 Před 4 lety +1

    The RF model is built using the balanced training dataset, but the prediction uses the unbalanced test dataset? Should we balance the test dataset as well?

    • @bkrai
      @bkrai  Před 4 lety

      No we should not balance test data as the model is already built.

  • @netmarketer77
    @netmarketer77 Před 4 lety +1

    Let us suppose we have 205 instances for class 0 and we want to use the oversampling method, so the resulting oversampled data has 410 points across both classes. Is that acceptable, since the original data has only 400 instances? Thanks.

    • @bkrai
      @bkrai  Před 4 lety +1

      If you have 400 observations and 205 are class-0, you don't really have class imbalance problem.

  • @armaan4909
    @armaan4909 Před 5 lety +1

    Thanks a lot for giving me MOOC knowledge.
    Dear sir,
    I really love your way of teaching.
    Could you please send me the link to the music that you used in your video?

    • @bkrai
      @bkrai  Před 5 lety

      Here is the link:
      drive.google.com/open?id=1wOOjoEr3Y8QyoWS7V5X_9KQ2rrtezpDZ

  • @ondsport
    @ondsport Před 5 lety +1

    Hello sir, how is this method different from using cross-validation in the caret package?
    Do I still need to do this if I intend to generalize my predictive power using cross-validation?

    • @bkrai
      @bkrai  Před 5 lety

      Cross-validation helps with generalization. Addressing class imbalance helps with giving proper weight to each class of the categorical dependent variable. So they serve different purposes.
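
      The two can also be combined: caret's trainControl() has a sampling argument, so the rebalancing happens inside each cross-validation fold. A minimal sketch, assuming the same train data as in the outline above:

      library(caret)
      ctrl <- trainControl(method = "cv", number = 5,
                           sampling = "up")   # also "down", "rose", or "smote"
      set.seed(123)
      rf_cv <- train(admit ~ ., data = train, method = "rf", trControl = ctrl)
      rf_cv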

  • @nithinmamidala
    @nithinmamidala Před 5 lety +1

    Thank you, sir. it is very helpful... I need a binary dataset for practice. can you please upload.

    • @bkrai
      @bkrai  Před 5 lety

      Data file: goo.gl/D2Asm7

  • @manasarath4146
    @manasarath4146 Před 6 lety +4

    I believe this library can be used with most classifiers, such as logistic regression and SVM, and is not limited to random forest?

  • @neerajraut6473
    @neerajraut6473 Před 6 lety +2

    What if there are more than 2 levels in the dependent variable? How do we deal with class imbalance there?

    • @bkrai
      @bkrai  Před 6 lety +2

      You can take subsets with 2 levels at a time where class imbalance is present and apply this method. And finally you can combine your data.

  • @caamitjaiswal
    @caamitjaiswal Před 4 lety +1

    Hi sir, thanks. Really explained well, are there any formal courses on Analytics for finance professionals which you will recommend?

    • @bkrai
      @bkrai  Před 4 lety

      You can try this:
      czcams.com/play/PL34t5iLfZdduGEuSXYrleeBdvfQcak0Ov.html

  • @send2milan
    @send2milan Před 5 lety +1

    Sir,
    How can we use the technique for multi class classification ? Example : NSP data.

    • @bkrai
      @bkrai  Před 5 lety

      You can do it two at a time.

  • @biswajitdash3855
    @biswajitdash3855 Před 5 lety +1

    How to tackle data imbalance in multi level classification problem? Any links describing the same in R would be of great help! For ex, if data set is varied (target var: class1 ~ 100 samples, class2 ~ 1000 samples, class3 ~ 10000 samples, class4 ~ 20000 samples)

    • @bkrai
      @bkrai  Před 5 lety

      You can do two at a time.

  • @kevinm8607
    @kevinm8607 Před 6 lety +1

    can this model be used to make a model for recommendation system (collaborative filtering) many thanks in advance for your reply.

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @kapilrana1153
    @kapilrana1153 Před 2 lety +1

    Namaste!
    According to you which method do you think is the best ?

    • @bkrai
      @bkrai  Před 2 lety

      It depends on which method gives the best results.

  • @caamitjaiswal
    @caamitjaiswal Před 4 lety +1

    Hi sir, great videos; all the various R-related ones are helping me a lot. I need help on finance and fraud analytics. Please can you post some finance-domain-related courses?

    • @bkrai
      @bkrai  Před 4 lety

      Thanks for the suggestion, I've added it to my list.

  • @randulajayasinghe8237
    @randulajayasinghe8237 Před 4 lety +1

    Would you please do a video on how to do SMOTE using R

    • @bkrai
      @bkrai  Před 4 lety

      Thanks, I've added it to my list.

  • @qualitytoolbox4872
    @qualitytoolbox4872 Před 4 lety +1

    Hi, can I do a chi-square test for binary responses to see whether the two classes are uniformly distributed or skewed (imbalanced)? Thanks

    • @bkrai
      @bkrai  Před 4 lety

      You may try this:
      czcams.com/video/1RecjImtImY/video.html

  • @aishwarygupta6765
    @aishwarygupta6765 Před rokem +1

    Sir, I have a small doubt. what if we have a multinomial logit model, how do we partition the data then?

    • @bkrai
      @bkrai  Před rokem

      You can do two at a time.

  • @milindshende2525
    @milindshende2525 Před 3 lety +2

    Sir, can we use the ROSE / SMOTE method for a target variable with more than 2 classes? If yes, could you please suggest what other parameters we should use? I tried with the parameters mentioned in this video but get an error claiming the class is more than 2.

    • @bkrai
      @bkrai  Před 3 lety

      You can do it by doing 2 at a time.

    • @abeerharuray7147
      @abeerharuray7147 Před 3 lety +1

      @@bkrai Could you please explain it with a sample code to explain. We are predicting severity with levels 1

  • @sumaiyasande150
    @sumaiyasande150 Před 4 lety +1

    Hi Sir, what if we simply sample the no. of observations from majority class equal to minority class without disturbing minority class and without using ROSE package?

    • @bkrai
      @bkrai  Před 4 lety

      Yes that’s fine too.
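
      A minimal sketch of that manual undersampling (my own illustration, no ROSE needed):

      set.seed(123)
      minority <- train[train$admit == "1", ]
      majority <- train[train$admit == "0", ]
      under_manual <- rbind(minority,
                            majority[sample(nrow(majority), nrow(minority)), ])
      table(under_manual$admit)   # now balanced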

  • @sumaiyasande150
    @sumaiyasande150 Před 4 lety +1

    Sir, when we fit any model to the training data, does that already tune the model parameters or we have to tune them manually before testing it?

    • @bkrai
      @bkrai  Před 4 lety

      It depends on the model you are using. Random forest model doesn't need much tuning.

    • @sumaiyasande150
      @sumaiyasande150 Před 4 lety +1

      Thank you Sir

    • @bkrai
      @bkrai  Před 4 lety

      welcome!

  • @perotsystemsambattur
    @perotsystemsambattur Před 5 lety +1

    Sir, thank you so much for the video and for sharing your expertise. One question I have is: should SMOTE be performed before feature selection on the imbalanced data? Please answer.

    • @bkrai
      @bkrai  Před 5 lety

      I would say yes.

    • @perotsystemsambattur
      @perotsystemsambattur Před 5 lety

      @@bkrai thank you Sir

    • @perotsystemsambattur
      @perotsystemsambattur Před 5 lety

      Sir, I have a random forest model and the dataset is in a csv file. If I want to make a web interface and deploy it, what should I do? Is storing the dataset in a database like MySQL mandatory? I want users to give values through four textboxes/input boxes and get the predicted result in text form.

  • @akashprabhakar6353
    @akashprabhakar6353 Před 4 lety +1

    Thanks for this video sir!
    In your video "Logistic Regression with R: Categorical Response Variable at Two Levels (2018)" we converted rank also into a factor. After doing so, my accuracy comes out to be 1 in all cases of under-, over-, both and random sampling. Kindly clear my doubt: why didn't we convert rank into a factor in this video, and why, just by converting it into a factor, did we get an accuracy of 1?

    • @bkrai
      @bkrai  Před 4 lety

      Make sure you check accuracy based on test data. That is unlikely to be 1.

  • @luisenovikov1675
    @luisenovikov1675 Před 2 lety +1

    Hey,
    Thanks for your video. You explained things very well! As I am doing regression on a multiclass factor (starting point: a polr model), my question differs somewhat.
    My outcome is a 10-class (Likert scale) survey question ranging from 1 "something is never justified" to 10 "... is always justified", so ordered factor levels (1:10).
    The density plot shows that levels >5 have near-zero density. The previous literature (looking at the same survey question) recoded it to a 4-class factor (also ordered) by "collapsing" the levels.

    • @bkrai
      @bkrai  Před 2 lety

      Look at number of data points at each level and if you find some classes have very few data points, then it may be a good idea to group some categories.

    • @luisenovikov1675
      @luisenovikov1675 Před 2 lety

      ​@@bkrai
      Thanks a lot. I decided to recode the factor by collapsing the lowest categories (Option 1) and collapsing all the categories except the highest value (Option 2).
      But one further question:
      My predictors are also very imbalanced. F. e. I have got:
      - marital status (8 classes) where single or married are prevalent (~ 50%)
      -likert scale questions with 4-classes and with 10 classes where the „middle categories“ are relatively few.
      My idea was to code alternative factors for (Single/ Married/ living together); and for the likert scale :
      -Option 1: Collapsing them to balanced factors, or
      -Option 2: Taking them as numerical predictors;
      Next step would be to run models with the alternative options. Would you agree with such an approach? Or am I "cheating the data" if I recode my predictors in that way? Also, I am not sure if numerical predictors are useful in ordered logistic/probit regression.
      Thanks in advance! And I now watched the majority of your videos. They are very helpful :-)
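
      A small base-R illustration (hypothetical data, not from this thread) of the level-collapsing discussed above:

      y   <- factor(c(1, 1, 2, 3, 3, 5, 7, 10), levels = 1:10, ordered = TRUE)
      lab <- c("1", "2", "3", "4", "5", rep("6+", 5))                # group sparse levels 6..10
      y2  <- factor(lab[as.integer(y)], levels = unique(lab), ordered = TRUE)
      table(y2)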

  • @nimishapapineni2216
    @nimishapapineni2216 Před 4 lety +1

    Hello sir, I got stuck at the random forest step. It is showing me an error of this kind: "Error in randomForest .default(m,y,...)
    NA/NAN/Inf in foreign function call (arg 1)
    In addition: warning message:
    In data.matrix(X): NAs introduced by coercion"

    • @bkrai
      @bkrai  Před 4 lety +1

      Make sure you do not have missing values
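
      A quick way to check for that (my own addition; this error usually points to NA, NaN, or Inf values):

      colSums(is.na(train))                 # missing values per column
      train_clean <- na.omit(train)         # simplest fix: drop incomplete rows

      num_cols <- sapply(train, is.numeric)
      sapply(train[num_cols], function(x) sum(!is.finite(x)))   # flags NaN/Inf too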

    • @nimishapapineni2216
      @nimishapapineni2216 Před 4 lety

      And one more doubt: how does the p value affect the output in "both" sampling?

  • @younesgasmi8518
    @younesgasmi8518 Před 4 měsíci +1

    Thank you so much, bro. My question is: can I use undersampling techniques before splitting the dataset into training and testing?

    • @bkrai
      @bkrai  Před 4 měsíci

      Testing data should be similar to data expected when using the model. That's why we do it after splitting.

    • @younesgasmi8518
      @younesgasmi8518 Před 4 měsíci +1

      @@bkrai but I think it isn't really a big deal.. because it prevents data leakage..I think also the time series has this problem not classification ...and even there are papers where the authors have used undersampling before Splitting?

    • @bkrai
      @bkrai  Před 4 měsíci

      If that works for your data, then should be fine.

  • @sarasatipalawita
    @sarasatipalawita Před 5 lety +1

    Hello Sir, thank you for the video. It is really helpful! However, I still have a question. I tried to do undersampling and I got an error that the response variable must have 2 levels:
    --> "Error in (function (formula, data, method, subset, na.action, N, p = 0.5, : The response variable must have 2 levels."
    Does it mean it can only work with 2 response levels, or might I be doing something wrong?
    I have the same code as you have.
    under

    • @bkrai
      @bkrai  Před 5 lety

      If you have more than 2 levels, you can try 2 at a time.

    • @sarasatipalawita
      @sarasatipalawita Před 5 lety +1

      @@bkrai thank you for feedback sir! however, I still do not understand what you mean by trying 2 at a time.. could you be more specific? thank you!

  • @mohitagarwal8264
    @mohitagarwal8264 Před 6 lety +1

    Sir, I am using network attack data where I have 3 levels in the response variable, so when using the ovun.sample function I am getting the error that the response must have 2 levels. Please help me with this.

    • @shauryasingh2212
      @shauryasingh2212 Před 6 lety

      have you solved your dataset, what you used to solve imbalance in dataset of
      multiclass classification

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @alessandrorosati969
    @alessandrorosati969 Před rokem

    This command rose

  • @prudhviraj148
    @prudhviraj148 Před 5 lety +1

    Sir, should our accuracy be higher or lower? If our accuracy is higher, does that in turn affect sensitivity or specificity?
    Q2: We do oversampling or undersampling to make our "0" or "1" predictions more accurate, but then it will impact the other class. Is that not an issue?

    • @bkrai
      @bkrai  Před 5 lety

      For accuracy, higher is always better. But there may be situations where sensitivity or specificity is more important, and in those cases trying to improve them may lead to lower overall accuracy.

    • @prudhviraj148
      @prudhviraj148 Před 5 lety +1

      Thank you sir

  • @estadisticaparatodos6070

    Good video. Is it necessary to calibrate the probabilities? How can I do this? Thanks

    • @bkrai
      @bkrai  Před 5 lety

      Its not necessary.

  • @mohitagarwal8264
    @mohitagarwal8264 Před 5 lety +2

    Sir, this ROSE package is only valid when you have 2 outcomes (0, 1). What if I am facing a multi-class imbalance problem where there are 4 outcomes (0, 1, 2, 3)? How do we handle such imbalance? Can you share with us?

    • @biswajitdash3855
      @biswajitdash3855 Před 5 lety

      Hi Mohit. Did you find a way to handle data imbalance in multi level classification problem?

    • @bkrai
      @bkrai  Před rokem

      Sorry seeing this now. I hope you already figured out.

  • @akashprabhakar6353
    @akashprabhakar6353 Před 4 lety +1

    Sir, one more doubt: why have we made admit a factor and not rank? And when do we apply normalization and standardization?
    Which ML models do not require standardization and normalization? Kindly tell.

    • @bkrai
      @bkrai  Před 4 lety

      A student getting admission or not getting admission is not really a rank. It is a factor type of variable. Regarding normalization and standardization, you will see that they are addressed in each video where appropriate. You can refer to top 10 here:
      czcams.com/play/PL34t5iLfZddsQ0NzMFszGduj3jE8UFm4O.html

  • @netmarketer77
    @netmarketer77 Před 5 lety +1

    So, class imbalance problem should be treated only when we want to predict a class with less instances against a class with more instances? Whereas when we predict the class with more instances , this means we do not have class imbalance problem and we should continue with our prediction? Am I right?

    • @bkrai
      @bkrai  Před 5 lety +1

      No. It is still needed in the case you described.

    • @netmarketer77
      @netmarketer77 Před 5 lety

      @@bkrai But why when I apply over , under and rose sampling ,I get less accuracy and less sensitivity. There is improvement .

  • @nimishapapineni2216
    @nimishapapineni2216 Před 4 lety +1

    Sir, what is the reason behind taking train and test samples?

    • @bkrai
      @bkrai  Před 4 lety

      You may refer to following:
      czcams.com/video/EV5N-pIdvJo/video.html

    • @nimishapapineni2216
      @nimishapapineni2216 Před 4 lety

      Yes sir, I have watched it, but one more question: should we apply SMOTE to the data with the 70% (training) samples?

  • @adityaupadhyaya6441
    @adityaupadhyaya6441 Před rokem +1

    Sir please make a video on multinomial Mixed effects regression. I heartily request as I find no literature to my suitability on this. 🙏

    • @bkrai
      @bkrai  Před rokem +1

      Thanks, I've added it to my list of future videos.

    • @adityaupadhyaya6441
      @adityaupadhyaya6441 Před rokem +1

      @@bkrai thank you so much sir. Heartily awaiting it🙏

    • @bkrai
      @bkrai  Před rokem

      You are welcome!

  • @ramp2011
    @ramp2011 Před 7 lety +2

    Another awesome video. Can you please share the data file?

    • @bkrai
      @bkrai  Před 7 lety

      email id?

    • @bkrai
      @bkrai  Před 7 lety +1

      I've now added link below the video itself for downloading the file.

    • @ramp2011
      @ramp2011 Před 7 lety +1

      Thank you. Appreciate your help

  • @im_karamo1907
    @im_karamo1907 Před 5 lety +2

    Thank you for this video, brother. You use oversampling on the training data so that both classes have the same count (180), but this was only done on the training set, so what do you do to the testing set?
    I realise that you partition the data before employing the oversampling method and apply the oversampling on the training set, but what happens to the testing set? I think it will also be imbalanced. Please, I need an explanation here. Thank you.

    • @bkrai
      @bkrai  Před 5 lety

      It should only be done for training data because the prediction model is based on that. The model is not based on test data. We use test data only for assessing the model. When a final model is deployed in practice, it is likely to come across data similar to test data.

    • @im_karamo1907
      @im_karamo1907 Před 5 lety +1

      @@bkrai thanks so much for that great insight. I was thinking that because there was High class imbalance in the data set so if you do your partition there is high likely that the imbalance will affect in both the training and testing.. So if train data is over sample to avoid the imbalance then the testing data remains untouch meaning it is still imbalance. so meaning the predictions will be high in one class and low in the other.. I will implement this and see how it will go..
      Thumps up bro. Your Vids are aspiring.

    • @im_karamo1907
      @im_karamo1907 Před 5 lety +1

      I just did what you explained above, but my sensitivity was 96% and my specificity was just 20%. I realise that although the training set was oversampled, in the testing set there was still class imbalance, because oversampling was done after partitioning. So prediction on my unseen data (the test set) was really biased: more patients were predicted as having the disease when they already have the disease, but specificity was extremely poor. I really don't understand how to avoid this scenario on the testing data. Thanks again.

    • @im_karamo1907
      @im_karamo1907 Před 5 lety +2

      @@bkrai Thanks again. Yeah, we use test data to assess the model, but what if the test data is also highly imbalanced? Will this not affect our recall value, specificity, etc.?

    • @bkrai
      @bkrai  Před 5 lety +1

      Assessment has to be with actual data even if there is high imbalance.

  • @netmarketer77
    @netmarketer77 Před 4 lety +2

    Thanks for this video. How can we solve the class imbalance problem if we have a response variable with 3 classes? Thanks very much.

    • @bkrai
      @bkrai  Před 4 lety

      You can do it 2 at a time and select those 2 that have major imbalance problem.

    • @netmarketer77
      @netmarketer77 Před 4 lety +1

      Dr. Bharatendra Rai but my mission is to classify 3 classes. For example CTG dataset , all 3 classes have the same major imbalance. So how to balance them. ?

    • @bkrai
      @bkrai  Před 4 lety +1

      You can make 2 classes that have lower frequency to match with class-1.

    • @netmarketer77
      @netmarketer77 Před 4 lety +1

      @@bkrai Yes, I got you now, Sir. I can label classes 2 and 3 (with lower frequencies) as 1 and Normal as -1, so I will end up with a 2-class variable. Thanks again, Dr.

    • @bkrai
      @bkrai  Před 4 lety

      Thanks for the update!

  • @nitinchoudhary3549
    @nitinchoudhary3549 Před rokem +1

    Why did we choose N=500 in the ROSE() function?

    • @bkrai
      @bkrai  Před rokem

      It is artificially created data and I chose a round figure of 500.

    • @nitinchoudhary3549
      @nitinchoudhary3549 Před rokem

      @@bkrai we can choose according to our no of rows?

  • @alipaloda9571
    @alipaloda9571 Před 3 lety +1

    Error in as.data.frame.default(data) :
    cannot coerce class '"ovun.sample"' to a data.frame
    I got this error; how do I solve it?

    • @bkrai
      @bkrai  Před 3 lety

      Difficult to say much without looking at the code. Check your code.

    • @alipaloda9571
      @alipaloda9571 Před 3 lety +1

      @@bkrai Thank you for your response, sir. I had made one mistake and that was the reason I was getting that error. I watched your video carefully and resolved it. Thank you, sir.

    • @bkrai
      @bkrai  Před 3 lety

      Thanks for the update!

  • @popi20101
    @popi20101 Před 2 lety +1

    what if the data have 4 classes and imbalance?

    • @bkrai
      @bkrai  Před 2 lety

      You can try two at a time.

  • @sm.melbaraj1682
    @sm.melbaraj1682 Před 3 lety +1

    ROSE can only be used if the classification is binary?

    • @bkrai
      @bkrai  Před 3 lety

      For more than 2, you can do two at a time.

    • @sm.melbaraj1682
      @sm.melbaraj1682 Před 3 lety +1

      @@bkrai okay.. Thankyou

    • @bkrai
      @bkrai  Před 3 lety

      You are welcome!

  • @kevinm8607
    @kevinm8607 Před 6 lety +1

    Can we use ROSE when we have 100 classes? I think ROSE is only for 2 classes, in your case 0 and 1. How do we do oversampling when we have many imbalanced classes? Many thanks, Kevin

    • @bkrai
      @bkrai  Před 6 lety +1

      You can take subsets with 2 levels at a time where class imbalance is present and apply this method. And finally you can combine your data.

    • @kevinm8607
      @kevinm8607 Před 6 lety +1

      Many Thanks for your reply, could you mention code/example/link how to do that (Kevin.maz155@gmail.com)

    • @deprofundis3293
      @deprofundis3293 Před 3 lety +1

      @@kevinm8607 Hi Kevin, did you ever figure out the R code to use with the oversampling method when you have multiple imbalanced classes?

  • @kabeeradebayo9014
    @kabeeradebayo9014 Před 6 lety +4

    Thank You, Prof. for this video. I am trying to adapt this approach to my data set but I have been getting incorrect data type error as follows:
    " Error in terms.formula(formula, data = frml.env) :
    'data' argument is of the wrong type "
    when I run this:
    ovrf

    • @bkrai
      @bkrai  Před 6 lety

      what is t in data= t? Probably there is some error there.

    • @kabeeradebayo9014
      @kabeeradebayo9014 Před 6 lety

      Bharatendra Rai
      The t is my training data.

    • @kabeeradebayo9014
      @kabeeradebayo9014 Před 6 lety

      Bharatendra Rai
      I have tuned it repeatedly but nothing has changed. Thank you for helping.

    • @bkrai
      @bkrai  Před 6 lety

      I'm seeing this now. I think you need to use data after $ sign.

  • @taufikwanahmad3246
    @taufikwanahmad3246 Před 5 lety +1

    Sir, is this one called SMOTE?

    • @bkrai
      @bkrai  Před 5 lety

      They are slightly different. ROSE uses smoothed bootstrapping to draw artificial samples from the feature-space neighbourhood around the minority class. SMOTE, on the other hand, draws artificial samples by choosing points that lie on the line connecting a rare observation to one of its nearest neighbors in the feature space.
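
      A hedged side-by-side sketch (not from the video; SMOTE here comes from the smotefamily package, whose SMOTE() is assumed to take numeric predictors and a target vector, so check its documentation):

      library(ROSE)
      library(smotefamily)

      rose_out  <- ROSE(admit ~ ., data = train, N = 500)$data
      smote_out <- SMOTE(X = train[, c("gre", "gpa", "rank")],
                         target = train$admit, K = 5)$data   # label ends up in a "class" column
      table(rose_out$admit)
      table(smote_out$class)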

  • @yashp1995
    @yashp1995 Před 4 lety +1

    What if there is a class with 0 and 1 both

    • @bkrai
      @bkrai  Před 4 lety

      In the example provided it will not be feasible to have a student admitted as well as not admitted.

    • @yashp1995
      @yashp1995 Před 4 lety +1

      I have data where I have to classify gender based on websites visited, but there are websites which are visited by both males and females, meaning both 0 and 1.

    • @bkrai
      @bkrai  Před 4 lety +2

      If each row represents an instance of a website visited, then it can only be visited by one.

  • @dhanashreedeshpande7100
    @dhanashreedeshpande7100 Před 5 lety +1

    Nice. But video is not properly visible initially.

    • @bkrai
      @bkrai  Před 5 lety

      I just checked it, and everything looks fine. I think it may have something to do with internet speed at your end.

    • @dhanashreedeshpande7100
      @dhanashreedeshpande7100 Před 5 lety +1

      I mean It is blurred initially. Yes may be internet issues.

    • @bkrai
      @bkrai  Před 5 lety

      Thanks for letting me know.