To downsample or not? Handling class imbalance in bird feeder observations

  • Added 17 Jan 2023
  • Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders, both with and without downsampling, to find out. Check out the code on my blog: juliasilge.com...

Comments • 23

  • @wouldntyaliktono 1 year ago +11

    One way I like to think about this question of downsampling is whether it alters the bias term of my model. Rebalancing the data will force the model to assume that the global average probability of SQUIRREL is 50%, but that isn't the case in the empirical data. And that can affect how successful my models are when they're deployed to production.

    • @JuliaSilge 1 year ago +2

      Love this!

    • @natarajanlalgudi 1 year ago

      Downsampling will have an impact in production because it affects the model's ability to generalize to unseen data. A weighted loss function could actually yield far lower variance, and far better performance on unseen data outside of the training and validation process.

    • @JuliaSilge 1 year ago

      @natarajanlalgudi In tidymodels, a related approach is tuning with a custom cost function for classification:
      yardstick.tidymodels.org/reference/classification_cost.html
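
The bias-term point that opens this thread can be made concrete with a little arithmetic. The sketch below is plain Python, not tidymodels code, and is not taken from the video. It assumes the rare class is left untouched and the common class is randomly downsampled, keeping a fraction `keep_fraction` of its rows (a hypothetical name). Under that assumption the odds of the rare class are inflated by `1 / keep_fraction`, so a prediction from the rebalanced model can be mapped back to the original class balance. For logistic regression this amounts to adding `log(keep_fraction)` to the intercept.

```python
import math


def correct_downsampled_probability(p_sampled: float, keep_fraction: float) -> float:
    """Map a rare-class probability from a model trained on downsampled data
    back to the original class balance.

    p_sampled: probability of the rare class predicted by the rebalanced model.
    keep_fraction: fraction of common-class rows kept when downsampling
                   (hypothetical parameter name, for illustration).
    """
    return keep_fraction * p_sampled / (keep_fraction * p_sampled + 1.0 - p_sampled)


def intercept_shift(keep_fraction: float) -> float:
    """Shift to add to a logistic regression intercept on the log-odds scale."""
    return math.log(keep_fraction)


# Assumed example: a 4:1 imbalance, so the rare class is 20% of the data and
# balancing to 1:1 keeps one quarter of the common-class rows.
keep = 0.25
p_balanced = 0.5  # the rebalanced model's "average" prediction
p_original = correct_downsampled_probability(p_balanced, keep)
print(round(p_original, 3))  # 0.2
```

A 50% prediction from the rebalanced model corresponds to 20% in the original data, which is the gap the comment warns about when such a model is deployed without correction.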

  • @xxXXCarbon6XXxx 1 year ago +2

    I love squirrels, they are so cute so I could never be a hater. We were in Washington at the Vietnam memorial wall & my brother-in-law offered a squirrel a piece of banana. It bit his finger and I laughed so hard (yes they may have rabies!). Adorable.

  • @alexandroskatsiferis 1 year ago +1

    Nice demonstration of the complexity of imbalanced classes. An issue with choosing specificity, sensitivity, and similar metrics is that they all depend on the decision threshold (in this case 0.5), which further complicates decision making.
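
To see the threshold dependence described in this comment, here is a minimal sketch in plain Python. The labels and predicted probabilities are made up for illustration and have nothing to do with the bird feeder data. The hypothetical helper `sens_spec` treats any score at or above the threshold as a positive prediction.

```python
def sens_spec(truth, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    pairs = list(zip(truth, scores))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    tn = sum(1 for t, s in pairs if t == 0 and s < threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels (1 = event) and predicted probabilities, for illustration only.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(truth, scores, threshold)
    print(f"threshold={threshold:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The same predictions trade sensitivity against specificity as the threshold moves, so a comparison at 0.5 alone can favor a different model than a comparison at another cutoff. Metrics computed across all thresholds, such as ROC AUC, avoid committing to a single one.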

  • @CaribouDataScience 1 year ago

    Thanks for sharing!!

  • @517127 1 year ago

    Excellent work. I learn a lot from your videos.

  • @cuysaurus 1 year ago

    You look awesome, Julia.

  • @yangyang6008 1 year ago +1

    Hi Julia, how do we define class imbalance? In the example, there are 4 times as many "squirrels" as "no squirrels". If there were only 1.5 times as many "squirrels" as "no squirrels", would that still be called imbalance?

    • @JuliaSilge 1 year ago

      I think anything other than perfect balance (i.e. the categories are equal) is imbalance, but in typical modeling projects you don't start having problems until you have proportions like 5-to-1 or 10-to-1.

    • @yangyang6008 1 year ago

      @JuliaSilge Thank you for your help Julia!

    • @natarajanlalgudi 1 year ago

      @JuliaSilge I'm guessing 4:1 is on the borderline of "serious imbalance". Some learners may do better with resampling or penalties, and others may not.

  • @joshuapooley8993 1 year ago

    I am not sure if @ijessup is into data science, but if she were then this would be the video for her. #Gary

  • @shauryamehta5339 1 year ago

    Hi, I have a question: if I use two different models in my workflow set, each with two different preprocessing specifications, how many models will be computed in total? For example, say I want to fit one regularized regression model and one tree-based model, each both without downsampling and with downsampling. Will 4 models be computed in total, two for regularized regression and two for, say, random forest?
    Thanks

    • @JuliaSilge 1 year ago +1

      If I'm understanding you correctly, it sounds like you will have 4 models (logistic regression + downsampling, logistic regression without, tree-based + downsampling, tree-based without). When you decide to compare them, they will be fit to your resamples. If you have 10 folds, then you will fit 40 models to understand which will be the right one for you.
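
The count in this reply can be checked with a short sketch in plain Python (not tidymodels code). It assumes no hyperparameter tuning; a tuning grid would multiply the number of fits again by the number of candidate settings. The labels are illustrative.

```python
from itertools import product

models = ["logistic regression", "tree-based"]
preprocessors = ["with downsampling", "without downsampling"]
n_folds = 10  # assumed, matching the 10-fold example in the reply

# Every model is paired with every preprocessing option.
workflows = list(product(models, preprocessors))
n_fits = len(workflows) * n_folds

for model, prep in workflows:
    print(f"{model}, {prep}")
print(f"{len(workflows)} workflows x {n_folds} folds = {n_fits} fits")
```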

  • @ismaelmontero4811 1 year ago

    Hi Julia, thank you very much for your videos. I have a question. I have a dataset that only has nominal variables transformed as factors (it's a classification problem), however, when I try to use your code, I get an error:
    error: Some columns are non-numeric. The data cannot be converted to numeric matrix: 'ICode_Weather', 'ICode_Gender', 'ICategory_Age', 'iCode_Accident_Category', 'ICategory_Vehicle', 'ICategory_Time', 'BDrugs', 'BAlcohol', 'Week_Day', 'IZone'.
    There were issues with some computations A: x1
    Can you give some advice? Thank you very much.

    • @JuliaSilge 1 year ago

      You'll want to convert those to dummy or indicator variables using `step_dummy()`. Read more about this here:
      recipes.tidymodels.org/articles/Dummies.html
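
For readers unfamiliar with dummy variables, here is a rough sketch of the idea in plain Python. It is not the recipes implementation and is not from the video. The hypothetical helper `dummy_encode` assumes the first level in sorted order is the reference level, so a column with k levels becomes k - 1 numeric 0/1 columns. The day values are invented.

```python
def dummy_encode(values, column_name):
    """Turn a list of category labels into 0/1 indicator columns.

    The first level in sorted order is treated as the reference level and gets
    no column of its own, so k levels yield k - 1 indicator columns.
    """
    levels = sorted(set(values))
    indicator_levels = levels[1:]  # drop the reference level
    return {
        f"{column_name}_{level}": [1 if v == level else 0 for v in values]
        for level in indicator_levels
    }


# Hypothetical values for one of the columns named in the error message.
week_day = ["Mon", "Tue", "Mon", "Sun"]
encoded = dummy_encode(week_day, "Week_Day")
for name, column in encoded.items():
    print(name, column)
```

After this kind of encoding every predictor is numeric, which is what a model that requires a numeric matrix needs.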

    • @ismaelmontero4811 1 year ago

      @JuliaSilge Thank you for the information you shared, it was helpful. Do you know of any ways I could obtain the marginal effects?

    • @JuliaSilge 1 year ago

      @ismaelmontero4811 Many of the typical methods for getting marginal effects will work just fine. Here is an example of generating partial dependence profiles: www.tmwr.org/explain.html#building-global-explanations-from-local-explanations

  • @yangyang6008 1 year ago +1

    Hi Julia, thank you for the amazing tutorial! I wonder if it is possible to include Extreme Learning Machines in Tidymodels? Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural network (SLFN), which converges much faster than traditional methods and yields promising performance. The algorithm is currently included in the R package "elmNNRcpp" and "ELMR". Thank you.

    • @JuliaSilge 1 year ago

      Not currently, no! You might be interested in learning how to create a parsnip model for it, like this:
      www.tidymodels.org/learn/develop/models/
      Feel free to ask on GitHub or RStudio Community if you run into problems!

    • @yangyang6008 1 year ago +1

      @JuliaSilge Thank you Julia, I will try to create a parsnip model for ELM. Hopefully Tidymodels will be updated to include the algorithm in the future, as ELM is very popular nowadays in machine learning.