Get started with tidymodels and classification of penguin data

  • Added 11 September 2024
  • Learn how to use the tidymodels packages in R for modeling and machine learning with #TidyTuesday data on penguins. Check out the code on my blog: juliasilge.com...

Comments • 82

  • @HamJeong
    @HamJeong 4 years ago +39

    I can't emphasize enough how useful this content is, from the screencast style, to the focus on using specific packages to the insight into the modelling process. I really love it, hope it keeps coming!!

  • @SC-pm7zd
    @SC-pm7zd 2 years ago

    Perfect YouTube content for people who want to learn how to analyze data using R in an elegant way.

  • @Simonsayztaga
    @Simonsayztaga 4 years ago +7

    30 minutes is the sweet spot!! Ur awesome @julia

  • @dasrotrad
    @dasrotrad 4 years ago +5

    What a great video Julia! Thank you for such wonderful introduction to ML and for sharing your knowledge. You are indeed, awesome.

  • @kentico1234
    @kentico1234 4 years ago +2

    Great job, Julia .... you put a lot of effort into this very worthwhile endeavor!

  • @mocabeentrill
    @mocabeentrill 8 months ago

    Clearly explained and straight to the point! Thank you Julia.

  • @edGoldi
    @edGoldi 4 years ago +1

    Many thanks Julia!!! can't wait for the next video!!!

  • @gabrielrosa9738
    @gabrielrosa9738 4 years ago +2

    Excellent! The content is very useful and your way of going through it makes it easy to grasp. Thank you!

  • @WhySoBroke
    @WhySoBroke 2 years ago

    Superbly done!! Will rewatch a couple times, lots to learn! Many thanks Julia!! ❤️🇲🇽❤️

  • @dariyatukhmetova1172
    @dariyatukhmetova1172 3 years ago +1

    amazing tutorial, thank you. Love how you give interesting explanation for each output value of the model.

  • @averyrobbins68
    @averyrobbins68 4 years ago +2

    Very helpful! Thank you very much for doing these videos. `tidy(exponentiate = TRUE)` was a new one for me. Very useful.

  • @danielalvarezmd
    @danielalvarezmd 4 years ago +1

    Great video Julia. You are the best. Thank u very much!!!

  • @julietterose5753
    @julietterose5753 2 years ago

    Thank you so much for this video. Appreciate it. It is so helpful to see how it works actually

  • @yussifmohammed9324
    @yussifmohammed9324 2 years ago

    Thanks Julia, would like to see more

  • @socratesoliveira1176
    @socratesoliveira1176 3 years ago

    Very clear and easy to follow, so useful! Thank you very much!

  • @cb5231
    @cb5231 6 months ago

    thanks for this video Julia

  • @marianklose1197
    @marianklose1197 1 year ago

    great tutorial!

  • @sophiej4605
    @sophiej4605 3 years ago

    Great for getting started with tidymodels!!

  • @cgmiguel
    @cgmiguel 3 years ago +1

    Excellent video and content, as usual! One quick question though: what do you mean by being easier to deploy a logistic regression model than a random forest?

    • @JuliaSilge
      @JuliaSilge 3 years ago +2

      I was thinking about how a logistic regression model is linear so you don't need to get an R object deployed somewhere to make predictions; you can just use a flat file of model coefficients that could be incorporated into any kind of production system (no R necessary) pretty easily.
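
A minimal sketch of the flat-file idea from the reply above, in base R. It assumes the built-in `mtcars` data as a stand-in for the penguins model; the file name and the `score()` helper are hypothetical. The scoring step only needs the exported coefficients and arithmetic, so it could be ported to any system.

```r
# Fit a logistic regression (stand-in for the penguins model in the video)
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Export the coefficients to a flat file (hypothetical file name)
coef_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(term = names(coef(fit)), estimate = unname(coef(fit))),
  coef_file,
  row.names = FALSE
)

# "Production" side: read the flat file and apply the inverse logit by hand
coefs <- read.csv(coef_file)
beta <- setNames(coefs$estimate, as.character(coefs$term))

score <- function(hp, wt) {
  eta <- beta[["(Intercept)"]] + beta[["hp"]] * hp + beta[["wt"]] * wt
  1 / (1 + exp(-eta))
}

# Predicted probabilities computed without the model object
manual <- score(mtcars$hp, mtcars$wt)
```

The hand-computed probabilities should agree with `predict(fit, type = "response")` up to rounding in the CSV.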

  • @malkhaz.jokhadze
    @malkhaz.jokhadze 4 years ago +3

    Dear Julia, I want to ask you how do you execute a markdown code in the console, I mean what key do you use for that purpose. Thank you in advance.

    • @JuliaSilge
      @JuliaSilge 4 years ago +8

      That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line
      In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.

    • @PatrickBateman12420
      @PatrickBateman12420 4 years ago

      @@JuliaSilge thanks a lot Julia!

  • @TURALOWEN
    @TURALOWEN 2 years ago

    Amazing lecture! Thank you!

  • @crgIN07
    @crgIN07 4 years ago +1

    Really great, thank you! Do you have plans to do time series analysis or an SVM model?

  • @upendra8050
    @upendra8050 4 years ago +1

    Dear Julia, great video, and I learned a lot about tidy models today. I have a couple of questions.
    1. For tree-based models, I can use feature importance and packages such as SHAP for interpreting them. Is this something that we can do with linear models such as logistic regression? Or in other words, can we assume coefficients of features in linear models to be the same as feature importance in tree-based models?
    2. From your analyses, you found that the bill depth is the most important feature that differentiates the sexes. Can we come up with rules/cut-offs using which we can say whether a particular bill depth corresponds to a male penguin or female penguin?
    Thanks in advance.

    • @JuliaSilge
      @JuliaSilge 4 years ago +2

      Absolutely, the coefficients of a linear model give you analogous information to feature importance of a tree model. In fact, they are *better* in terms of feature importance because they literally are just which features are most important for your model, directly.
      If you want a set of rules, I would use a specific model for that: www.tidyverse.org/blog/2020/05/rules-0-0-1/

    • @upendra8050
      @upendra8050 4 years ago

      @@JuliaSilge Thanks Julia.
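
One way to read coefficients as feature importance, sketched in base R on the built-in `mtcars` data (an assumption; the video uses the penguins data). Standardizing the predictors first is a choice made here so that coefficient sizes are comparable across features; it is not something stated in the thread.

```r
# Put predictors on a common scale so coefficient magnitudes are comparable
d <- transform(
  mtcars,
  hp = as.numeric(scale(hp)),
  wt = as.numeric(scale(wt))
)

fit <- glm(am ~ hp + wt, data = d, family = binomial)

# Change in log-odds per one standard deviation, largest effect first
coefs <- coef(fit)[-1]
importance <- sort(abs(coefs), decreasing = TRUE)

# Odds ratios; broom::tidy(fit, exponentiate = TRUE) reports the same quantity
odds_ratios <- exp(coefs)

importance
odds_ratios
```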

  • @jaredminetola
    @jaredminetola 4 years ago +1

    Hi Julia, I'm newISH to R and VERY new to predictive modelling in R. I really enjoy watching your videos! I'm wondering if you would start over and exclude Flipper_Length_mm from this model (if you were actually going to use this going forward) since it had a higher P Value in your summary statistics. Thanks!

    • @JuliaSilge
      @JuliaSilge 4 years ago +6

      That would be like one step of "stepwise regression", basically, and stepwise regression has a lot of problems when applied in general. However, in real life problems (where the goal was prediction, i.e. good model fit), I probably *would* try the model without the insignificant variable term to see if it still fit about as well and then I would pick the simpler model if it did.

    • @jaredminetola
      @jaredminetola 4 years ago

      @@JuliaSilge Thanks for the quick reply!
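
The "try the simpler model and see if it fits about as well" check from the reply above could look roughly like this. It is a sketch that assumes the `two_class_dat` example data from the modeldata package in place of the penguins data; the predictors `A` and `B` are that dataset's columns, not the video's.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(123)
folds <- vfold_cv(two_class_dat, v = 10, strata = Class)

glm_spec <- logistic_reg() %>% set_engine("glm")

# Same resamples for both models so the comparison is fair
full_rs <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(glm_spec) %>%
  fit_resamples(folds)

reduced_rs <- workflow() %>%
  add_formula(Class ~ B) %>%
  add_model(glm_spec) %>%
  fit_resamples(folds)

# Side-by-side resampled metrics; prefer the simpler model if they are close
comparison <- bind_rows(
  full = collect_metrics(full_rs),
  reduced = collect_metrics(reduced_rs),
  .id = "model"
)
comparison
```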

  • @carolinemimeault3668
    @carolinemimeault3668 4 years ago

    Thank you so much for making those videos!

  • @andrewnguyen3312
    @andrewnguyen3312 1 year ago

    Great video ty so much

  • @rank4816
    @rank4816 3 years ago

    Really instructive video, thank you!

  • @jakebersabe6511
    @jakebersabe6511 2 years ago

    Thank you!

  • @maxcopa83
    @maxcopa83 4 years ago

    I wonder what the results would be if the independent fields were dummy coded. Great code as always.

  • @raydePay
    @raydePay 2 years ago

    Would it be useful to compare the prediction weights ("probs", I think in caret) where rf and glm diverge? So, if glm-pos > rf-neg the outcome is glm, else rf?

  • @kenkoonwong2166
    @kenkoonwong2166 4 years ago

    thank you. very helpful!

  • @buraktiras93
    @buraktiras93 1 year ago

    Thanks for the content! I have a question. How can we change the cutoff value in glm when we use tidymodels?

    • @JuliaSilge
      @JuliaSilge 1 year ago

      Do you mean using the probability threshold to decide what label to predict? You can get out the probabilities via `type = "prob"` and can go from there as you wish, or you may be interested in using probably:
      probably.tidymodels.org/
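
A sketch of the `type = "prob"` route mentioned in the reply above, assuming the `two_class_dat` example data from modeldata and a hypothetical cutoff of 0.7. The column names `.pred_Class1` and `.pred_Class2` come from that dataset's factor levels; with the penguins data they would be named after the sex levels instead.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

glm_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Class ~ A + B, data = two_class_dat)

# Hypothetical cutoff; the default hard-class prediction corresponds to 0.5
threshold <- 0.7

preds <- predict(glm_fit, new_data = two_class_dat, type = "prob") %>%
  mutate(
    .pred_custom = factor(
      if_else(.pred_Class1 >= threshold, "Class1", "Class2"),
      levels = levels(two_class_dat$Class)
    )
  )

# Compare the custom labels against the true classes
table(truth = two_class_dat$Class, predicted = preds$.pred_custom)
```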

  • @brendanmcewen7190
    @brendanmcewen7190 1 year ago

    Around minute 22:00 you're mentioning that the (generalized) linear model did just as well at classifying sex as the random forest model, despite not being able to identify interactions (e.g. a flipped dimorphism for one of the species). Isn't this rather expected, though, as the dataset itself contained no interactions between sex and the other identifying characteristics? Would the RF model have performed better if, say, one of the species had an inverse relationship between sex and flipper/beak dimensions?

    • @JuliaSilge
      @JuliaSilge 1 year ago +1

      I think it's a little strong to say there are *no* interactions in the penguins dataset, as for example the slope for bill depth vs length isn't the same for all species and/or sexes. However, yep, the fact that the linear model performs just as well does indicate that any interactions aren't that important and we would expect a random forest model to do better when there are more important interactions.

    • @brendanmcewen7190
      @brendanmcewen7190 1 year ago

      @@JuliaSilge Gotcha, that makes sense. Thanks for the reply on a two year old video! Ben Bolker recommended I look into TidyModels, so I've been watching lots of your videos. Very clear and informative!

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 3 years ago

    Interesting point about not building a classification model for species. However, perhaps a model classification would work better than one made by a biologist. I would think that a model would definitely do a better job than a beginner or amateur. The classification of any sort of thing, be it a rock or a bird, is often fraught with mistakes.

  • @maksim0933
    @maksim0933 4 years ago

    I have a very silly question: for the practical purpose of filling in missing values in this particular dataset (setting aside all the great regressions), wouldn't it be better to fill the NAs with the help of a package, for example mice?

    • @JuliaSilge
      @JuliaSilge 4 years ago +1

      Here, sex is the thing we are predicting so we would need to be careful using the predictors to impute the outcome and then also to predict the outcome. If on the other hand you want to use imputation for predictors, tidymodels has a number of functions for that in the recipes package: recipes.tidymodels.org/reference/index.html#section-step-functions-imputation
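
For imputing predictors (not the outcome) with recipes, as the reply above suggests, a small sketch follows. The toy data frame is made up for illustration, and median imputation is just one of the available imputation steps.

```r
library(recipes)

# Toy data (hypothetical): two numeric predictors with missing values
df <- data.frame(
  y  = factor(c("a", "b", "a", "b", "a", "b")),
  x1 = c(1, NA, 3, 4, 5, 6),
  x2 = c(10, 20, NA, 40, 50, 60)
)

# Impute only the predictors; the outcome y is left untouched
rec <- recipe(y ~ ., data = df) %>%
  step_impute_median(all_numeric_predictors())

# prep() learns the medians from the training data, bake() applies them
baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

baked
```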

  • @felipetorres4464
    @felipetorres4464 4 years ago +1

    Hi Julia. Why is this video called "unknown"?

  • @deiro04
    @deiro04 2 years ago

    Amazing

  • @sotirismargaritis4965
    @sotirismargaritis4965 4 years ago

    May I ask what lines 9 to 15 do?
    Is theme_set(theme_plex()) from the rstheme package, which defines the RStudio theme?
    Thank you very much

    • @JuliaSilge
      @JuliaSilge 4 years ago +4

      theme_set() is for ggplot2, to set what the plots look like: ggplot2.tidyverse.org/reference/theme_get.html
      The part above that sets options for knitr chunks, such as whether to cache results, whether to print messages and warnings, what size to print figures, etc. You can read more about knitr chunk options here: yihui.org/knitr/options/

    • @sotirismargaritis4965
      @sotirismargaritis4965 4 years ago

      @@JuliaSilge Thank you very much for the quick response. I hope you will make in the future some interactive courses like supervised ml case studies
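
A setup chunk of the kind discussed in this thread might look like the sketch below. The specific option values are assumptions, not the ones from the video, and `theme_minimal()` stands in for the video's `theme_plex()` so that the example only needs knitr and ggplot2.

```r
library(knitr)
library(ggplot2)

# Chunk options: caching, suppressing warnings/messages, figure size
# (values here are illustrative assumptions)
opts_chunk$set(
  cache = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 8,
  fig.height = 5
)

# Default look for every ggplot2 plot created after this line
theme_set(theme_minimal())
```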

  • @yujuansun8522
    @yujuansun8522 2 years ago

    Your video is so useful! I use the same method as yours but I got this Error message when I use fit_resamples "Error: For a classification model, the outcome should be a factor." Do you know how to fix this problem? Thanks in advance!!!

    • @JuliaSilge
      @JuliaSilge 2 years ago

      It sounds like you may be fitting a classification model to data with a numeric outcome. Try choosing a model that is a good fit for your particular data, like a regression model if you have a numeric outcome.

  • @Mohamed-sq8od
    @Mohamed-sq8od 3 years ago

    you are awesome

  • @elOtorongo96
    @elOtorongo96 3 years ago

    Awesome

  • @selecta_ssbm
    @selecta_ssbm 3 years ago

    Love this! However, I got an error at the last step with the following:
    Error: No tidy method for objects of class ranger

    • @JuliaSilge
      @JuliaSilge 3 years ago

      Seems like you tried to tidy the random forest instead of the logistic regression model. A random forest model doesn't have simple coefficients so can't be tidied in the same way that a logistic regression model can.
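
A random forest has no coefficients to tidy, but the ranger engine can report variable importance instead. The sketch below is one option, not what the video does; it assumes the `two_class_dat` example data from modeldata and the ranger package being installed.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(123)
rf_fit <- rand_forest(mode = "classification") %>%
  # importance = "impurity" is passed through to ranger::ranger()
  set_engine("ranger", importance = "impurity") %>%
  fit(Class ~ A + B, data = two_class_dat)

# Pull out the underlying ranger object and its importance scores
importance <- extract_fit_engine(rf_fit)$variable.importance
sort(importance, decreasing = TRUE)
```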

  • @byronpop2
    @byronpop2 4 years ago

    Hi @julia, I love your videos! Thank you so much for making them. I am following along and using my own data for some modeling and unfortunately when I try to train the random forest model with:
    rf_rs %
    add_model(rf_spec) %>%...
    I get the following error: "model: Error: spark objects can only be used with the formula interface to `fit()` with a spark data object."
    Any idea what might be going on? For context, my data is described below:
    tibble [4,428 × 12] (S3: tbl_df/tbl/data.frame)

    $ deployment : Factor w/ 13 levels
    $ realty_status : Factor w/ 2 levels "opted IN","opted OUT":
    $ property_county : Factor w/ 356 levels "
    $ property_state : Factor w/ 44 levels
    $ loan_amount : num [1:4428]
    $ total_income : num [1:4428]
    $ age : num [1:4428]
    $ n_schooling_years : num [1:4428]
    $ n_owned_properties: num [1:4428]
    $ n_dependents : num [1:4428]
    $ device_type_start : Factor w/ 4 levels
    $ completion_time : 'difftime' num

    • @JuliaSilge
      @JuliaSilge 4 years ago

      I don't think that I can get enough info in the comments here to help. Can you post on RStudio Community with a little more detail (preferably a whole reprex, if possible) so we can check it out and see what's going on? rstd.io/tidymodels-community

  • @farnooshsheikhi
    @farnooshsheikhi 4 years ago

    Thank you Julia. This was really helpful. Quick question, do you always create a balanced data where you have the same number of cases and controls before modeling and then resample from that data set? I was wondering if this is a general approach to build predictive models. Thank you again. I love your videos :)

    • @JuliaSilge
      @JuliaSilge 4 years ago +1

      I don't think it's best practice to *always* create a balanced training set, but often this is a helpful preprocessing step to build a model that can learn to recognize both, say, the majority and minority classes. One important note is that it is best to resample the original, imbalanced dataset, and then do the over/undersampling on the resamples, to avoid data leakage. In tidymodels, we have tools for dealing with imbalanced data in the themis package:
      themis.tidymodels.org/

    • @farnooshsheikhi
      @farnooshsheikhi 4 years ago

      @@JuliaSilge thank you so much for getting back to me. I'll check the themis package out :)

    • @TheFrankyguitar
      @TheFrankyguitar 4 years ago +1

      I use the SMOTE algorithm contained in themis package. You just have to add one line in your recipe: step_smote(your_response_variable, smote_parameters).
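
The advice in this thread (resample the original imbalanced data, then oversample within each resample) can be sketched as below. The simulated data and its 20/180 class split are made up for illustration. Putting `step_smote()` inside the recipe of a workflow means it is applied to each analysis set separately, and because the step is skipped at prediction time by default, the assessment sets keep their original class balance.

```r
library(tidymodels)
library(themis)

# Simulated imbalanced data (hypothetical): 20 minority vs 180 majority rows
set.seed(123)
imbal <- tibble(
  y  = factor(c(rep("minority", 20), rep("majority", 180))),
  x1 = rnorm(200),
  x2 = rnorm(200)
)

# Resample the ORIGINAL imbalanced data
folds <- vfold_cv(imbal, v = 5, strata = y)

# SMOTE lives in the recipe, so it runs inside each resample
rec <- recipe(y ~ ., data = imbal) %>%
  step_smote(y)

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

res <- fit_resamples(wf, resamples = folds)
metrics <- collect_metrics(res)
metrics
```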

  • @jamespaz4333
    @jamespaz4333 3 years ago

    Great presentation! How can I include grid search into my recipes?

    • @JuliaSilge
      @JuliaSilge 3 years ago +1

      You can tune many recipe parameters, in much the same way you tune model parameters. You can check out some examples here:
      www.tidymodels.org/learn/work/tune-text/
      And here:
      www.tidymodels.org/learn/work/bayes-opt/

    • @jamespaz4333
      @jamespaz4333 3 years ago

      @@JuliaSilge amazing! Thank you!!!!!
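
Tuning a recipe parameter with a grid works much like tuning a model parameter: mark the argument with `tune()` and pass a grid to `tune_grid()`. The sketch below assumes the `two_class_dat` example data from modeldata and tunes the degrees of freedom of a natural spline; the grid values are arbitrary.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(123)
folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

# deg_free is a recipe parameter marked for tuning
rec <- recipe(Class ~ A + B, data = two_class_dat) %>%
  step_ns(A, deg_free = tune())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Candidate values (arbitrary); the column name matches the tuned parameter
grid <- tibble(deg_free = c(2, 4, 6))

tuned <- tune_grid(wf, resamples = folds, grid = grid)
tuned_metrics <- collect_metrics(tuned)
show_best(tuned, metric = "roc_auc")
```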

  • @sabbamussadiq9818
    @sabbamussadiq9818 2 years ago

    Ma'am,
    Could you kindly teach how to plot 2 or 3 variables on the same ROC curve graph in SPSS for easy visual comparison, like you did in this video? Though this does not look like SPSS.

    • @JuliaSilge
      @JuliaSilge 2 years ago

      Well, it definitely is not SPSS! 😁 If you can outline in detail more of what you are trying to do with a reproducible example, I suggest you post on RStudio Community where folks will be able to help you:
      rstd.io/tidymodels-community

    • @sabbamussadiq9818
      @sabbamussadiq9818 2 years ago

      @@JuliaSilge Well, thank you for the reply, Ma'am.
      I am comparing 2 biomarkers in a disease diagnosis, so I needed a ROC curve, but I was not able to plot both on the same graph like you did (plotting many ROC curves on one graph).
      I will look at the site you mentioned. Thank you.

  • @nadiamekhloufi8744
    @nadiamekhloufi8744 1 year ago

    Could you please share the script code with us?

    • @JuliaSilge
      @JuliaSilge 1 year ago +1

      Check out the description here on YouTube, where I always include that info:
      juliasilge.com/blog/palmer-penguins/

  • @oddsratio4070
    @oddsratio4070 1 year ago

    It's confusing that in all your other videos you use `recipes`, but not here?

    • @JuliaSilge
      @JuliaSilge 1 year ago

      If you want to learn about using a formula vs. a recipe, I recommend checking out these sections of our book:
      www.tmwr.org/base-r.html#formula
      www.tmwr.org/workflows.html#workflow-encoding
      www.tmwr.org/recipes.html

    • @oddsratio4070
      @oddsratio4070 1 year ago

      @@JuliaSilge Thanks! I am also ordering the book in hardcopy on Amazon today :)
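
To make the formula-versus-recipe distinction from this thread concrete, here is a sketch of the same logistic regression specified both ways in a workflow. It assumes the `two_class_dat` example data from modeldata; the normalization step is only there to show where preprocessing would go.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

glm_spec <- logistic_reg() %>% set_engine("glm")

# Option 1: a plain formula, no extra preprocessing
wf_formula <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(glm_spec)

# Option 2: a recipe, which can carry preprocessing steps
rec <- recipe(Class ~ A + B, data = two_class_dat) %>%
  step_normalize(all_numeric_predictors())

wf_recipe <- workflow() %>%
  add_recipe(rec) %>%
  add_model(glm_spec)

fit_formula <- fit(wf_formula, data = two_class_dat)
fit_recipe <- fit(wf_recipe, data = two_class_dat)

prob_formula <- predict(fit_formula, two_class_dat, type = "prob")
prob_recipe <- predict(fit_recipe, two_class_dat, type = "prob")
```

For a plain logistic regression, centering and scaling only reparametrize the model, so both workflows should give essentially the same predicted probabilities. A recipe becomes more useful once the preprocessing does something a formula cannot express easily.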

  • @Blackhole-yy6yq
    @Blackhole-yy6yq 4 years ago

    i love you julia.. how r u today