Impute missing data and handle class imbalance for Himalayan climbing expeditions

  • Uploaded 14. 07. 2024
  • Watch along to understand how to use tidymodels packages in R for predicting survival from #TidyTuesday data on climbing expeditions in the Himalayas. Check out the code on my blog: juliasilge.com/blog/himalayan...
  • Science & Technology

Comments • 46

  • @tiernanmartin3729
    @tiernanmartin3729 3 years ago +9

    Very accessible and easy to follow (as always), and I am glad that you spent some time at the end interpreting the model outputs. All of your #TidyTuesday videos are informative, but I think this one does a particularly good job at demonstrating how the tidyverse framework really facilitates the transform-visualize-model-REPEAT workflow. Keep up the great work!

  • @tanguym8312
    @tanguym8312 a year ago

    Hello, I wanted to say a HUGE thank you for your video, which helped me do oversampling in my statistics project ❤ +1👍

  • @Markste-in
    @Markste-in 3 years ago +2

    Nice vid, and thanks for taking the time to also make a nice write-up and blog post with explanations and code! Keep up the good work.

  • @brendenmorley2643
    @brendenmorley2643 3 years ago

    Another great video... The #TidyTuesday flow is great. Between you and your comrades I understand so much more, and I have grown immensely in the R world. My boss and I thank you!

  • @garyboy7135
    @garyboy7135 3 years ago

    Love this tutorial, thank you! Now I realize how to deal with logical variables when using SMOTE, and why using simple upsampling with a powerful model engine would yield an overly optimistic evaluation on the training set.

  • @alexandroskatsiferis
    @alexandroskatsiferis 3 years ago +2

    Hello Julia! Very nice demonstration, and of course a wise choice of logistic regression, which comes with nice interpretability. I would really like to see another video that combines classification modelling with a fairness implementation. The fairmodels package could be used, especially for variables related to ethnicity, sex, etc.

  • @maximverbal1083
    @maximverbal1083 3 years ago +1

    Thank you Julia for everything you share with us.
    Nevertheless, since I have been studying missing data for a while, I have to say that whenever you apply imputation, you should never use mean or median imputation, even if the data are MCAR.
    Also, in most cases the data are MAR or MNAR. If you don't want to get involved in multiple imputation, you can simply use stochastic regression or KNN imputation, provided the proportion of missingness is less than 30% and the data are MAR.
    You can check whether the data are MCAR using Little's MCAR test.
    In the worst case, when the data are MNAR, there is not much you can do.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      For sure this is a situation when it is missing not at random, as the ages are missing for the hired expedition members (like the native Sherpa climbers) at a much higher rate than folks who came to climb the peaks from elsewhere. A tough situation for imputation, and I should have been more careful to point out the problems. 👍
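A minimal sketch of the KNN imputation discussed in this thread, using the recipes package; `members_train`, the `age` column, and `neighbors = 5` are illustrative choices following the video's dataset:

```r
library(recipes)

# Sketch: impute missing age values from the other predictors with
# K-nearest neighbors (older recipes versions call this step_knnimpute())
members_recipe <- recipe(died ~ ., data = members_train) %>%
  step_impute_knn(age, neighbors = 5)
```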

  • @Mr_nn23
    @Mr_nn23 3 years ago

    Thank you for sharing! Keep up the great work!

  • @marianklose1197
    @marianklose1197 a year ago

    thanks a lot for these great tutorials!

  • @briancostello939
    @briancostello939 3 years ago

    Great video as always! I am curious why you decided not to remove the older observations, as they clearly don’t represent new observations. I ask because I would envision a use-case for something like this to be predicting if someone will die before undertaking the expedition (obviously this would lead to upsampling being more relevant).

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      I was interested in a big picture view of the whole historical dataset, but this would be a great way to go for the use case you describe here.

  • @mcrosignani
    @mcrosignani 8 months ago

    Hi Julia, I have tried the step_downsample(died) alternative and it worked a bit better on all metrics. Maybe not enough to compensate for the applicability of a logistic regression.

  • @jansenai6764
    @jansenai6764 3 years ago +1

    Again, great video Julia, this was very helpful! I would like to ask some questions, because I got confused at the end about how to interpret the odds ratios. Here goes my interpretation (which is probably wrong, haha): peak_id_EVER: 0.294 means that a climber on Mount Everest is 0.294 times as likely to survive as on other peaks (which is bad). season_Summer: 7.070 means that a climber in summer is 7.070 times as likely to survive as in other seasons. Does this mean that the logistic regression model treats "survived" as the positive case? Following on from this, I'm curious how R decides which category of the target variable to choose as the positive case. I think this is crucial for model interpretation; for one, specificity refers to the positive case R chooses, and not knowing which positive case your model chooses will lead to misinterpretation. Your response will be very much appreciated! Keep it up!

    • @JuliaSilge
      @JuliaSilge  3 years ago +4

      This is very close to right. The odds ratios are compared to the base levels, so for Everest, it is compared to the base level, which is peak_id_CHOY. For seasons, it is compared to that base level, which was autumn. If you look at the levels of `died`, it is "died" first and then "survived", which for most R models makes "survived" the event of interest or positive case. This can be tricky, though. In yardstick you can handle this level more explicitly with the `event_level` argument:
      yardstick.tidymodels.org/dev/reference/sens.html

    • @jansenai6764
      @jansenai6764 3 years ago

      @@JuliaSilge Thanks for this clarification it helped a looooot! I hope you'll continue creating high quality Data Science content!
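A small sketch of the `event_level` argument referenced in the reply above; `glm_preds` is a hypothetical tibble of predictions with a two-level `died` truth column ("died" first, "survived" second) and a `.pred_class` estimate column:

```r
library(yardstick)

# yardstick treats the first factor level as the event by default;
# event_level = "second" makes the second level ("survived") the event
sens(glm_preds, truth = died, estimate = .pred_class,
     event_level = "second")
```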

  • @singhvaibhav033
    @singhvaibhav033 3 years ago

    Great video Julia! Why didn't you consider the p-values while plotting the estimates? I saw all of them were 0 at some point in the video. (Are all of them significant?)

    • @JuliaSilge
      @JuliaSilge  3 years ago

      In this case, yes, all of those p-values were very small. This would definitely be something to check! In the case of this dataset, the predictive capacity of the model is more worrisome than how statistically significant the coefficients are. You can read a bit more about this here:
      www.tmwr.org/software-modeling.html#predictive-models
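One way to make that check, sketched with `glm_fit` as a placeholder name for the fitted logistic regression:

```r
library(broom)
library(dplyr)

# Coefficients as odds ratios, alongside standard errors and p-values
glm_fit %>%
  tidy(exponentiate = TRUE) %>%
  arrange(p.value)
```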

  • @jaradj876
    @jaradj876 3 years ago

    Hi Julia, thank you so much for these walkthroughs! I have a question about this one. When you prepped and baked the recipe, here you did those steps from the console. Is there a reason you didn’t put “prep” and “bake” into the tail end of the recipe chunk or as a separate code chunk?

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      I used `prep()` and `bake()` to make sure everything was going as I expected with my recipe, but since I was using a workflow, I didn't need to do anything to the recipe itself to start using it in modeling. Using workflows takes care of the low-level recipe details. You can read a little more about that here:
      www.tmwr.org/workflows.html#workflows-and-recipes

    • @jaradj876
      @jaradj876 3 years ago

      Ahhhh, ok, so the workflow will actually calculate the steps detailed in the recipe. I watched the part where the workflow ran and it makes sense now, thank you again! Happy new year!
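A sketch of the console check discussed in this thread, with `members_recipe` standing in for the recipe from the video:

```r
library(tidymodels)

# prep() estimates the recipe on the training data;
# bake(new_data = NULL) returns the processed training set for inspection
members_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```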

  • @AliGunerMD
    @AliGunerMD 3 years ago

    Thanks Julia! This video and the penguin video are very informative for logistic regression modeling.
    Not for accuracy or ROC AUC, but for some other metrics, including NPV, sensitivity, and specificity, the levels of the target variable seem important.
    What is the default selection for the "positive class" in tidymodels? Should we change it at the beginning (with fct_relevel), or is there a recipe step for this?
    Thank you for your efforts.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      In tidymodels, the default is to consider the first level the level "of interest", but you can change that, either by changing the levels, like you mentioned via something like `fct_relevel()`, or from within the yardstick functions themselves. For example, check out the `event_level` argument for sensitivity:
      yardstick.tidymodels.org/reference/sens.html

    • @AliGunerMD
      @AliGunerMD 3 years ago

      @@JuliaSilge Thank you for your time, Julia. You (and your modeling approach) are great. `event_level` is definitely what I was looking for.
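A sketch of the `fct_relevel()` approach mentioned in this thread, assuming a `died` factor with levels "died" and "survived":

```r
library(dplyr)
library(forcats)

# Put "survived" first so it becomes the default level of interest
members <- members %>%
  mutate(died = fct_relevel(died, "survived"))
```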

  • @JerryWho49
    @JerryWho49 3 years ago

    Thank you for this video! I've got a question about possible information leakage: when you set up your recipe, you pass the training data to it. But when the recipe runs for each resample, we would like the data of each fold to be used for imputation. Does tidymodels ensure that this happens? In your blog post you write that nothing is computed when we define the recipe, but why do we pass the training data here if it isn't used?
    Thanks again for all of your videos. I like them. Keep going.

    • @JerryWho49
      @JerryWho49 3 years ago

      Okay, I think I've found the answer to my question myself: recipe() uses only the structure of the data (www.tidymodels.org/start/recipes/)

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      @@JerryWho49 Yes, that is exactly right! When fitting on resampled data, the preprocessing recipe is estimated on the "analysis" (like training) part of the resample and then evaluated on the "assessment" (like testing) part of the resample. Avoiding information leakage is a big part of the design of recipes (and other tidymodels packages) and why a lot of it works the way it does.
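A sketch of the setup being discussed, with placeholder object names; because the recipe lives inside a workflow, it is re-estimated within each resample:

```r
library(tidymodels)

members_wf <- workflow() %>%
  add_recipe(members_recipe) %>%                   # recipe built on members_train
  add_model(logistic_reg() %>% set_engine("glm"))

members_folds <- vfold_cv(members_train, strata = died)

# Preprocessing (including imputation) is estimated on the analysis
# portion of every fold, so nothing leaks from the assessment portion
glm_rs <- fit_resamples(members_wf, resamples = members_folds)
```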

  • @vineetsansi
    @vineetsansi 3 years ago

    Hi Julia, thanks for making and sharing great learning content; it is really helpful. I have a small doubt here: aren't we using "died" as the positive class? I confirmed this using caret's confusionMatrix, and you helped me implement the same on Stack Overflow, so thanks again for helping.
    So if "died" is our positive class, doesn't that mean that "citizenship_UK" and "citizenship_US" are more likely to die than "citizenship_Nepal"?
    I mean, the positive estimates in the last plot should reflect coefficients for dying, but that would make summer more dangerous to climb in, which is not true, as we saw in the descriptive analysis.
    I am new to R, so it may be a silly/stupid question to ask.

    • @JuliaSilge
      @JuliaSilge  3 years ago

      Not a silly question! The confusion matrix function in caret is using a different "rule" for which level to use as the positive class than the yardstick functions. To get the right answer, change it via the `positive` argument to the `confusionMatrix` function.

    • @vineetsansi
      @vineetsansi 3 years ago

      @@JuliaSilge Thanks for clearing the doubt. Now it makes sense to me :)

  • @grvsrm
    @grvsrm 3 years ago +1

    Hey Julia, thank you so much for another useful screencast. I just have a small doubt and would appreciate your response. The success of the expedition has been used here as a feature, but it seems to me to be another outcome. I mean, at the time of predicting whether someone will die or not, we won't have the information about whether the mission was a success, so we shouldn't use it as a feature. Your views, please.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      That is a great point, and especially important if you are building a model for purely predictive purposes. Models can be built for several purposes:
      www.tmwr.org/software-modeling.html#types-of-models
      There are some situations where it makes sense to include expedition success (understanding how it impacts likelihood of death, controlling for other factors) but if the goal is to build a model for predictive purposes, then yes, you want to think carefully about what data you will have at the time of prediction for new data.

    • @grvsrm
      @grvsrm 3 years ago

      Thanks Julia. That explains quite everything I wanted. 👍🏻
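For the purely predictive use case, one hedged option (names follow this dataset) is to keep `success` in the data but remove it from the predictor set:

```r
library(recipes)

# Demote success from predictor to an identification-style role
members_recipe_pred <- recipe(died ~ ., data = members_train) %>%
  update_role(success, new_role = "id")
```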

  • @zegpi1821
    @zegpi1821 3 years ago +1

    Calling members_recipe %>% prep() %>% bake(new_data = NULL) returns the following error:
    Please pass a data set to `new_data`.
    Does anybody know about this? I'm using recipes v0.1.13.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      Sorry about that! We're working on a new version of recipes which takes care of that; you can install using devtools::install_github("tidymodels/recipes"). If you can't install GitHub packages, you could use juice() instead for the time being.
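For reference, the two calls on a prepped recipe; the `juice()` fallback applies to older recipes versions such as the 0.1.x mentioned above:

```r
library(recipes)

# Current recipes versions:
members_recipe %>% prep() %>% bake(new_data = NULL)

# Older versions, where the line above errors:
members_recipe %>% prep() %>% juice()
```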

  • @davidjackson7675
    @davidjackson7675 3 years ago +2

    Remember there are "old climbers" and "bold climbers" but no "old bold climbers" ....

  • @shauryamehta5339
    @shauryamehta5339 a year ago

    Hiii, I just have a question: can you help me with when you should tune parameters in a random forest and when you can avoid it?

    • @JuliaSilge
      @JuliaSilge  a year ago

      A random forest model typically performs quite well even without tuning (as long as there are enough trees), but you can typically squeeze out a little better performance by tuning. You don't tune the number of trees, though. You can read a bit more here:
      stats.stackexchange.com/questions/344220/how-to-tune-hyperparameters-in-a-random-forest
      stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest/
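A sketch of tuning a random forest along these lines in tidymodels; the engine, grid size, and fixed tree count are illustrative, and `members_folds` is a placeholder set of resamples:

```r
library(tidymodels)

# Fix the number of trees; tune mtry and min_n on resamples
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(died ~ .) %>%
  add_model(rf_spec)

rf_res <- tune_grid(rf_wf, resamples = members_folds, grid = 10)
select_best(rf_res, metric = "roc_auc")
```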

  • @damp8277
    @damp8277 3 years ago

    Hi, Julia! I have followed along step by step, but when I try to run "glm_rs" it shows an error:
    internal: Error: In metric: `accuracy`
    Problem with `summarise()` input `.estimate`.
    x `estimator` is binary, only two class `truth` factors are allowed. A factor with 1 levels was provided.
    i Input `.estimate` is `metric_fn(truth = died, estimate = .pred_class, na_rm = na_rm)`.
    I'm using the latest tidymodels and I'm on Windows 10, but I can't seem to find the reason for the error. I even copied the chunks from your blog to make sure it wasn't a typo, but I get the same error.

    • @JuliaSilge
      @JuliaSilge  3 years ago +1

      I just re-downloaded the data and ran the code, so I believe everything still works in this screencast. That error sounds like there is a problem with the outcome `died` itself, so take a look at what is happening with that column. Try using `prep()` and `bake()` on your recipe to see if something has perhaps gone wrong with that.

    • @damp8277
      @damp8277 3 years ago

      @@JuliaSilge It worked! I was in zombie mode and just copied the "Build the model" portion, but the error was "mutate(died = case_when(died ~ "survived", TRUE ~ "survived"))", so no one had died (good news). I'm loving this video, because in geology missing data and, especially, class imbalance are very common.
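For anyone hitting the same copy-paste slip, a sketch of the intended recode, which maps the two branches to different outcomes (the commenter's version returned "survived" for both); `members_raw` is a placeholder for the raw data, where `died` is logical:

```r
library(dplyr)

members <- members_raw %>%
  mutate(died = case_when(died ~ "died",        # TRUE  -> "died"
                          TRUE ~ "survived"))   # FALSE -> "survived"
```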

  • @bradykile
    @bradykile 3 years ago

    Why do you use step_other in the model recipe for some factors instead of doing all of them at the beginning together?

    • @JuliaSilge
      @JuliaSilge  3 years ago

      The reason to use `step_other` would be that you want to estimate/train which factor levels to keep on your training data, understanding that in real life, you may get new levels or a different breakdown of levels on new data (which is what we use the testing data to stand in place for). Using a recipe step like `step_other` lets you be more careful and explicit in doing this on training data and accounting for what may happen with the testing data. If your particular situation (no new levels ever, no real changes in the distribution) is such that this doesn't matter much, then it doesn't make a real difference if you do this in the recipe vs. before the recipe at the beginning using something like `forcats::fct_lump`.
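A sketch of the recipe-based approach described above; the column names follow this dataset, and the threshold shown is recipes' default, spelled out for clarity:

```r
library(recipes)

# Collapse infrequent levels into "other"; which levels to keep is
# estimated from the training data when the recipe is prepped
members_recipe <- recipe(died ~ ., data = members_train) %>%
  step_other(peak_id, citizenship, threshold = 0.05)
```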

  • @mikhaeldito
    @mikhaeldito 3 years ago

    Why didn't you just use `fct_lump` instead of the `step_other`?

    • @JuliaSilge
      @JuliaSilge  3 years ago

      The reason to use `step_other` would be that you want to estimate/train which factor levels to keep on your training data, understanding that in real life, you may get new levels or a different breakdown of levels on new data (which is what we use the testing data to stand in place for). Using a recipe step like `step_other` lets you be more careful and explicit in doing this on training data and accounting for what may happen with the testing data. If your particular situation (no new levels ever, no real changes in the distribution) is such that this doesn't matter much, then it doesn't make a real difference if you do this in the recipe vs. before the recipe using `fct_lump`.

  • @RavinderRam
    @RavinderRam 3 years ago

    Please do a Kaggle project with tidytext or tidymodels... I'm waiting!