Predict childcare costs in US counties with xgboost and early stopping

  • Added on 19. 08. 2024

Comments • 20

  • @michaelmahoney3806 1 year ago +5

    I don't believe that I have ever watched one of your videos that I didn't come away with some new nugget. Thanks, Julia!

  • @hesamseraj 1 year ago +1

    As always, thank you for such a great screencast.

  • @wilrivera2987 1 year ago

    Dream job: to work at Posit.

  • @tofreddy 1 year ago

    I stumbled into your channel. Thank you for the teachable moment.

  • @carvalhoribeiro 1 year ago

    Very, very useful. Thank you so much, Julia!

  • @geralgariza7199 1 year ago

    nice work! well done!

  • @CaribouDataScience 1 year ago

    Thanks, that was interesting!

  • @djangoworldwide7925 1 year ago

    Hey, rsample::validation_set does not exist anymore. As of 24-06-2023 we can use validation_split/time_split/group_validation_split. I had a feeling it was validation_split anyway, but I wonder: maybe I should use the dev version?
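
On the question above: a minimal sketch, assuming rsample 1.2.0 or later, where to my knowledge `initial_validation_split()` and `validation_set()` are available and `validation_split()` is deprecated. Older releases may only offer `validation_split()`. The data frame `toy` and its columns are hypothetical stand-ins, not the childcare dataset from the video.

```r
library(rsample)

# Hypothetical toy data standing in for the real dataset
set.seed(234)
toy <- data.frame(x = rnorm(200), y = rnorm(200))

# One three-way split: 60% training, 20% validation, remaining 20% testing
toy_split <- initial_validation_split(toy, prop = c(0.6, 0.2))

toy_train <- training(toy_split)
toy_val   <- validation(toy_split)
toy_test  <- testing(toy_split)

# An rset with a single training/validation resample,
# in the form that tuning functions expect
toy_set <- validation_set(toy_split)
toy_set
```

If these functions are missing in your installation, checking `packageVersion("rsample")` is a reasonable first step.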

  • @anselmekouame1913 1 year ago

    Hi Julia, how might multicollinearity affect a machine learning model? If multicollinearity is found, should we remove variables that are highly correlated?

    • @JuliaSilge 1 year ago +4

      If you are using a linear model, correlated features can be a big problem! In cases like that, you would want to remove features that are highly correlated with other ones, or use something like PCA. Check out feature engineering approaches like these:
      recipes.tidymodels.org/reference/step_corr.html
      recipes.tidymodels.org/reference/step_pca.html
      Tree-based models tend to do OK with correlated features and it often doesn't really help to handle them in a special way. Just crank it on through the model!
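
As a rough illustration of the two recipe steps linked above, here is a minimal sketch on the built-in `mtcars` data. The dataset, the 0.8 correlation threshold, and the choice of 3 components are all assumptions for illustration and are not taken from the video.

```r
library(recipes)

# mtcars ships with R; several of its predictors (cyl, disp, hp, wt)
# are strongly correlated with each other

# Option 1: drop predictors until no pair exceeds the correlation threshold
rec_corr <- recipe(mpg ~ ., data = mtcars) |>
  step_corr(all_numeric_predictors(), threshold = 0.8)

# Option 2: replace the predictors with a few principal components
# (normalize first so no single variable dominates the components)
rec_pca <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

baked_corr <- rec_corr |> prep() |> bake(new_data = NULL)
baked_pca  <- rec_pca |> prep() |> bake(new_data = NULL)

names(baked_corr)  # the predictors that survived the filter, plus mpg
names(baked_pca)   # the principal components, plus mpg
```

Either recipe can then be added to a workflow in front of a linear model. For a tree-based model such as xgboost, the reply above suggests this extra step is often unnecessary.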

    • @anselmekouame1913 1 year ago

      @JuliaSilge Thanks a bunch.

  • @omoniyitemitope6113 5 months ago

    Hi, I have a dataset with 35 variables and want to run some regressions (RF, xgboost, etc.) on it. I am new to R and want to know if you have any special online training that I can register for?

    • @JuliaSilge 5 months ago +1

      I recommend that you work through this:
      www.tidymodels.org/start/
      And then take a look at this book:
      www.tmwr.org/
      Good luck!

    • @omoniyitemitope6113 5 months ago

      @JuliaSilge Thanks so much for your response. I followed one of your screencasts and got an rsq of 0.37 for the RF model. Is there anything I can do to improve the fit of my model?

    • @JuliaSilge 5 months ago

      @omoniyitemitope6113 This definitely depends on the specifics of your situation! I recommend that you check out a resource like *Tidy Modeling with R* for digging deeper on the model building process: www.tmwr.org/

    • @omoniyitemitope6113 5 months ago

      @JuliaSilge Thanks for your response. I will go through it. I did something whose statistical implications I don't understand: I took the log of my dependent variable and fit an RF, and to my surprise I got a % var explained of 99.74. This looks too good to be true to me.

  • @danielhallriggins9008 4 months ago

    Thanks Julia, love your videos! To get a more accurate sense of performance, would it be helpful to use {spatialsample} to account for spatial autocorrelation?

    • @JuliaSilge 4 months ago +1

      That would be a great thing to do! This dataset doesn't have explicitly spatial information in it (just county FIPS code) so you would need to join some spatial info together with the original dataset.
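
One way the join-then-resample idea above might look, as a sketch only. The `centroids` table, its made-up coordinates, and the `cost` column are hypothetical. A real analysis would need actual county geometries keyed by FIPS code from some external source.

```r
library(sf)
library(spatialsample)

set.seed(345)

# Hypothetical county centroids (made-up coordinates, not real counties)
centroids <- data.frame(
  county_fips_code = 1:30,
  lon = runif(30, min = -120, max = -75),
  lat = runif(30, min = 30, max = 45)
)

# Hypothetical modeling data: each county observed in 4 different years
toy <- data.frame(
  county_fips_code = rep(1:30, each = 4),
  cost = rnorm(120, mean = 100, sd = 20)
)

# Join the spatial info onto the modeling data by FIPS code,
# then convert the result to an sf object
toy_geo <- merge(toy, centroids, by = "county_fips_code")
toy_sf  <- st_as_sf(toy_geo, coords = c("lon", "lat"), crs = 4326)

# Cross-validation folds built from spatial clusters of counties
folds <- spatial_clustering_cv(toy_sf, v = 5)
folds
```

Because nearby counties land in the same fold, performance estimates from these resamples should be less affected by spatial autocorrelation than those from purely random folds.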

  • @konormccracken 1 year ago

    Always grateful for these videos! Though the grating little economist in me screamed a bit when you discounted the fixed effect of "county" here 🫥

    • @JuliaSilge 1 year ago

      Ah yep! The xgboost algorithm does not have the ability to incorporate fixed effects the way that a multilevel model does, say like those from multilevelmod:
      multilevelmod.tidymodels.org/
      However, we could still use a resampling approach that takes into account how a given county is in this dataset a bunch of times, to avoid overly optimistic performance estimates. We'd want to switch out `initial_split()` for `group_initial_split()` and `validation_split()` for `group_validation_split()`:
      rsample.tidymodels.org/reference/validation_split.html
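
A minimal sketch of the grouped resampling described above, using a hypothetical toy data frame in which each county appears several times. The column names `county_fips_code`, `study_year`, and `cost` are stand-ins chosen for illustration. Note that, as far as I know, rsample 1.2.0 and later deprecate `group_validation_split()` in favor of `group_initial_validation_split()`, so the last call may emit a deprecation warning.

```r
library(rsample)

# Hypothetical toy data: 50 counties, each observed in 10 different years
set.seed(123)
toy <- data.frame(
  county_fips_code = rep(1:50, each = 10),
  study_year = rep(2009:2018, times = 50),
  cost = rnorm(500, mean = 100, sd = 20)
)

# Keep every row of a given county on the same side of the split
toy_split <- group_initial_split(toy, group = county_fips_code)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Count the counties that appear in both training and testing
n_shared <- length(intersect(toy_train$county_fips_code,
                             toy_test$county_fips_code))
n_shared

# Grouped validation set carved out of the training data
toy_val <- group_validation_split(toy_train, group = county_fips_code)
toy_val
```

With a plain `initial_split()`, rows from the same county would usually land in both sets, which is the source of the overly optimistic estimates mentioned in the reply.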