
Lasso regression with tidymodels and The Office

  • Added 18 Aug 2024
  • Learn how to implement lasso regularized regression modeling in R using tidymodels and #TidyTuesday data on episodes of The Office.
    You can also check out David Robinson's screencast on this topic:
    • Tidy Tuesday screencas...
    As well as the code here on my blog:
    juliasilge.com...

Comments • 44

  • @RAPmastaGBLASCO63 · 4 years ago +5

    Every time I watch one of your videos I learn something new and become more confident in my modeling. Thank you so much for them!

  • @rrmaximiliano · 4 years ago +8

    Thanks, Julia for the video. Really interesting how you approached the cleaning and models in comparison to David. Pretty nice you keep making these videos. They are super helpful.

  • @iugaMovil · 3 years ago

    Great video Julia.
    It was a refresher for add_count and geom_col because I stopped using them for some reason.

  • @erickknackstedt3131 · 4 years ago +1

    Love it! Finding this channel has made my day.

  • @hesamseraj · 1 year ago

    I am reviewing all the videos and adding the three episode names as some sort of homework for myself.

  • @minhnguyenbui6827 · 4 years ago

    Oh wow, it's so amazing. I know you via the Text Mining with R book; finding David's channel and yours is a memorable milestone in my R learning process :D

  • @ethanthealien · 4 years ago

    This was fantastic! It got me really excited about tidymodels =)

  • @luisfernandobaldanfechio8958

    Thanks a lot, excellent material. I'm having a different response from the fitted workflow (@ 27:00). I'm receiving a tibble 31 x 3 with only one intercept, while yours is a tibble 1,563 x 5 with many intercepts. I copied/pasted the code as in your blog post.

    • @JuliaSilge · 3 years ago

      Ah, I believe there has been a change in parsnip since this video was published, so that you only get the lambda you actually specified, not the whole path of lambdas: github.com/tidymodels/parsnip/blob/master/NEWS.md#parsnip-013
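      A minimal sketch of the difference (my own toy example on mtcars, not the video's code, assuming a recent tidymodels where `extract_fit_engine()` is available): `tidy()` on the fitted workflow now returns coefficients only at the penalty you set, while tidying the underlying glmnet object still returns the whole regularization path.

```r
library(tidymodels)

lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

fitted_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(lasso_spec) %>%
  fit(data = mtcars)

# Coefficients at the single penalty you specified (one row per term)
tidy(fitted_wf)

# The full path of lambdas, straight from the underlying glmnet object
fitted_wf %>% extract_fit_engine() %>% tidy()
```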

  • @alexnoble17 · 4 years ago +2

    This is super interesting. I would love to do this analysis with Doctor Who (specifically New Who!)

  • @k.d0721 · 4 years ago

    you are the best, I should put your name in my PhD thesis

  • @brendang8610 · 4 years ago +2

    Awesome and informative video as always! I have a question and hope you can help clarify - I noticed when you did the bootstrap resampling you used office_train as the dataset, which is the unmodified training data. In another video (the hotel bookings one) you used the juiced recipe as the dataset when creating the monte carlo cross validation resamples. Is there a best practice on which dataset to use when resampling with tidymodels - the un-processed training data vs the pre-processed & juiced recipe data? Thanks!

    • @brendang8610 · 4 years ago

      oh! wait is it because here you're using a workflow() and in the hotel bookings video you weren't? and if so, is the workflow applying the recipe, prepping and juicing in the resampling step for you?

    • @JuliaSilge · 4 years ago +2

      @@brendang8610 Yes, that's basically it! A workflow that includes a recipe will apply that recipe. Generally it is probably better practice to do resampling on the unmodified training set, because otherwise you can get LEAKAGE from your preprocessing steps and then overly optimistic results from resampling.
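      The pattern described above can be sketched like this (a toy example on mtcars, not the video's code): the resamples are built from the raw training data, and the workflow carries the recipe, so preprocessing is re-estimated inside each resample.

```r
library(tidymodels)

set.seed(123)
folds <- bootstraps(mtcars, times = 10)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# The recipe is prepped on each analysis set separately, so the
# normalization statistics never leak across resamples
fit_resamples(wf, resamples = folds)
```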

  • @juliantagell1891 · 4 years ago +1

    Thanks Julia, this is great. Just got one question at 4:20.
    The other day I realised I can put pipes inside a mutate to get something like below... do you reckon using this is a good idea (I don't see it much but it feels really efficient)?
    transmute(episode_name = title %>%
                str_to_lower() %>%
                str_remove_all(remove_regex) %>%
                str_trim(),
              imdb_rating)

  • @AdrianaCastilloC · 1 year ago

    Julia, this is great!! It's so well explained (: ... Do you know by any chance how to do exactly this for spatial (polygon) data?

    • @JuliaSilge · 1 year ago

      You might check out the spatialsample package:
      spatialsample.tidymodels.org/
      And here is a blog post where I walk through how to use it:
      juliasilge.com/blog/drought-in-tx/
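      For reference, a minimal sketch of what that package's API looks like (using `boston_canopy`, an example sf polygon dataset that I believe ships with recent versions of spatialsample; substitute your own sf data frame otherwise):

```r
library(spatialsample)

set.seed(123)
# Build v spatially clustered cross-validation folds from polygon data
folds <- spatial_clustering_cv(boston_canopy, v = 5)
folds
```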

    • @AdrianaCastilloC · 1 year ago

      @@JuliaSilge oh, my god! This is GREATTTT!!! many many thanks!!

  • @hesamseraj · 3 years ago

    Amazing thank you very much.

  • @iqu3261 · 2 years ago

    Thanks so much Julia for the valuable videos. I'm trying to evaluate LDA topic modelling on tweets using NPMI - do you have an idea how to implement it in R? Thanks, Sam

  • @muttbane1072 · 4 years ago

    Great video! Love it!

  • @vincentpepe1064 · 4 years ago

    Hi Julia,
    Love the video! I was wondering how you would compare the accuracy of the model to the testing data? I need to submit a report with both the predicted and actual values and cannot seem to find it.

  • @TheFrankyguitar · 4 years ago

    Thanks for the great video Julia! I learned a lot. If we use a GLM, we might want to use a univariate filter to keep only relevant variables in the model since GLM's don't have built-in variable selection. Is there a way to do this with tidymodels? Maybe with recipes?

    • @JuliaSilge · 4 years ago +2

      Not currently, but we're interested in recipes supporting feature selection like that in the future!

    • @TheFrankyguitar · 4 years ago

      That's great! Thank you.

  • @hoschie211 · 4 years ago +1

    Very nice video! Well explained and above all: 30:18 :-)

  • @drinks3544 · 2 years ago

    What does the value used to indicate "importance" on the x-axis mean? Is that R^2?

    • @JuliaSilge · 2 years ago

      In the vip package, what "importance" is varies from model to model. You can look more at the documentation but for a linear model like a lasso regularized model, it is just literally the coefficients from the model itself (similar to coefficients from `lm()`). You can check out documentation for vip here:
      koalaverse.github.io/vip/
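      A small sketch of that (my own toy example, following the same pattern as the blog post): for a lasso fit, `vi()` on the glmnet object reports the coefficients at a given lambda as the importance scores.

```r
library(tidymodels)
library(vip)

lasso_fit <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)

# Importance = the linear model's coefficients at this penalty
lasso_fit %>%
  extract_fit_engine() %>%
  vi(lambda = 0.01)
```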

  • @mindlessgreen · 3 years ago

    Thanks for the nice tutorial. At 22:30, office_prep was created. What was that about? It was never used downstream. In general, I don't get the use of prep and bake.

    • @JuliaSilge · 3 years ago +1

      I think it *is* useful to know how to use `prep()` and `bake()` if you are going to be a tidymodels user, in order to debug and problem solve when things don't go right with your recipes. It's a way to check out how your recipe will preprocess your data for modeling. You can read about what the two functions do here: www.tmwr.org/recipes.html#using-recipes
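      A quick sketch of that debugging workflow (toy data, assuming a recent recipes where `bake(new_data = NULL)` returns the training set): `prep()` estimates the recipe's parameters from the training data, and `bake()` applies the trained recipe.

```r
library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())

rec_prep <- prep(rec, training = mtcars)  # estimate means and sds
bake(rec_prep, new_data = NULL)           # inspect the preprocessed training data
bake(rec_prep, new_data = head(mtcars))   # apply the trained recipe to new data
```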

  • @vladimirmijatovic883 · 1 year ago

    Hi @julia - great video!
    Funny thing - I tried tuning hyperparameters with two different values of trees. When I tune the model with trees = 100 and with trees = 1000, the order of variable importance changes. With trees = 100 the most important variable is mhi_2018, followed by one_race_a, while with trees = 1000 the most important variable is one_race_a (followed by mhi_2018).
    How is this possible? Where could this be coming from?

    • @JuliaSilge · 1 year ago

      I think you may be asking about a different video in this comment?
      But yes, maybe I should have been more clear that the variable importance I show is for *that model specifically*. The hyperparameters you choose for your algorithm often have an impact on variable importance. (And if you use variable importance to do feature selection, then that will change the hyperparameters you choose!) There is some related discussion here:
      stats.stackexchange.com/questions/264533/how-should-feature-selection-and-hyperparameter-optimization-be-ordered-in-the-m

    • @vladimirmijatovic883 · 1 year ago

      @@JuliaSilge OMG, how embarrassing :), indeed it is related to another video of yours.
      The question was about this video: czcams.com/video/OMn1WCNufo8/video.html (Predict Childcare Costs), but YouTube kept rolling to the next video while I was waiting for my model to be trained :).
      However, I was surprised that a hyperparameter such as the number of trees could impact the order of variable importance. I guess my intuition was wrong.

  • @ryankirk574 · 4 years ago

    What RStudio theme are you using? I could not find that in the default appearances.

    • @JuliaSilge · 4 years ago

      It's one of the themes from the rsthemes package: github.com/gadenbuie/rsthemes

    • @ryankirk574 · 4 years ago

      @@JuliaSilge Thank you for the quick reply! Watched and now reading through the blog explanation for further understanding.

  • @travisknoche5639 · 4 years ago

    Hi Julia, thanks for the video! I am getting the error: "All models failed in tune_grid(). See the `.notes` column." when running tune_grid(). My code is identical to yours and I'm also using a mac. Any ideas?

    • @travisknoche5639 · 4 years ago

      all of the .notes say "model 1/1 (predictions): Error in cbind2(1, newx) %*% nbeta: invalid class 'NA' to dup_mMatrix_as_dgeMatrix"

    • @JuliaSilge · 4 years ago

      @@travisknoche5639 Is this using the same code/data as in my blog post? juliasilge.com/blog/lasso-the-office/ Or different data?

    • @travisknoche5639 · 4 years ago

      @@JuliaSilge Yep!

    • @JuliaSilge · 4 years ago

      @@travisknoche5639 Does the first fit work, when you are not tuning?