Multinomial classification with tidymodels and volcano eruptions

  • Added 18 Aug 2024
  • If you have three or more categories in a classification outcome, you need to build a multiclass or multinomial classification model. Watch along to see how to do this in R using tidymodels and #TidyTuesday data on volcano eruptions!
    Check out the code on my blog: juliasilge.com...
  • Science & Technology

Comments • 35

  • @pablotercero4860 4 years ago +1

    Amazing!! Thanks for sharing. I learn something incredibly useful every time, even tips and tricks.

  • @haraldurkarlsson1147 1 year ago

    With regard to the tectonic settings, it would be best to simply lump together all intraplate, all rift zone, and all subduction settings to get a three-level factor. Another approach is to group by crust into the categories oceanic, continental, and intermediate crust. I think this would be better than simply tossing stuff.

  • @haraldurkarlsson1147 1 year ago

    As a side comment: I worked on a project as an undergraduate to determine the type of volcano that had most likely produced a particular mix of rock types. Based on this work (around 1977), we determined whether a particular mix of rock samples dredged off the coast of Iceland originated from a central volcano or not (there was also gravity data and possibly paleomagnetic data).

  • @hesamseraj 3 years ago

    Thank you very much Julia.

  • @haraldurkarlsson1147 3 years ago

    Very interesting project! I must say, however, as a geologist, that I would have been surprised if the data correlated with latitude and longitude. Volcano types are mostly linked to their tectonic settings. Shield volcanoes are almost exclusively linked to oceanic settings and hotspots such as Iceland or Hawaii and are dominated by basalts. Stratovolcanoes, on the other hand, are typically linked with andesite (and rhyolite) and are found around subduction zones. Most active volcanoes link up with plate boundaries, and those boundaries have no relation to latitude or longitude.
    When I was an undergrad in Iceland I worked on volcanic rocks dredged off the seafloor near Iceland. My task was to identify the volcano type they were associated with, based on the mix of rock types we collected at each site. I would have loved to have access to the tools you are using here, but alas those did not exist. It would have been much easier to infer the origin of these rocks.

  • @jonathanjayes 4 years ago

    Thank you Julia! This was fascinating!

  • @lukasputtmann3590 4 years ago

    I really enjoyed this video! Thanks a lot.

  • @user-ld6rv4gu2t 4 years ago

    Thanks for the tutorial. Great.

  • @UsmanKhaliq10 4 years ago

    thanks! this was a pretty cool tutorial.

  • @foobar4275 3 years ago

    @Julia: In the volcano_rec recipe I think there is a mistake. Minute mark ~21 - EDIT: I thought there was a mistake but it turns out there is no mistake, just a different way to handle a feature matrix with continuous and dummy variables.
    The issue is step_zv and step_normalize on all_predictors after creating dummy variables.
    In the recipe, dummy variables are created for tectonic_settings and major_rock_1. Then, all variables are passed to steps zero variance and normalization. I ran a quick simulation on my personal machine and the recipe as written would calculate the variance for the previously created dummy variables and standardize the dummy variables. EDIT: I thought that binary variables shouldn't be standardized but apparently there is some literature that suggests binary variables should be standardized (Tibshirani) or how to standardize continuous variables to approximate the scale of a [one-hot encoded] binary variable.
    I haven't finished the video yet so if you go back and correct this, I apologize. Otherwise, others be warned, those steps are wrong. One solution would be to do the step_zv and step_normalize before the dummy step as step_zv(all_numeric_predictors()) and step_normalize(all_numeric_predictors()). I've tested this and it works.

    • @JuliaSilge 3 years ago

      Well, it's not necessarily a "no-no" to center and scale dummy variables:
      community.rstudio.com/t/should-i-center-scale-dummy-variables/43212

    • @foobar4275 3 years ago

      @JuliaSilge Thank you for sharing the link! =D I wasn't aware of Tibshirani's or Gelman's views on standardizing binary variables.
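
The two step orderings discussed in this thread can be compared side by side. What follows is a minimal sketch, not the recipe from the video: the `toy` data frame and its column names are invented for illustration, and only the step order mirrors the discussion. It shows that normalizing after `step_dummy()` also rescales the dummy columns, while normalizing numeric predictors first leaves the dummies as 0/1.

```r
library(recipes)

# Toy data invented for this sketch (not the volcano dataset)
toy <- data.frame(
  outcome   = factor(c("a", "b", "a", "b", "a", "b")),
  elevation = c(100, 250, 300, 120, 500, 410),
  setting   = factor(c("rift", "subduction", "rift",
                       "intraplate", "subduction", "rift"))
)

# Ordering A: create dummies first, then filter and normalize every predictor.
# The dummy columns are centered and scaled along with `elevation`.
rec_all <- recipe(outcome ~ ., data = toy) |>
  step_dummy(setting) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

# Ordering B: filter and normalize the numeric predictors first, then create
# dummies. The dummy columns stay as 0/1 indicators.
rec_numeric_first <- recipe(outcome ~ ., data = toy) |>
  step_zv(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(setting)

baked_all <- bake(prep(rec_all), new_data = NULL)
baked_numeric_first <- bake(prep(rec_numeric_first), new_data = NULL)

# Compare the same dummy column under both orderings
print(baked_all$setting_rift)
print(baked_numeric_first$setting_rift)
```

Neither ordering is wrong in itself, as the replies above note; which one is preferable depends on whether the model benefits from putting indicators on the same scale as the continuous predictors.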

  • @christopheraloo5121 4 years ago

    Had a feeling population within kilometres would make for a good predictor, since different types of volcanoes have different amounts of footprint (used loosely).

  • @taiwankyh 4 years ago

    You suggest a good article for multi-classification; could you please spell the author or give the hyperlink? Thanks

  • @haraldurkarlsson1147 3 years ago

    Julia,
    I am enjoying your videos tremendously. Currently I am focusing on the tidymodels ones. Do you have a suggestion for which order they should be watched in, or are they each stand-alone?
    Thanks
    P.S. I have used skimr for some time, but recently it has stopped working. I have updated the version but no change. Any ideas?

    • @JuliaSilge 3 years ago

      I unfortunately haven't invested time at this point in putting the videos "in order"; they do vary in how advanced they are and I have tried to note in the descriptions which ones are better for folks just starting out with tidymodels. Sorry about that! They have been made somewhat organically week by week using Tidy Tuesday data.
      I haven't had any problems with skimr lately, but if you can create a reprex showing the problem, I'm sure the maintainers would be happy to see what is happening: github.com/ropensci/skimr/issues

    • @haraldurkarlsson1147 3 years ago

      @JuliaSilge Thanks - I understand. I do love the holistic approach, though, of working through a project from beginning to end. That to me has been my main issue with places like DataCamp, where you see more bite-sized projects.

  • @haraldurkarlsson1147 3 years ago

    Does the step_zv remove variables with perfect correlation? Possible confounding variables?

    • @JuliaSilge 3 years ago

      No, just those with zero variance: recipes.tidymodels.org/reference/step_zv.html
      You can filter out variables that are highly correlated with step_corr(): recipes.tidymodels.org/reference/step_corr.html

    • @haraldurkarlsson1147 3 years ago

      Ah - thanks
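
The difference between the two steps named in the reply above can be seen on a small example. This is a sketch on made-up data (the `toy` columns are hypothetical, and the 0.9 threshold is chosen for illustration): `step_zv()` drops only the constant column, and `step_corr()` then drops one column of the perfectly correlated pair.

```r
library(recipes)

# Toy data invented for this sketch: one constant column and one column
# (`x_copy`) that is perfectly correlated with `x`
toy <- data.frame(
  outcome  = factor(c("a", "b", "a", "b", "a", "b")),
  constant = rep(1, 6),
  x        = c(1, 2, 3, 4, 5, 6),
  x_copy   = 2 * c(1, 2, 3, 4, 5, 6),
  z        = c(3, 1, 4, 1, 5, 9)
)

# step_zv() alone: removes `constant`, keeps both `x` and `x_copy`
rec_zv <- recipe(outcome ~ ., data = toy) |>
  step_zv(all_predictors())

# Adding step_corr(): also removes one of the highly correlated columns
rec_corr <- rec_zv |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

cols_zv <- names(bake(prep(rec_zv), new_data = NULL))
cols_corr <- names(bake(prep(rec_corr), new_data = NULL))

print(cols_zv)
print(cols_corr)
```

Note that neither step addresses confounding in the causal sense; both are purely about the observed predictor columns.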

  • @clarkevansteenderen7827

    Thank you for this awesome tutorial!!
    Does one only subset data into training and testing sets if there is a lot of data available? Or how do you decide whether to do that, or to just use bootstrapping on the original data as a whole, as you did in this example?

    • @JuliaSilge 4 years ago +1

      Almost *always* you want to split into training/testing; this is the most important step in empirical model validation. The only time when you might not want to do this is when the available data is "pathologically" small, like this dataset of volcanoes.
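
The two approaches contrasted in this exchange look like this with rsample. This is a minimal sketch on simulated data; the `toy` data frame, the 75% proportion, and the 25 resamples are assumptions for illustration, not values taken from the video.

```r
library(rsample)

set.seed(123)

# Simulated data invented for this sketch
toy <- data.frame(id = 1:100, y = rnorm(100))

# Usual approach: hold out a test set, then resample only the training set
split <- initial_split(toy, prop = 0.75)
train <- training(split)
test  <- testing(split)

# Very small data: skip the split and bootstrap the whole dataset instead.
# Each resample draws rows with replacement (the analysis set); the rows
# not drawn form the assessment set used to estimate performance.
boots <- bootstraps(toy, times = 25)
first_analysis   <- analysis(boots$splits[[1]])
first_assessment <- assessment(boots$splits[[1]])

print(c(train = nrow(train), test = nrow(test)))
print(c(analysis = nrow(first_analysis), assessment = nrow(first_assessment)))
```

With bootstrapping only, there is no untouched test set left, so the resampled estimates are the only measure of performance available.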

  • @renanxcortes2 4 years ago

    Very cool video! Very didactic and informative! I wonder where in the code of tidymodels (or its dependencies) the predicted probabilities are corrected for the resampling strategy that the user chooses (for example, oversampling some of the minority categories), similar to what is explained here: www.knime.com/blog/correcting-predicted-class-probabilities-in-imbalanced-datasets. Also, I think the metric was good, wasn't it? Because in this case the "naive guessing" would be 33.33% probability and not 50%, therefore an AUC higher than 60% is already good, isn't it? Thank you so much again for posting this video!
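
On the baseline question above: a 33.33% floor applies to accuracy with three balanced classes, whereas a pairwise-averaged multiclass AUC still sits near 0.5 for uninformative predictions. The sketch below checks this on simulated data, assuming the AUC in question is the one `yardstick::roc_auc()` computes by default for multiclass outcomes (the Hand and Till estimator). The class labels, sample size, and random scores are invented for illustration.

```r
library(yardstick)

set.seed(1)
n <- 3000

# Three balanced classes with labels invented for this sketch
truth <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

# Uninformative "predicted probabilities": random scores normalized so that
# each row sums to 1, generated independently of the true class
raw   <- matrix(runif(n * 3), ncol = 3)
probs <- raw / rowSums(raw)

preds <- data.frame(truth = truth, a = probs[, 1], b = probs[, 2], c = probs[, 3])

# Probability columns are passed in the same order as the factor levels
random_auc <- roc_auc(preds, truth, a, b, c)
print(random_auc)
```

Under that assumption, an AUC of 0.6 should be read against a baseline of roughly 0.5, not 0.33.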

  • @flamboyantperson5936 4 years ago

    You are amazing. Could you please recommend someone like you who makes videos in Python? It would be of great help.

    • @JuliaSilge 4 years ago +1

      I really like Rachael Tatman's livestreams: www.twitch.tv/rctatman

    • @flamboyantperson5936 4 years ago

      @JuliaSilge Thank you so much.

    • @flamboyantperson5936 4 years ago

      @JuliaSilge Can I add you on Facebook?

    • @JuliaSilge 4 years ago +1

      @flamboyantperson5936 HA well I'm not on Facebook, actually.

    • @flamboyantperson5936 4 years ago

      @JuliaSilge No problem. There is a lot to learn from you, but unfortunately my company is not working in R. I wish you could share the same knowledge in Python. You are very, very talented.