Julia Silge
Mapping change in United States polling places
We observed Martin Luther King Day in the US this week and this week’s #TidyTuesday dataset focuses on polling places in honor of King’s work on voting rights. In this screencast, let’s use summarization and visualization to understand how the numbers of polling places in the US have changed. Check out the code on my blog: juliasilge.com/blog/polling-places
1,696 views


Empirical Bayes for Doctor Who episodes
2.8K views · 8 months ago
This week’s #TidyTuesday dataset is all about Doctor Who, celebrating the upcoming new episodes, and this screencast walks through how to use empirical Bayes to estimate ratings for different episode writers. The great thing about empirical Bayes is we can take into account the number of episodes each writer wrote. Check out the code on my blog: juliasilge.com/blog/doctor-who-bayes
Logistic regression for US House election vote share
2.8K views · 9 months ago
Today is Election Day and this week’s #TidyTuesday dataset is about elections for the US House of Representatives. This screencast demonstrates how to use logistic regression to understand vote share in these elections, highlighting how to use visualization for model interpretability and a matrix syntax for your model’s outcome (a good fit when you have proportion data). Check out the code on m...
Topic modeling for Taylor Swift Eras
3.2K views · 9 months ago
Last week’s #TidyTuesday dataset was about the songs of Taylor Swift, and this screencast demonstrates how to use topic modeling to learn how the text content of Taylor Swift’s work has changed through all her musical eras. Check out the code on my blog: juliasilge.com/blog/taylor-swift
Weighted log odds ratios for haunted places in the US
1.6K views · 10 months ago
It’s getting to be spooky season and this week’s #TidyTuesday dataset is about haunted locations in the United States. In this screencast, let’s use log odds ratios weighted via empirical Bayes to understand which US states are more likely to have haunted cemeteries and which are more likely to have haunted schools. Check out the code on my blog: juliasilge.com/blog/haunted-places
Bootstrap confidence intervals for how often Roy Kent says “F*CK”
2.8K views · 10 months ago
He's here, he's there, he's every f*cking where, and in this screencast, we use Poisson regression and bootstrap resampling to find confidence intervals for when Roy Kent uses colorful language more or less on the TV show Ted Lasso. This #TidyTuesday dataset was created by Deepsha Menghani for her recent talk at posit::conf. Check out the code on my blog: juliasilge.com/blog/roy-kent
Evaluate multiple ML approaches for spam detection
3.1K views · 11 months ago
In this screencast, we use tidymodels workflowsets to try out multiple modeling approaches for a #TidyTuesday dataset on spam email. We finish off with how to create a deployable model object and set up an API for our model. Check out the code on my blog: juliasilge.com/blog/spam-email
Evaluate the performance of GPT detectors
1.9K views · a year ago
This week’s #TidyTuesday is about detecting output from GPT language models, and specifically how these detectors perform differently for native and non-native English writers. In this screencast, learn how to evaluate classification models with tidymodels, using either predicted classes or predicted probabilities. Check out the code on my blog: juliasilge.com/blog/gpt-detectors
Byte pair encoding tokenization for geographical place names
2.1K views · a year ago
A recent #TidyTuesday makes available geographical place names in the US, and we can explore these names as text data. In this screencast, learn how to use byte pair encoding tokenization together with Poisson regression to find out which kinds of names are used more often and which are used less often. Check out the code on my blog: juliasilge.com/blog/place-names
Use xgboost and effect encodings to model tornadoes
3.7K views · a year ago
This week’s #TidyTuesday is about tornadoes in the US, and it provides a great opportunity to think about how we formulate a modeling approach in challenging circumstances. In this screencast, learn how to use xgboost with racing and effect encodings to predict the magnitude of tornadoes. Check out the code on my blog: juliasilge.com/blog/tornadoes
Predict childcare costs in US counties with xgboost and early stopping
3.5K views · a year ago
Mother’s Day is coming up this weekend and this week’s #TidyTuesday is about childcare costs in the US. In this screencast, learn how to use xgboost with early stopping to predict the cost of childcare from other characteristics of a county, like demographics and women’s earnings. Check out the code on my blog: juliasilge.com/blog/childcare-costs
Deploy a model on AWS SageMaker with vetiver
2.7K views · a year ago
AWS SageMaker is a fully managed machine learning service that lots of organizations use for their ML tasks, but it hasn’t always been easy for R users to go about their work on this platform. In this screencast, learn how to train and deploy a model with R and vetiver on SageMaker infrastructure. Check out the code on my blog: juliasilge.com/blog/vetiver-sagemaker
Use OpenAI text embeddings for horror movie descriptions
4.2K views · a year ago
High quality text embeddings are becoming more available from companies like OpenAI. This screencast walks through how to obtain and use such embeddings for #TidyTuesday data on horror movies. It’s always important to understand the limitations of text embeddings, such as how they reflect social biases. Check out the code on my blog: juliasilge.com/blog/horror-embeddings
Resampling to understand gender in art history textbooks
3K views · a year ago
Artists who are women have been underrepresented both in how art is displayed and studied, and we can use resampling together with #TidyTuesday data on art history textbooks to robustly understand more about this imbalance. Check out the code on my blog: juliasilge.com/blog/art-history
To downsample or not? Handling class imbalance in bird feeder observations
2.8K views · a year ago
Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders, both with and without downsampling, to find out. Check out the code on my blog: juliasilge.com/blog/project-feederwatch
How to handle high cardinality predictors for data on museums in the UK
6K views · a year ago
Find high FREX and high lift words in Stranger Things dialogue
2.6K views · a year ago
Deploy different prediction types for a Bigfoot sighting model
2.7K views · a year ago
Deploy a model for LEGO sets with Docker
4.2K views · a year ago
Sliding window aggregation for rents in San Francisco
3.2K views · 2 years ago
Understand the gender pay gap three ways
5K views · 2 years ago
Spatial resampling to understand drought in Texas
4.2K views · 2 years ago
Predict NYT bestsellers with wordpiece tokenization
2.7K views · 2 years ago
Handling coefficients for modeling collegiate sports expenditures
3K views · 2 years ago
Poisson regression with tidymodels for package vignette counts
4.2K views · 2 years ago
Statistical inference for aircraft and rank of Tuskegee airmen
3.1K views · 2 years ago
Feature engineering & interpretability for xgboost with board game ratings
7K views · 2 years ago
Predict ratings for chocolate with tidymodels
5K views · 2 years ago
Topic modeling for Spice Girls lyrics
5K views · 2 years ago
Predicting viewership for Doctor Who episodes
3.2K views · 2 years ago

Comments

  • @enicay7562 · 1 day ago

    Thank you

  • @user-sb9oc3bm7u · 6 days ago

    Would be amazing if you did a video using nested data (instead of having a nominal variable, nest it and generate a model for each of the levels, for example), also using map_workflow etc. Great as always!

  • @rafaelcallejo8367 · 19 days ago

    Good morning. Your videos are excellent; just one request for future videos: could you zoom the camera in more on the code, since it looks very small? Apologies for the suggestion.

  • @dinohadjiyannis3225 · 2 months ago

    Julia, if I'm using a topic model on YouTube comments to determine which video best explains topic modeling, how can I decide if your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?

    • @JuliaSilge · 2 months ago

      HAHA I can't tell if this is serious or not 🙈 In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
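      For anyone wanting to try this, here is a minimal sketch of the comparison described above, on toy data (the video IDs, words, and counts below are invented for illustration; the real input would be one concatenated comment document per video):

      ```r
      library(topicmodels)
      library(tidytext)
      library(dplyr)

      # Toy stand-in: word counts per video, one "document" per video ID
      comment_counts <- tibble(
        video_id = c("vid_a", "vid_a", "vid_b", "vid_b"),
        word     = c("topic", "model", "ggplot", "map"),
        n        = c(5L, 3L, 4L, 2L)
      )

      comment_dtm <- cast_dtm(comment_counts, video_id, word, n)
      lda_fit <- LDA(comment_dtm, k = 2, control = list(seed = 123))

      # Per-document topic proportions (gamma): rank videos by how much
      # of their comment text is estimated to belong to, say, topic 1
      tidy(lda_fit, matrix = "gamma") |>
        filter(topic == 1) |>
        arrange(desc(gamma))
      ```

      The video with the highest gamma for the "topic modeling" topic would be the one whose comments are most dominated by that topic.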

    • @dinohadjiyannis3225 · 2 months ago

      @@JuliaSilge If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which gives you the top word probabilities), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But wait: if I then manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it. Does this make sense, or am I misinterpreting gamma/beta? ***Assume I have concatenated all comments into one corpus; each corpus is linked to a video ID.

    • @JuliaSilge · 2 months ago

      @@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍

    • @dinohadjiyannis3225 · 2 months ago

      @@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇

  • @rosiedavies7708 · 2 months ago

    does this work in the same way with regression problems?

    • @rosiedavies7708 · 2 months ago

      also thanks for this video, its very helpful and clear

    • @JuliaSilge · 2 months ago

      Yep, you would use `set_mode("regression")` in that case
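      As a small runnable sketch of that switch (the model type and engine here are illustrative, not taken from the video):

      ```r
      library(parsnip)

      # The same spec family you might use for classification,
      # switched to regression with set_mode()
      rf_spec <- rand_forest(trees = 500) |>
        set_mode("regression") |>
        set_engine("ranger")

      rf_spec
      ```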

  • @mxm8900 · 2 months ago

    Wow great video. I have nothing to do with text analysis, but I still watched the whole video

  • @smomar · 2 months ago

    All Hail the Dino! Now quickly get it some food, or else ... Thanks for the video. It was very informative.

  • @andreacierno4642 · 3 months ago

    Thank you Julia. Can this work if my version of 'type' has 5-8 categories, where the final output is "more like 'X'" for each category label? Is there a way to get more words in each prediction fold, so in the final output it could show 3 words for each "more like"? Thank you, again.

    • @JuliaSilge · 3 months ago

      I recommend that you check out this chapter of my book with Emil Hvitfeldt: smltar.com/mlclassification#mlmulticlass

    • @andreacierno4642 · 3 months ago

      @@JuliaSilge Will do and thank you.

  • @emredunder9108 · 3 months ago

    You are the queen of data analysis. Thanks for the video!

  • @kevingiang · 3 months ago

    Hi @JuliaSilge - thanks for your wonderful and helpful videos. I am trying to replicate your code with my own dataset and I get the following error when trying to initiate the tuning of the model:

        xgb_rs <-
          tune_race_anova(
            object = xgb_wf,
            resamples = dens_folds,
            grid = 15,
            control = control_race(verbose_elim = TRUE)
          )
        ℹ Evaluating against the initial 3 burn-in resamples.
        i Creating pre-processing data to finalize unknown parameter: mtry
        Error in `tune::tune_grid()`:
        ! Package install is required for xgboost.
        Run `rlang::last_trace()` to see where the error occurred.

    It says that a package install is required. Any idea what package may be missing? I installed the 'tune' package and it still gives me the same error. Any thoughts are appreciated. Thanks, Kevin

    • @JuliaSilge · 3 months ago

      It's the xgboost package that needs to be installed: CRAN.R-project.org/package=xgboost

    • @kevingiang · 3 months ago

      @@JuliaSilge I just figured it out... thanks much for answering back! You rock!

  • @gsonbiswas9765 · 3 months ago

    Nice explanation. You could have used the searchK() function to show us how to select the range for K.

  • @deltax7159 · 3 months ago

    What appearance theme are you using here?

    • @JuliaSilge · 3 months ago

      I use one of the themes from rsthemes: www.garrickadenbuie.com/project/rsthemes/ I think Oceanic Plus? There are lots of nice ones available in that package.

  • @danielhallriggins9008 · 4 months ago

    Thanks Julia, love your videos! To get a more accurate sense of performance, would it be helpful to use {spatialsample} to account for spatial autocorrelation?

    • @JuliaSilge · 4 months ago

      That would be a great thing to do! This dataset doesn't have explicitly spatial information in it (just county FIPS code) so you would need to join some spatial info together with the original dataset.

  • @AnkeetSingh-gt9fm · 4 months ago

    Hey Julia, great tutorial. I had a question. Here you used Subject_Matter as the only high cardinality variable. If we have a dataset where there are multiple columns with high cardinality, can the recipe method be used in such a case for all the high cardinality columns?

    • @JuliaSilge · 4 months ago

      Yes, you sure can! You will need to keep in mind how much data you have vs. how many predictors you are trying to encode in this way, and definitely keep in mind that you are using the **outcome** in your feature engineering. You can read more here: www.tmwr.org/categorical

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Great I’ll keep that in mind. Thank you!

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Hi, I had another question with regards to my previous question. For each column, would we have to define a separate recipe? And while creating the workflow, how would you add the recipes for multiple columns in the workflow(since workflow only allows one recipe)? I was unable to find resources for this online. Any help would be appreciated!

    • @JuliaSilge · 4 months ago

      @@AnkeetSingh-gt9fm Oh, you don't need a separate recipe for different columns, just separate steps. So you could do `step_lencode_glm()` then pipe to another `step_lencode_glm()`, etc.
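      A minimal sketch of that chaining on toy data (the data frame and column names below are invented; this assumes the {embed} package, which provides `step_lencode_glm()`):

      ```r
      library(recipes)
      library(embed)

      set.seed(123)
      # Toy stand-in for the museums data, with two nominal predictors to encode
      museum_df <- data.frame(
        outcome = factor(sample(c("open", "closed"), 200, replace = TRUE)),
        subject = sample(letters, 200, replace = TRUE),
        region  = sample(c("north", "south", "east"), 200, replace = TRUE)
      )

      # One recipe, one lencode step per high-cardinality column
      rec <- recipe(outcome ~ ., data = museum_df) |>
        step_lencode_glm(subject, outcome = vars(outcome)) |>
        step_lencode_glm(region, outcome = vars(outcome))

      # Both columns come out as numeric effect encodings
      prep(rec) |> bake(new_data = NULL)
      ```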

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Thank you that’s what I figured and ran the code. I received an error: Error in - dsy2dpoC.Msymfrom) : not a positive definite matrix (and positive semidefiniteness is not checked), looks like I need to assess some variables in my model. You are very helpful with your prompt replies, I really appreciate it. Thank you!

  • @olexiypukhov-KT · 4 months ago

    I always love your videos Julia! I learn so much every time. Thank you for all the screencasts! Hopefully you haven't stopped, and I am looking forward to more!

  • @Ejnota · 5 months ago

    With a left join you keep all the rows

  • @cb5231 · 5 months ago

    thanks for this video Julia <3

  • @omoniyitemitope6113 · 5 months ago

    Hi, I have a dataset with 35 variables and want to run some regressions (RF, xgboost, etc.) on it. I am new to R and want to know if you have any special online training that I can register for?

    • @JuliaSilge · 5 months ago

      I recommend that you work through this: www.tidymodels.org/start/ And then take a look at this book: www.tmwr.org/ Good luck!

    • @omoniyitemitope6113 · 5 months ago

      @@JuliaSilge Thanks so much for your response. I followed one of your screencasts and got an rsq of 0.37 for the RF model. Is there anything I can do to improve the fit of my model?

    • @JuliaSilge · 5 months ago

      @@omoniyitemitope6113This definitely depends on the specifics of your situation! I recommend that you check out a resource like *Tidy Modeling with R* for digging deeper on the model building process: www.tmwr.org/

    • @omoniyitemitope6113 · 5 months ago

      @@JuliaSilge Thanks for your response. I will go through it. I did something without knowing its statistical implication: I took the log of my dependent variable and performed an RF, and to my surprise I got % var explained of 99.74, which looks too good to be true to me.

  • @mohamedhany2513 · 5 months ago

    Could you make a video explaining how to deploy a model with Shiny?

    • @JuliaSilge · 5 months ago

      You may find this demo from Posit Solutions Engineering helpful: solutions.posit.co/gallery/bike_predict/

  • @zapbesttowatch2660 · 6 months ago

    good explanation

  • @eileenmurphy7044 · 6 months ago

    Thank you Julia for another excellent video. I have been trying to replicate your methods of generating a Canadian Provincial Map with the outlines of the provinces. For some reason map_data doesn't include borders to provinces the way that the states is set up. Do you have any ideas, how I can use map_data to do this? I am using map_data("world", "Canada") to get the provinces.

    • @eileenmurphy7044 · 6 months ago

      Hi Julia, Just answered my own question. The R package mapcan works like map_data - except it includes Canadian provincial borders.

    • @JuliaSilge · 6 months ago

      @@eileenmurphy7044 Ah, that is good to hear! 🙌

  • @nabereon · 6 months ago

    This channel is pure gold.

  • @teorems · 7 months ago

    Bitters are good for health!

  • @G2Mexpert · 7 months ago

    After recently having hacked my way through a similar attempt to visualize data by state, this was like receiving "the answer sheet" from your teacher. Really appreciate the smart use of usmap information, building the tibble to facilitate the join of state abbr to state(lower), and the ever-useful window commands in dplyr!

  • @Y45HV1N · 7 months ago

    (I'm very very new to all this) Getting the prior and the empirical/posterior from the same data seems counterintuitive to me and a bit like confirmation bias.

    • @JuliaSilge · 7 months ago

      If you'd like some more conceptual background and theory behind this, I recommend the writing that Bradley Efron has done on it, the 1985 paper by Casella, and, for a more practical approach, the book _Introduction to Empirical Bayes: Examples from Baseball Statistics_ by my collaborator David Robinson.

  • @CaribouDataScience · 7 months ago

    Thanks, another interesting video.

  • @angvl8793 · 7 months ago

    We see Julia we click Like !! :)

  • @KK-tt5jz · 7 months ago

    brilliant work, thank you Julia!

  • @rayflyers · 7 months ago

    I've been watching your text mining videos recently. I'm about to start a project mining case notes for info on clients' relatives. I'm hoping to build a model that can predict if a case note contains that info or not. Any tips would be appreciated!

    • @JuliaSilge · 7 months ago

      That sounds like it may be doable! Overall, I recommend this book I wrote with Emil for advice on predictive modeling for text data: smltar.com/

  • @elvinceager · 7 months ago

    Yes, a new video! It's been a while. Love this content.

  • @trevorschrotz · 7 months ago

    Excellent, thank you.

  • @Adeyeye_seyison · 7 months ago

    As usual, brilliant work from a brilliant data scientist...

  • @forheuristiclifeksh7836 · 7 months ago

    0:08

  • @wapsyed · 7 months ago

    Your videos are therapeutic haha

  • @mocabeentrill · 7 months ago

    Clearly explained and direct to the point! Thank you Julia.

  • @trevorschrotz · 7 months ago

    Thanks for the tip on using type.predict = "response" in the broom::augment function. I learn something new from each of your videos, so thanks for the work that you put into making these.

  • @ColonelHathi · 7 months ago

    YES! This is evidence supporting my hypothesis that Jodie Whittaker is a great actress, but Chris Chibnall is a terrible writer 😅. It isn't conclusive obviously. We would need observations from when they worked separately to get better proof.

  • @reshmilb2527 · 7 months ago

    Please avoid a black background colour; use an eyesight-friendly colour.

  • @Jakan-sf3xj · 8 months ago

    Thank you for the great video. I have one question: assuming the best model was one of the tuned random forest models, how would we extract the parsnip object to see the tuned hyperparameters, i.e. mtry and min_n?

    • @JuliaSilge · 8 months ago

      You might check out the different "extract" functions in tidymodels. You can do `extract_fit_parsnip()` but you can also do `extract_parameter_set_dials()` to get the hyperparameters directly: hardhat.tidymodels.org/reference/hardhat-extract.html
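      As a small runnable illustration of `extract_fit_parsnip()` (a plain fitted workflow on built-in data, not the tuned random forest from the video; for tuned results you would first use `select_best()` and `finalize_workflow()` before fitting):

      ```r
      library(tidymodels)

      # A small fitted workflow standing in for the final model
      wf_fit <- workflow() |>
        add_formula(mpg ~ wt + hp) |>
        add_model(linear_reg()) |>
        fit(data = mtcars)

      # Pull out the underlying parsnip model object
      extract_fit_parsnip(wf_fit)
      ```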

  • @user-sb9oc3bm7u · 8 months ago

    Hey Julia. Writing you here although it's not related to this specific video. I am using `tidylo::bind_log_odds()` for a project, but the fact that the order of set/feature is different from the one in `tidytext::bind_tf_idf()` (which requires term and then document) makes it hard to easily encode it for Shiny (with `f <- type_of_algorithm; f(docs/sets, terms/fts, n = n)`, for example). Any chance you change the tidylo order? Obviously it can be done with a simple `if () {} else {}`, but it's much cleaner to use the `f <- select_algo` approach :)

    • @JuliaSilge · 8 months ago

      Can you open an issue over at tidylo with an example/reprex showing what you mean? github.com/juliasilge/tidylo/issues

  • @Jackeeba · 8 months ago

    The extra brackets from Copilot can be really annoying! When I turn it on, I've started to use Enter for RStudio's autocomplete (Tab just takes Copilot's often incorrect suggestion). Does anyone have a better solution?

    • @JuliaSilge · 8 months ago

      Looks like the RStudio team is tracking the extra parentheses here: github.com/rstudio/rstudio/issues/13953 Feel free to thumbs up or add additional detail!

  • @manueltiburtini6528 · 8 months ago

    Amazing analysis! :O

  • @dantshisungu395 · 8 months ago

    Great episode as always. Just a little whim from my side: would you mind doing videos about real-life applications of TDA? And how can we implement Spark for models that aren't in parsnip? Thank you 😅

  • @matthewcarter1624 · 8 months ago

    Hey Julia, thanks for your video! I really like these. I had one question about the std_var per writer: why did you divide it by the number of observations?

    • @ariskoitsanos607 · 8 months ago

      Hey, hi! It's because the standard error of the average is sigma/sqrt(n), so the variance of the average would be sigma^2/n.
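      Spelled out (with σ the per-episode standard deviation and n the number of episodes a writer wrote):

      ```latex
      \mathrm{SE}(\bar{x}) \;=\; \frac{\sigma}{\sqrt{n}}
      \qquad\Longrightarrow\qquad
      \operatorname{Var}(\bar{x}) \;=\; \mathrm{SE}(\bar{x})^{2} \;=\; \frac{\sigma^{2}}{n}
      ```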

  • @soylentpink7845 · 8 months ago

    Interesting! Could you do more practical applications of Bayesian statistics? I see it asked for and required more and more by bigger tech companies.

  • @AlexLabuda · 8 months ago

    So fun! Thanks for the video

  • @wilrivera2987 · 8 months ago

    Big fan of Dr Who too

  • @nosinz753 · 8 months ago

    Adding this to my EDA toolbox.

  • @sr4823 · 8 months ago

    Sometimes copilot feels more like a burden than a companion. Thanks Julia, I always learn a ton with these videos.

  • @manueltiburtini6528 · 9 months ago

    Amazing work! You're so inspiring, thanks for sharing!