Julia Silge
Mapping change in United States polling places
We observed Martin Luther King Day in the US this week and this week’s #TidyTuesday dataset focuses on polling places in honor of King’s work on voting rights. In this screencast, let’s use summarization and visualization to understand how the numbers of polling places in the US have changed. Check out the code on my blog: juliasilge.com/blog/polling-places
1,696 views


Empirical Bayes for Doctor Who episodes
2.8K views · 8 months ago
This week’s #TidyTuesday dataset is all about Doctor Who, celebrating the upcoming new episodes, and this screencast walks through how to use empirical Bayes to estimate ratings for different episode writers. The great thing about empirical Bayes is we can take into account the number of episodes each writer wrote. Check out the code on my blog: juliasilge.com/blog/doctor-who-bayes
Logistic regression for US House election vote share
2.8K views · 9 months ago
Today is Election Day and this week’s #TidyTuesday dataset is about elections for the US House of Representatives. This screencast demonstrates how to use logistic regression to understand vote share in these elections, highlighting how to use visualization for model interpretability and a matrix syntax for your model’s outcome (a good fit when you have proportion data). Check out the code on m...
Topic modeling for Taylor Swift Eras
3.2K views · 9 months ago
Last week’s #TidyTuesday dataset was about the songs of Taylor Swift, and this screencast demonstrates how to use topic modeling to learn how the text content of Taylor Swift’s work has changed through all her musical eras. Check out the code on my blog: juliasilge.com/blog/taylor-swift
Weighted log odds ratios for haunted places in the US
1.6K views · 10 months ago
It’s getting to be spooky season and this week’s #TidyTuesday dataset is about haunted locations in the United States. In this screencast, let’s use log odds ratios weighted via empirical Bayes to understand which US states are more likely to have haunted cemeteries and which are more likely to have haunted schools. Check out the code on my blog: juliasilge.com/blog/haunted-places
Bootstrap confidence intervals for how often Roy Kent says “F*CK”
2.8K views · 10 months ago
He's here, he's there, he's every f*cking where, and in this screencast, we use Poisson regression and bootstrap resampling to find confidence intervals for when Roy Kent uses colorful language more or less on the TV show Ted Lasso. This #TidyTuesday dataset was created by Deepsha Menghani for her recent talk at posit::conf. Check out the code on my blog: juliasilge.com/blog/roy-kent
Evaluate multiple ML approaches for spam detection
3.1K views · 11 months ago
In this screencast, we use tidymodels workflowsets to try out multiple modeling approaches for a #TidyTuesday dataset on spam email. We finish off with how to create a deployable model object and set up an API for our model. Check out the code on my blog: juliasilge.com/blog/spam-email
Evaluate the performance of GPT detectors
1.9K views · a year ago
This week’s #TidyTuesday is about detecting output from GPT language models, and specifically how these detectors perform differently for native and non-native English writers. In this screencast, learn how to evaluate classification models with tidymodels, using either predicted classes or predicted probabilities. Check out the code on my blog: juliasilge.com/blog/gpt-detectors
Byte pair encoding tokenization for geographical place names
2.1K views · a year ago
A recent #TidyTuesday makes available geographical place names in the US, and we can explore these names as text data. In this screencast, learn how to use byte pair encoding tokenization together with Poisson regression to find out which kinds of names are used more often and which are used less often. Check out the code on my blog: juliasilge.com/blog/place-names
Use xgboost and effect encodings to model tornadoes
3.7K views · a year ago
This week’s #TidyTuesday is about tornadoes in the US, and it provides a great opportunity to think about how we formulate a modeling approach in challenging circumstances. In this screencast, learn how to use xgboost with racing and effect encodings to predict the magnitude of tornadoes. Check out the code on my blog: juliasilge.com/blog/tornadoes
Predict childcare costs in US counties with xgboost and early stopping
3.5K views · a year ago
Mother’s Day is coming up this weekend and this week’s #TidyTuesday is about childcare costs in the US. In this screencast, learn how to use xgboost with early stopping to predict the cost of childcare from other characteristics of a county, like demographics and women’s earnings. Check out the code on my blog: juliasilge.com/blog/childcare-costs
Deploy a model on AWS SageMaker with vetiver
2.7K views · a year ago
AWS SageMaker is a fully managed machine learning service that lots of organizations use for their ML tasks, but it hasn’t always been easy for R users to go about their work on this platform. In this screencast, learn how to train and deploy a model with R and vetiver on SageMaker infrastructure. Check out the code on my blog: juliasilge.com/blog/vetiver-sagemaker
Use OpenAI text embeddings for horror movie descriptions
4.2K views · a year ago
High quality text embeddings are becoming more available from companies like OpenAI. This screencast walks through how to obtain and use such embeddings for #TidyTuesday data on horror movies. It’s always important to understand the limitations of text embeddings, such as how they reflect social biases. Check out the code on my blog: juliasilge.com/blog/horror-embeddings
Resampling to understand gender in art history textbooks
3K views · a year ago
Artists who are women have been underrepresented both in how art is displayed and studied, and we can use resampling together with #TidyTuesday data on art history textbooks to robustly understand more about this imbalance. Check out the code on my blog: juliasilge.com/blog/art-history
To downsample or not? Handling class imbalance in bird feeder observations
2.8K views · a year ago
Will squirrels come eat from your bird feeder? Let's fit a model with #TidyTuesday data on bird feeders, both with and without downsampling, to find out. Check out the code on my blog: juliasilge.com/blog/project-feederwatch
How to handle high cardinality predictors for data on museums in the UK
6K views · a year ago
Find high FREX and high lift words in Stranger Things dialogue
2.6K views · a year ago
Deploy different prediction types for a Bigfoot sighting model
2.7K views · a year ago
Deploy a model for LEGO sets with Docker
4.2K views · a year ago
Sliding window aggregation for rents in San Francisco
3.2K views · 2 years ago
Understand the gender pay gap three ways
5K views · 2 years ago
Spatial resampling to understand drought in Texas
4.2K views · 2 years ago
Predict NYT bestsellers with wordpiece tokenization
2.7K views · 2 years ago
Handling coefficients for modeling collegiate sports expenditures
3K views · 2 years ago
Poisson regression with tidymodels for package vignette counts
4.2K views · 2 years ago
Statistical inference for aircraft and rank of Tuskegee airmen
3.1K views · 2 years ago
Feature engineering & interpretability for xgboost with board game ratings
7K views · 2 years ago
Predict ratings for chocolate with tidymodels
5K views · 2 years ago
Topic modeling for Spice Girls lyrics
5K views · 2 years ago
Predicting viewership for Doctor Who episodes
3.2K views · 2 years ago

Comments

  • @enicay7562 · 1 day ago

    Thank you

  • @user-sb9oc3bm7u · 6 days ago

    Would be amazing if you did a video using nested data (instead of having a nominal variable, nest it and generate a model for each of the levels, for example), also using map_workflow etc. Great as always!

  • @rafaelcallejo8367 · 19 days ago

    Good morning. Your videos are excellent; just one request for future videos: could you zoom the camera in more on the code, since it looks very small? Apologies for the suggestion.

  • @dinohadjiyannis3225 · 2 months ago

    Julia, if I'm using a topic model on YouTube comments to determine which video best explains topic modeling, how can I decide if your video or another video should be suggested? I see the model ranks comments with "gamma." If each comment is linked to a video ID, and based on gamma some or all comments rank highly in a hypothetical "topic modeling" topic, what then? Can we infer that your video is the best?

    • @JuliaSilge · 2 months ago

      HAHA I can't tell if this is serious or not 🙈 In case it is, I will say that since topic modeling is unsupervised ML, it can't be used in a straightforward way to evaluate better/worse (you are not predicting a label). Instead, like you say, you could compare the relative proportion of certain topics (like, say, a topic that seems to be mostly about topic modeling) in one video's comments compared to others, and make an evaluation of videos based on that.
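      For anyone wanting to try this, here is a minimal sketch of the comparison described above, on toy data (the video IDs, words, and counts below are invented for illustration; the real input would be one concatenated comment document per video):

      ```r
      library(topicmodels)
      library(tidytext)
      library(dplyr)

      # Toy stand-in: word counts per video, one "document" per video ID
      comment_counts <- tibble(
        video_id = c("vid_a", "vid_a", "vid_b", "vid_b"),
        word     = c("topic", "model", "ggplot", "map"),
        n        = c(5L, 3L, 4L, 2L)
      )

      comment_dtm <- cast_dtm(comment_counts, video_id, word, n)
      lda_fit <- LDA(comment_dtm, k = 2, control = list(seed = 123))

      # Per-document topic proportions (gamma): rank videos by how much
      # of their comment text is estimated to belong to, say, topic 1
      tidy(lda_fit, matrix = "gamma") |>
        filter(topic == 1) |>
        arrange(desc(gamma))
      ```

      The video with the highest gamma for the "topic modeling" topic would be the one whose comments are most dominated by that topic.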

    • @dinohadjiyannis3225 · 2 months ago

      @@JuliaSilge If I can "cluster" comments related to topic modeling and find that the most relevant ones are linked to your video ID (based on beta, which gives you the top word probabilities), your video will appear with the highest relevance to that topic (based on gamma). This means your video is the most representative of that specific topic. But wait: if I then manually compare, say, the top 10 most relevant videos and see that your video (which is at the top) also has a lot of likes, comments, engagement, and perhaps great sentiment (after computing it) compared to the other 9, I can conclude that your video is the "best" and would recommend it. Does this make sense, or am I misinterpreting gamma/beta? ***Assume I have concatenated all comments into one corpus; each corpus is linked to a video ID.

    • @JuliaSilge · 2 months ago

      @@dinohadjiyannis3225 I think that makes sense! Sounds to me like you are interpreting correctly. 👍

    • @dinohadjiyannis3225 · 2 months ago

      @@JuliaSilge A big thanks to you for replying, given that this video is 6 years old. 🥇

  • @rosiedavies7708 · 2 months ago

    does this work in the same way with regression problems?

    • @rosiedavies7708 · 2 months ago

      also thanks for this video, its very helpful and clear

    • @JuliaSilge · 2 months ago

      Yep, you would use `set_mode("regression")` in that case
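      As a small runnable sketch of that switch (the model type and engine here are illustrative, not taken from the video):

      ```r
      library(parsnip)

      # The same spec family you might use for classification,
      # switched to regression with set_mode()
      rf_spec <- rand_forest(trees = 500) |>
        set_mode("regression") |>
        set_engine("ranger")

      rf_spec
      ```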

  • @mxm8900 · 2 months ago

    Wow great video. I have nothing to do with text analysis, but I still watched the whole video

  • @smomar · 2 months ago

    All Hail the Dino! Now quickly get it some food, or else ... Thanks for the video. It was very informative.

  • @andreacierno4642 · 3 months ago

    Thank you Julia. Can this work if my version of 'type' has 5-8 categories, where the final output is "more like 'X'" for each category label? Is there a way to get more words in each prediction fold, so in the final output it could show 3 words for each "more like"? Thank you, again.

    • @JuliaSilge · 3 months ago

      I recommend that you check out this chapter of my book with Emil Hvitfeldt: smltar.com/mlclassification#mlmulticlass

    • @andreacierno4642 · 3 months ago

      @@JuliaSilge Will do and thank you.

  • @emredunder9108 · 3 months ago

    You are the queen of data analysis. Thanks for the video!

  • @kevingiang · 3 months ago

    Hi @JuliaSilge - thanks for your wonderful and helpful videos. I am trying to replicate your code with my own dataset and I get the following error when trying to initiate the tuning of the model:

        xgb_rs <-
          tune_race_anova(
            object = xgb_wf,
            resamples = dens_folds,
            grid = 15,
            control = control_race(verbose_elim = TRUE)
          )
        ℹ Evaluating against the initial 3 burn-in resamples.
        i Creating pre-processing data to finalize unknown parameter: mtry
        Error in `tune::tune_grid()`:
        ! Package install is required for xgboost.
        Run `rlang::last_trace()` to see where the error occurred.

    It says that a package install is required. Any idea what package may be missing? I installed the 'tune' package and it still gives me the same error. Any thoughts are appreciated. Thanks, Kevin

    • @JuliaSilge · 3 months ago

      It's the xgboost package that needs to be installed: CRAN.R-project.org/package=xgboost

    • @kevingiang · 3 months ago

      @@JuliaSilge I just figured it out... thanks much for answering back! You rock!

  • @gsonbiswas9765 · 3 months ago

    Nice explanation. You could have used the searchK() function to show us how to select the range for K.

  • @deltax7159 · 3 months ago

    What appearance theme are you using here?

    • @JuliaSilge · 3 months ago

      I use one of the themes from rsthemes: www.garrickadenbuie.com/project/rsthemes/ I think Oceanic Plus? There are lots of nice ones available in that package.

  • @danielhallriggins9008 · 4 months ago

    Thanks Julia, love your videos! To get a more accurate sense of performance, would it be helpful to use {spatialsample} to account for spatial autocorrelation?

    • @JuliaSilge · 4 months ago

      That would be a great thing to do! This dataset doesn't have explicitly spatial information in it (just county FIPS code) so you would need to join some spatial info together with the original dataset.

  • @AnkeetSingh-gt9fm · 4 months ago

    Hey Julia, great tutorial. I had a question. Here you used Subject_Matter as the only high cardinality variable. If we have a dataset where there are multiple columns with high cardinality, can the recipe method be used in such a case for all the high cardinality columns?

    • @JuliaSilge · 4 months ago

      Yes, you sure can! You will need to keep in mind how much data you have vs. how many predictors you are trying to encode in this way, and definitely keep in mind that you are using the **outcome** in your feature engineering. You can read more here: www.tmwr.org/categorical

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Great I’ll keep that in mind. Thank you!

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Hi, I had another question with regards to my previous question. For each column, would we have to define a separate recipe? And while creating the workflow, how would you add the recipes for multiple columns in the workflow(since workflow only allows one recipe)? I was unable to find resources for this online. Any help would be appreciated!

    • @JuliaSilge · 4 months ago

      @@AnkeetSingh-gt9fm Oh, you don't need a separate recipe for different columns, just separate steps. So you could do `step_lencode_glm()` then pipe to another `step_lencode_glm()`, etc.
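      A minimal sketch of that chaining on toy data (the data frame and column names below are invented; this assumes the {embed} package, which provides `step_lencode_glm()`):

      ```r
      library(recipes)
      library(embed)

      set.seed(123)
      # Toy stand-in for the museums data, with two nominal predictors to encode
      museum_df <- data.frame(
        outcome = factor(sample(c("open", "closed"), 200, replace = TRUE)),
        subject = sample(letters, 200, replace = TRUE),
        region  = sample(c("north", "south", "east"), 200, replace = TRUE)
      )

      # One recipe, one lencode step per high-cardinality column
      rec <- recipe(outcome ~ ., data = museum_df) |>
        step_lencode_glm(subject, outcome = vars(outcome)) |>
        step_lencode_glm(region, outcome = vars(outcome))

      # Both columns come out as numeric effect encodings
      prep(rec) |> bake(new_data = NULL)
      ```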

    • @AnkeetSingh-gt9fm · 4 months ago

      @@JuliaSilge Thank you that’s what I figured and ran the code. I received an error: Error in - dsy2dpoC.Msymfrom) : not a positive definite matrix (and positive semidefiniteness is not checked), looks like I need to assess some variables in my model. You are very helpful with your prompt replies, I really appreciate it. Thank you!

  • @olexiypukhov-KT · 4 months ago

    I always love your videos Julia! I learn so much every time. Thank you for all the screencasts! Hopefully you haven't stopped, and I am looking forward to more!

  • @Ejnota · 5 months ago

    With a left join you keep all the rows

  • @cb5231 · 5 months ago

    thanks for this video Julia <3

  • @omoniyitemitope6113 · 5 months ago

    Hi, I have a dataset with 35 variables and want to run some regressions (RF, xgboost, etc.) on it. I am new to R and want to know if you have any special online training that I can register for?

    • @JuliaSilge · 5 months ago

      I recommend that you work through this: www.tidymodels.org/start/ And then take a look at this book: www.tmwr.org/ Good luck!

    • @omoniyitemitope6113 · 5 months ago

      @@JuliaSilge Thanks so much for your response. I followed one of your screencasts and got an rsq of 0.37 for the RF model. Is there anything I can do to improve the fit of my model?

    • @JuliaSilge · 5 months ago

      @@omoniyitemitope6113This definitely depends on the specifics of your situation! I recommend that you check out a resource like *Tidy Modeling with R* for digging deeper on the model building process: www.tmwr.org/

    • @omoniyitemitope6113 · 5 months ago

      @@JuliaSilge Thanks for your response. I will go through it. I did something without knowing its statistical implication: I took the log of my dependent variable and performed an RF, and to my surprise I got % var explained of 99.74, which looks too good to be true to me.

  • @mohamedhany2513 · 5 months ago

    Could you make a video explaining how to deploy a model with Shiny?

    • @JuliaSilge · 5 months ago

      You may find this demo from Posit Solutions Engineering helpful: solutions.posit.co/gallery/bike_predict/

  • @zapbesttowatch2660 · 6 months ago

    good explanation

  • @eileenmurphy7044 · 6 months ago

    Thank you Julia for another excellent video. I have been trying to replicate your methods of generating a Canadian Provincial Map with the outlines of the provinces. For some reason map_data doesn't include borders to provinces the way that the states is set up. Do you have any ideas, how I can use map_data to do this? I am using map_data("world", "Canada") to get the provinces.

    • @eileenmurphy7044 · 6 months ago

      Hi Julia, Just answered my own question. The R package mapcan works like map_data - except it includes Canadian provincial borders.

    • @JuliaSilge · 6 months ago

      @@eileenmurphy7044 Ah, that is good to hear! 🙌

  • @nabereon · 6 months ago

    This channel is pure gold.

  • @teorems · 7 months ago

    Bitters are good for health!

  • @G2Mexpert · 7 months ago

    After recently having hacked my way through a similar attempt to visualize data by state, this was like receiving "the answer sheet" from your teacher. Really appreciate the smart use of usmap information, building the tibble to facilitate the join of state abbr to state(lower), and the ever-useful window commands in dplyr!

  • @Y45HV1N · 7 months ago

    (I'm very very new to all this) Getting the prior and the empirical/posterior from the same data seems counterintuitive to me and a bit like confirmation bias.

    • @JuliaSilge · 7 months ago

      If you'd like some more conceptual background and theory behind this, I recommend the writing that Bradley Efron has done on it, the 1985 paper by Casella, and, for a more practical approach, the book _Introduction to Empirical Bayes: Examples from Baseball Statistics_ by my collaborator David Robinson.

  • @CaribouDataScience · 7 months ago

    Thanks, another interesting video.

  • @angvl8793 · 7 months ago

    We see Julia we click Like !! :)

  • @KK-tt5jz · 7 months ago

    brilliant work, thank you Julia!

  • @rayflyers · 7 months ago

    I've been watching your text mining videos recently. I'm about to start a project mining case notes for info on clients' relatives. I'm hoping to build a model that can predict if a case note contains that info or not. Any tips would be appreciated!

    • @JuliaSilge · 7 months ago

      That sounds like it may be doable! Overall, I recommend this book I wrote with Emil for advice on predictive modeling for text data: smltar.com/

  • @elvinceager · 7 months ago

    Yes, a new video! It's been a while. Love this content.

  • @trevorschrotz · 7 months ago

    Excellent, thank you.

  • @Adeyeye_seyison · 7 months ago

    As usual, brilliant work from a brilliant data scientist...

  • @forheuristiclifeksh7836 · 7 months ago

    0:08

  • @wapsyed · 7 months ago

    Your videos are therapeutic haha

  • @mocabeentrill · 7 months ago

    Clearly explained and direct to the point! Thank you Julia.

  • @trevorschrotz · 7 months ago

    Thanks for the tip on using type.predict = "response" in the broom::augment function. I learn something new from each of your videos, so thanks for the work that you put into making these.

  • @ColonelHathi · 7 months ago

    YES! This is evidence supporting my hypothesis that Jodie Whittaker is a great actress, but Chris Chibnall is a terrible writer 😅. It isn't conclusive obviously. We would need observations from when they worked separately to get better proof.

  • @reshmilb2527 · 7 months ago

    Please avoid a black background colour; use an eyesight-friendly colour.

  • @Jakan-sf3xj · 8 months ago

    Thank you for the great video. I have one question: assuming the best model was one of the tuned random forest models, how would we extract the parsnip object to see the tuned hyperparameters, i.e. mtry and min_n?

    • @JuliaSilge · 8 months ago

      You might check out the different "extract" functions in tidymodels. You can do `extract_fit_parsnip()` but you can also do `extract_parameter_set_dials()` to get the hyperparameters directly: hardhat.tidymodels.org/reference/hardhat-extract.html
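      As a small runnable illustration of `extract_fit_parsnip()` (a plain fitted workflow on built-in data, not the tuned random forest from the video; for tuned results you would first use `select_best()` and `finalize_workflow()` before fitting):

      ```r
      library(tidymodels)

      # A small fitted workflow standing in for the final model
      wf_fit <- workflow() |>
        add_formula(mpg ~ wt + hp) |>
        add_model(linear_reg()) |>
        fit(data = mtcars)

      # Pull out the underlying parsnip model object
      extract_fit_parsnip(wf_fit)
      ```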

  • @user-sb9oc3bm7u · 8 months ago

    Hey Julia. Writing you here although it's not related to this specific video. I am using `tidylo::bind_log_odds()` for a project, but the fact that the order of set/feature is different from the one in `tidytext::bind_tf_idf()` (which requires term and then document) makes it hard to easily encode it for Shiny (with `f <- type_of_algorithm; f(docs/sets, terms/fts, n = n)`, for example). Any chance you change the tidylo order? Obviously it can be done with a simple `if () {} else {}`, but it's much cleaner to use the `f <- select_algo` approach :)

    • @JuliaSilge · 8 months ago

      Can you open an issue over at tidylo with an example/reprex showing what you mean? github.com/juliasilge/tidylo/issues

  • @Jackeeba · 8 months ago

    The extra brackets from Copilot can be really annoying! When I turn it on, I've started to use Enter for RStudio's autocomplete (Tab just takes Copilot's often incorrect suggestion). Does anyone have a better solution?

    • @JuliaSilge · 8 months ago

      Looks like the RStudio team is tracking the extra parentheses here: github.com/rstudio/rstudio/issues/13953 Feel free to thumbs up or add additional detail!

  • @manueltiburtini6528 · 8 months ago

    Amazing analysis! :O

  • @dantshisungu395 · 8 months ago

    Great episode as always. Just a little whim from my side: would you mind doing videos about real-life applications of TDA? And how can we implement Spark for models that aren't in parsnip? Thank you 😅

  • @matthewcarter1624 · 8 months ago

    Hey Julia, thanks for your video! I really like these. I had one question about the std_var per writer: why did you divide it by the number of observations?

    • @ariskoitsanos607 · 8 months ago

      Hey, hi! It's because the standard error of the average is sigma/sqrt(n), so the variance of the average would be sigma^2/n.
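      Spelled out (with σ the per-episode standard deviation and n the number of episodes a writer wrote):

      ```latex
      \mathrm{SE}(\bar{x}) \;=\; \frac{\sigma}{\sqrt{n}}
      \qquad\Longrightarrow\qquad
      \operatorname{Var}(\bar{x}) \;=\; \mathrm{SE}(\bar{x})^{2} \;=\; \frac{\sigma^{2}}{n}
      ```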

  • @soylentpink7845 · 8 months ago

    Interesting! Could you do more practical applications of Bayesian statistics? I see it asked for and required more and more by bigger tech companies.

  • @AlexLabuda · 8 months ago

    So fun! Thanks for the video

  • @wilrivera2987 · 8 months ago

    Big fan of Dr Who too

  • @nosinz753 · 8 months ago

    Adding this to my EDA toolbox.

  • @sr4823 · 8 months ago

    Sometimes copilot feels more like a burden than a companion. Thanks Julia, I always learn a ton with these videos.

  • @manueltiburtini6528 · 9 months ago

    Amazing work! You're so inspiring, thanks for sharing!