StatQuest: Random Forests in R

  • Date added: 18. 08. 2024
  • Random Forests are an easy-to-understand, easy-to-use machine learning technique that is surprisingly powerful. Here I show you, step by step, how to use them in R.
    NOTE: There is an error at 13:26. I meant to call "as.dist()" instead of "dist()".
    The code that I used in this video can be found on the StatQuest GitHub:
    github.com/Sta...
    If you're new to Random Forests, here's a video that covers the basics...
    • StatQuest: Random Fore...
    ... and here's a video that covers missing data and sample clustering...
    • StatQuest: Random Fore...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/...
    If you'd like to support StatQuest, please consider...
    Support StatQuest by buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumr...
    Paperback - www.amazon.com...
    Kindle eBook - www.amazon.com...
    Patreon: / statquest
    ...or...
    CZcams Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshi...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer....
    ...or just donating to StatQuest!
    www.paypal.me/...
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    #statquest #randomforest #ML

Comments • 404

  • @statquest · 2 years ago +3

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @RaushanKumar-fq7bo · 6 months ago

      I am using this loop command for random forest,
      oob.error.data

    • @statquest · 6 months ago

      @@RaushanKumar-fq7bo Are you using my code, or did you write your own?
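      For context, the loop/plotting code from the video builds a long-format data frame from the model's error-rate matrix. A minimal sketch of that pattern (assuming `model` is a fitted randomForest classifier whose classes are "Healthy" and "Unhealthy", as in the video's heart-disease example):

```r
library(randomForest)

# Long-format data frame of OOB error rates: one row per (tree, error type).
# The class names "Healthy"/"Unhealthy" are assumed from the video's example.
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), times = 3),
  Type  = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"],
            model$err.rate[, "Healthy"],
            model$err.rate[, "Unhealthy"]))
```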

  • @BayAreaLakers · 3 years ago +10

    Can't believe I went from not knowing anything about Machine Learning to learning so much after just a few days. Thanks Josh!

  • @chrisvaccaro229 · 4 years ago +44

    Jesus Christmas this is incredibly useful. I code in R and
    A) it's almost impossible to find ML tutorials for R
    B) it's really hard to find straightforward ML tuts that are free of jargon ANYWAY
    C) it's hard to find tuts in plain english and without talking about "y-hat" and crap i don't even remember from calculus
    D) it's hard to find stat videos with such a good musical score ;)
    and E) these are just awesome.
    I'd literally given up on finding decent ML tuts for R and just said "screw it, I'll learn python" but then I found these accidentally. These are freaking epic. I literally just went through like 25% of your videos hitting "Shift + N" then liking them (next video, like button, next video, like button, next video, like button, etc.)
    These videos are the BEST. You should make a MOOC. Yours would be better and easier to follow than Andrew Ng or Jeremy Howard (who are the superstars of ML.)
    Maybe even make a course on DataCamp. You can make interactive ones that way.
    Either way, these videos are starting from AI heaven.

    • @chrisvaccaro229 · 4 years ago +4

      You know what would be really, really useful? If you made a teaching tutorial. Like, if you made a tutorial outlining your teaching philosophy and how you're able to make explainer videos so clear and concise. That way other teachers, professors, or even CZcamsrs could watch it and apply it to their OWN subjects. That would be like a full-blown meta-improvement to the educational world.

    • @statquest · 4 years ago

      Thank you very much! :)

    • @statquest · 4 years ago +6

      Wow! That is very flattering. I recently gave a talk at Duke University about my teaching style. The talk was called "The Elements of StatQuest". Maybe I'll turn that into a video.

    • @chrisvaccaro229 · 4 years ago

      @@statquest Yea - please do!

    • @chrisvaccaro229 · 4 years ago

      @@statquest Is there any chance you have a video copy of the talk in the meantime you'd be willing to send? I just looked up "The elements of StatQuest" and found a zoom link from Duke, but there was no recorded version available. You don't happen to have a recording, do you?

  • @BT-jh3dq · 3 years ago +4

    I've got so much more out of a couple of hours watching your videos than out of a couple of weeks trying to understand RFs through papers/books. Going back to the papers now, but with much more of a handle on what's going on. Thanks!

  • @cajogos · 4 years ago +11

    These videos using R are a lifesaver (quite literally!) Thanks a lot for these Josh!

  • @sudiptomitra · 3 years ago +3

    This demo is an end-to-end, complete RF-in-R walkthrough!! It can easily be regarded as the "GOAT" in this subject. Thanks, and looking forward to viewing more great demos on ML topics.

  • @Lucrezio81 · 3 years ago +1

    It's rare to find a video like this. Libraries, scripts, methodology, and processes are all so well explained and coherently organized. Even the technical language was amazingly clear for a non-native English speaker like me. I'm amazed that 12 people disliked it!

  • @dr.sangramsinha2784 · 3 years ago +5

    I have recently become a regular follower of your channel. This is awesome. I have learned a lot despite coming from neither a mathematics nor a computer science background. Even as an experimental biologist, I understood most of your videos on regression analysis and am now getting familiar with machine learning. I wonder if you could create a video on protein-protein or protein-ligand interactions using machine learning. I pay my deep respect to the effort you have made to teach us all this complex stuff in such a simple way. Furthermore, you have a beautiful voice too; I love hearing the StatQuest tunes. Lastly, I wish you good health and wealth.

    • @statquest · 3 years ago +1

      Thank you very much! I'm glad my videos are helpful.

  • @nurinurlailasetiawan2689

    Josh your channel is super awesome! I've been struggling to understand ML because I need to work with RF for my hyperspectral data. I read a lot of papers and books, but so far, your videos are the one that helps me the most! Very effectively communicated!!! Big thanks!!!

  • @lauraeli2286 · 2 years ago +1

    You really are the best here on CZcams at explaining these 'complex' topics I think - I put inverted commas because actually they're not so complex anymore after watching your videos! :)

  • @justarandomchannel5246 · a year ago +1

    I was falling asleep reading my coursework material; the ukulele touch and the fun bits you put in make this dreadfully boring subject interesting. Thanks mate!

  • @glauberbrito8685 · 4 years ago +6

    You saved my day, Josh. You did a GREAT JOB !! Congrats.

  • @BrianUrlacherPoliSci · 5 years ago +2

    This was awesome. I've been working for 2 days to wrap my head around the R implementation of this. The code I was working with now makes perfect sense.

  • @shahrizalmuhammadabdillah3127 · 11 months ago +1

    The tricks are so fancy, and they helped me. I'm cheering while watching this...

  • @shahrizalmuhammadabdillah3127 · 11 months ago +1

    I can't believe I'm only watching this now, and I love this StatQuest. Thanks Josh... you opened my mind again to another job.

  • @jasperobico1459 · 5 years ago +1

    Your tutorial video was really helpful! I am not sure if I would be able to do Random Forest without seeing this one! Great job on making a tutorial video that is easy to follow and to understand for non-R users like me. Kudos!

  • @veducatube5701 · 4 years ago +6

    Dear Sir!
    You saved a lot of my time and a lot of my energy. Thank You... God Bless You with health and Wealth.
    Please keep making videos and keep saving our lives...

  • @teetanrobotics5363 · 3 years ago +12

    I love your channel and have almost finished the entire ML playlist. Your explanations, animations and diagrams are just amazing🔥🔥 and far better than most university curricula. I have a request: just like the R tutorials, could you please make Python versions of the machine learning models?

    • @statquest · 3 years ago +2

      I'd like to do that as soon as I have time.

    • @pacificbloom1 · 3 years ago +1

      @@statquest Kindly consider this as a request from one more fan of yours....really need python videos because this is the only channel I have subscribed to learn data science/machine learning

  • @ChunLin_UoE · 5 years ago +6

    Thank you very much - very detailed explanation! It may be easier to convert the err.rate matrix to a data frame and use tidyr::gather() to transform it for ggplot2.
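    The suggested tidyr alternative could look like this (a sketch; `model` is assumed to be a fitted randomForest object):

```r
library(randomForest)
library(tidyr)
library(ggplot2)

# err.rate is a wide matrix: one row per tree, one column per error type
err.df <- as.data.frame(model$err.rate)
err.df$Trees <- 1:nrow(err.df)

# gather() reshapes wide -> long so each error type becomes its own line
long.err <- gather(err.df, key = "Type", value = "Error", -Trees)
ggplot(long.err, aes(x = Trees, y = Error, color = Type)) + geom_line()
```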

  • @kinwong6383 · 5 years ago +2

    I love the way you show both ways of doing certain things. It really helps an R beginner like me a lot!
    Thank you very much! I wish I could visit one of your performances one day.

    • @statquest · 5 years ago +1

      Thank you so much! I'm glad to hear my videos are helpful. :)

  • @adityanjsg99 · 4 years ago +1

    You are such an awesome narrator!
    I depend more on your videos than my teacher.

  • @j.jayelynnshin4289 · 3 years ago +2

    I don't understand ppl who clicked on "dislike" at all. Thank you for doing this!!

  • @MrRoshanchoudhary · 6 years ago

    Hi Joshua, your explanations are mind-blowing. I'm loving it. The way you explain each and every note is simply awesome. I'm grateful to you. Thank you so much. Keep making such videos. Waiting eagerly for Logistic Regression. Bammm!!!!! :)

  • @alexandersierraa · 5 years ago +1

    Thanks a lot Josh, your presentation is very clear and deep

  • @melaniemax6437 · a year ago +1

    thank u so much! really helpful for me as a beginner in machine learning.

  • @ffloresalfaro · 5 years ago +2

    Love your videos! Proximity matrix is excellent. Thanks so much for making these great videos!!

    • @statquest · 5 years ago

      Hooray! I'm glad you like StatQuest! :)

  • @himanshu8006 · 5 years ago +1

    It can't be explained more easily than this... great job Josh, thanks a lot

  • @Rpekeno · 6 years ago

    This video is SO good. I'm a newcomer at this, and your materials have helped me a lot! Thanks!

  • @tizhang9635 · 3 years ago +1

    Thanks very much for your channel!!!! Way easier to understand than reading papers.

  • @nathaliatf · 5 years ago +1

    Great Video! Efficient and not boring at all!!

  • @francinagoh2541 · 3 years ago +1

    Thanks, I learned a lot from your video. Have a nice day!

  • @anushkabanerjee2510 · a year ago +1

    Fantastically explained !!

  • @revenez · 4 years ago +1

    Brilliant and enjoyable!
    Thank you and please keep up the good work.

  • @jitenjaipuria · 6 months ago +1

    thank youuuuuuuuuuuu. i will acknowledge you in my scientific paper

  • @angelique3062 · 4 years ago +3

    Thank you Josh! :) You really have a gift for teaching! Could you please do a random forest regression in R?

    • @statquest · 4 years ago +3

      Possibly! I'll put it on the to-do list.

    • @imanep4902 · 4 years ago

      @@statquest nice, looking forward to it!

    • @yoyohu6522 · 4 years ago

      @@statquest Thanks! looking forward to the RF regression in R.

    • @mariyapak428 · 2 years ago

      @@statquest -- Thank you Josh!

  • @yumikowiranto4330 · 3 years ago +1

    Thank you so much!!!!! This is really helpful for my assignment

    • @statquest · 3 years ago +1

      Glad it was helpful!

    • @yumikowiranto4330 · 3 years ago

      @@statquest is there a limitation in terms of the kind of variables I can include as predictors? For example, can I include race (e.g., white, hispanic, african-american, asian, other)?

    • @statquest · 3 years ago +1

      @@yumikowiranto4330 As far as I know, there are no limitations on the types of variables you can use as predictors.

  • @DanTaninecz · 5 years ago +1

    That mtry trick is pretty slick.

  • @steliosgiannopoulos8297 · 3 years ago +1

    Change the nick to Josh R-Charmer , excellent work thank you for all of your videos !!!

  • @amirgharavi4082 · 5 years ago

    Thanks so much for making these great videos. Really appreciate it

  • @HarshKumar-zc4ox · 5 years ago +2

    Great job Starmer. You explained everything quite nicely.
    However, while explaining the confusion matrix, you got it backwards: the vertical columns are the ground truth and the horizontal rows are the predicted values. The explanation should have been that 28 healthy patients were misclassified as unhealthy, but you explained the opposite. The same goes for the false positives. I saw your confusion matrix lecture; there you explained the confusion matrix correctly.

    • @wei2674 · 4 years ago

      Harsh Kumar, I think R outputs it this way so that 0.14 is the Type I error rate / false-positive rate, which means 23 healthy patients were classified as unhealthy (false positives).
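      The orientation is easy to check directly: in randomForest's confusion matrix the rows are the observed classes and the columns are the OOB-predicted classes. A quick check on the built-in iris data:

```r
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

# rows = observed classes, columns = predicted (OOB) classes,
# plus a class.error column with the per-row error rate
rf$confusion

# the same table rebuilt by hand from the OOB predictions
table(observed = iris$Species, predicted = rf$predicted)
```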

  • @AOLFlyersNewsletters · 4 years ago +1

    Josh - you are like a god! Thanks man.

  • @alecvan7143 · 4 years ago +1

    Super helpful, thanks Josh

    • @statquest · 4 years ago

      Hooray! (by the way, you might be in the running for the most comments from a single viewer! Keep'em up!)

  • @kaam975 · 2 years ago +1

    Thanks for the code!

  • @serman5671 · a year ago +1

    so well explained

  • 4 months ago +1

    Awesome statQuest, did not know you can also impute data using random forests :) How does the analysis of parameters (ntree, mtry) change if we are doing regression instead of classification? Would also love to see a regression example.

    • @statquest · 4 months ago +1

      I've never used it for regression, but I'll keep that topic in mind.

  • @christiansetzkorn6241 · 3 years ago +1

    Great stuff! Thanks!

  • @andreaballestero7780 · 3 years ago +1

    This was very helpful, thank you!! :)

  • @mateuszjaworski2974 · 3 years ago +1

    Hi Josh! It would be great if you could show us how, after building the random forest, to get predictions on brand-new data ;)

  • @wsgsantos · 5 years ago +1

    Very good explanation! Thanks from Brazil! :-)

    • @statquest · 5 years ago +1

      Muito obrigado! :)

    • @pedrosenna100 · 5 years ago +1

      @@statquest I am a professor in an Industrial Engineering course in Brazil and just discovered your channel; I simply loved the videos! I teach logistics but wanted to bring in some data science practices, and your channel is just perfect. I can't thank you enough for the help you gave me by being so didactic!

    • @statquest · 5 years ago

      @@pedrosenna100 Hooray!!! I'm so glad to hear that my videos are helpful in Brazil. It's a beautiful country with an amazing culture. I visited once a few years ago and hope to visit again as soon as I can.

  • @vivianhu3389 · 4 years ago +1

    Super Clear! THANK YOU!

  • @PetalGamesStudios · 4 years ago +1

    Awesome video! Thanks again!

  • @ImGeneralJAckson · 10 months ago +1

    that's it. I'm buying a shirt!

  • @hikikomorihachiman7491 · a month ago +1

    Thank you

  • @balaji.r2735 · 4 years ago +1

    Thank you very much

  • @IamCaptainMan · 3 years ago +1

    Thanks man, you're awesome!

  • @Wissro · a year ago +1

    Thank you so much, could you perhaps make more R tutorials for machine learning techniques?

    • @statquest · a year ago +1

      I'll keep that in mind.

    • @Wissro · a year ago +1

      @@statquest Thanks for the quick reply!

  • @hiteshpant · 4 years ago +1

    hi Josh, I really enjoy watching your videos and like the way you have made statistical topics so easy to interpret. Do you have a video for Feature Selection(varImp) using Random Forest?

  • @PaulO-mv6ku · 5 years ago +1

    Brilliant - many thanks.

  • @andrezaluko · 6 years ago +2

    Josh Starmer, I am your fan! You are very funny =D

  • @iBenutzername · 2 years ago +1

    Awesome as always! Can I ask you to make a video about feature importance in RF models?

  • @AR_Wald · 3 years ago +1

    Hooray!

  • @afcc777f · 6 years ago +14

    Can you make a video about random forests for regression in R?
    Thanks

    • @statquest · 6 years ago +8

      I've added it to the to-do list, but it might be a while before I get to it.

    • @afcc777f · 6 years ago +2

      thanks

    • @baherazzam8863 · 6 years ago +3

      Thank you! I am also looking forward to that

    • @cynical_dd · 6 years ago +3

      Hi, Im hoping for this too! Pretty pleaseeee, thank you!

    • @rajatbhosale8188 · 5 years ago +2

      Even I would like to get that.

  • @benben0814 · 6 years ago

    Hey Josh this is very helpful and thanks for all the work! Does your code include cross validation for the random forest?

  • @charliepierce6218 · 4 years ago +1

    Amazing!

  • @mamahotel1308 · 5 years ago

    Love this, thank you!

  • @lifeboston853 · 6 years ago +1

    Hello Joshua, I watched all your videos and they are so awesome! Will you be able to teach us Shrinkage Method (Ridge, Lasso and PCR), Neural Network, Deep leaning, Image analysis, and video analysis?

    • @lifeboston853 · 6 years ago

      Thanks so much! I am looking forward to all your future videos :)

  • @kaam975 · 2 years ago +1

    and for the video of course :)

  • @moniquebrogan7206 · 2 years ago

    Thanks so much for your great videos. Do you cover Variable Importance in any of your videos?

    • @statquest · 2 years ago +1

      Yes. The most conventional approach is with regularization: czcams.com/video/Q81RR3yKn30/video.html

  • @Pavijace · 6 years ago

    Ukulele... lol... a serious concept explained with fun. Thank you, keep going! :-) "Am going home to you"... nice song, btw.

  • @waasdelcolenwtn · 2 years ago +1

    goat

  • @JoelAgarwal-yl2kw · a year ago

    Hi Josh! Amazing video - has been super helpful in my understanding. Quick question, how would I find the AUC and ROC curve for the random forest model based on the code that you made? I'm trying to compare different models to see which is best (as well as compare to logistic regression).

    • @statquest · a year ago

      I show how to do that exact thing (AUC and ROC for random forest) in this video: czcams.com/video/qcvAqAH60Yw/video.html

  • @imanep4902 · 4 years ago +2

    BAM haha thank you!

  • @fritz3555 · 5 years ago +1

    Thanks for the great video series. What about randomForestSRC package? If we have data with missing values, is it better to use the randomForestSRC package? Or should we use the randomForest package?

    • @statquest · 5 years ago

      Unfortunately, I’ve only used the randomForest library, so I can’t tell you which one is better.

  • @user-bz8nm6eb6g · 4 years ago +1

    Thanks!!

  • @abohisham3088 · 4 years ago +1

    helpful and funny, continue

  • @TheEyeofJun · 5 years ago +1

    Hooray!!!

  • @thuanpin · 5 years ago

    Hi, thanks so much for your great lecture. May I ask two questions?
    1) Why did you relabel sex and hd but not the other categorical variables? The levels of ca and thal change after conversion; do they influence the model?
    2) Do we need to normalize continuous variables before running a random forest?
    Many thanks!

  • @zainabkhan2475 · 4 years ago +1

    Thanks for this video, but I have a question:
    can you add code or an example for RF regression?
    Please...

  • @user-uz1wz4gp9d · 5 years ago

    Fantastic video! Very clear!
    Just one more question: does randomForest work with multiple columns of missing values?

  • @reimiranda3213 · 4 years ago +1

    If you have any ecology examples for these stat quests that would be really useful!

  • @SergeySkripko · 5 years ago +1

    Josh, you used cmdscale() on a default dist(method="euclidean") matrix. Does that mean you did PCA, according to your MDS and PCoA video?

    • @statquest · 5 years ago +1

      Great question! Technically you could say that we did PCA on the distance matrix - but PCA is generally thought of as being applied to the raw data and MDS is applied to a distance matrix. So the difference is sort of in the spirit of how the data is processed, which is relatively minor.
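      The MDS step being discussed looks roughly like this (a sketch; `rf` is assumed to have been fit with `proximity = TRUE`, and `as.dist()` is used per the correction in the video description):

```r
# Convert random-forest proximities to distances and run classical MDS (PCoA)
distance.matrix <- as.dist(1 - rf$proximity)
mds <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)

# percentage of variation captured by each MDS axis
mds.var.per <- round(mds$eig / sum(mds$eig) * 100, 1)
```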

  • @MahdiSafarpour · 3 years ago

    I have two questions about optimization of RF hyperparameters (mtry and ntree):
    1) Should we first find the optimal number of trees and then the optimal number of variables, or must we consider the effect of these two parameters simultaneously?
    2) In this video, we examined the pattern of the OOB error as the number of trees increases. Is it a good decision rule to choose the optimal number of trees based on the OOB error alone, or is it better to use other methods such as cross-validation? (I am looking for the best bias-variance tradeoff.)

    • @statquest · 3 years ago

      1) Ideally you would find them simultaneously.
      2) Depending on who you talk to, you'll probably hear both methods as optimal.
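      Tuning both at once can be sketched as a small grid search scored by the final OOB error (illustrative only, on the built-in iris data; the grid values are arbitrary):

```r
library(randomForest)
set.seed(42)

grid <- expand.grid(mtry = 1:4, ntree = c(500, 1000))
grid$oob <- apply(grid, 1, function(p) {
  rf <- randomForest(Species ~ ., data = iris,
                     mtry = p["mtry"], ntree = p["ntree"])
  rf$err.rate[nrow(rf$err.rate), "OOB"]  # OOB error after the last tree
})
grid[which.min(grid$oob), ]  # best (mtry, ntree) pair by OOB error
```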

    • @MahdiSafarpour · 3 years ago

      @@statquest Thank you so much for your reply. Could you please point me to an article or book that covers this topic (optimization of RF hyperparameters) in more detail?

    • @statquest · 3 years ago

      @@MahdiSafarpour Here's a great place to start: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm In it Breiman says that the only important parameter is the number of variables selected for each tree.

    • @MahdiSafarpour · 3 years ago +1

      @@statquest Thanks a lot for your help.

  • @bryanparis7779 · 2 years ago +1

    When converted to a categorical variable, hd should have 4 levels: "0", "1", "2", "3". Instead we used the ifelse() function to produce only the 2 levels "0" and "1"... why is that?

    • @statquest · 2 years ago

      Because I wanted to simplify the problem to only identify whether or not someone had heart disease.

    • @bryanparis7779 · 2 years ago +1

      @@statquest Τhank you for answering:)
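      The conversion in question is just an ifelse() that collapses the original severity codes into two labels (a sketch; the data frame name `data` and the "Healthy"/"Unhealthy" labels are assumed from the video's heart-disease example):

```r
# 0 means no heart disease; any non-zero severity code becomes "Unhealthy"
data$hd <- ifelse(data$hd == 0, "Healthy", "Unhealthy")
data$hd <- as.factor(data$hd)  # randomForest needs a factor for classification
```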

  • @jacquelinmontoyahidalgo6714

    great video! do u have any tutorial with regression random forest?

  • @olivermcneice8440 · a year ago

    I had to add 'as.factor(myOutputVariable)' because otherwise it was treated as numeric.

  • @rrrprogram8667 · 6 years ago

    Visiting again and again

  • @sam_AI_Dr · 6 years ago +1

    Hello Joshua, at the point where you were determining the optimal number of variables at each internal node, is there a reason why you selected the empty vector length to be 10?

    • @Rpekeno · 6 years ago

      I'm new to this and have been wondering, this is the thing they call "curse of dimensionality" isn't it? You wanted to make sure you didn't try out too many variables (increasing dimension, and thus overfitting) or too few variables, did I get it right?
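      For reference, the loop in question tries mtry = 1..10 and stores one OOB error per value, so the vector length simply matches the number of candidate mtry values tried (a sketch; `data.imputed` stands in for the video's imputed data frame, and `hd` is the heart-disease outcome):

```r
oob.values <- vector(length = 10)  # one slot per candidate mtry value
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed,
                             mtry = i, ntree = 1000)
  # OOB error rate after the final tree
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), "OOB"]
}
which.min(oob.values)  # mtry value with the lowest OOB error
```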

  • @monicasteffimatchado1780 · 4 years ago +1

    Thank you so much for the clear explanation. I have a microbiome dataset with 133 samples and 431 features. I would like to try RF. How do I decide the range of mtry values?

    • @statquest · 4 years ago +2

      I talk about this in the original Random Forest video: czcams.com/video/J4Wdy0Wc_xQ/video.html You start with the default, which is the square root of number of variables, but can use cross validation to try other values.

  • @joshstat8114 · 6 months ago

    Fellow "Josh" here. Thanks for this video. Could you do a part 2 about random forests in R that uses the `ranger` package? It's basically the same, but faster.

    • @statquest · 6 months ago

      I'll keep that in mind.

    • @joshstat8114 · 6 months ago +1

      @@statquest thanks. I am looking forward to it

  • @SergeySkripko · 4 years ago

    Maybe a stupid question, but I can't understand why we use the dist() function. In your video about imputing missing values, you said that "1 - proximity" is a distance between samples. I understand that. So why do we need to compute a distance over a distance? What's the point? As I see it, every column, say column "i", of "1 - proximity" holds the distances between the "i"th sample and all other samples. And then we calculate the distance(?) between these distances for "i" and another sample, "j". That's weird :)
    On the other side,
    1. dim(1 - proximity) == n_samples * n_samples.
    2. dim(dist(1 - proximity)) == n_samples * n_samples (as well).
    This blows my mind. I see redundancy, like a recursive call: dist(dist(dist(...(1 - proximity)))

    • @statquest · 4 years ago +1

      You found a typo. I meant to call "as.dist()" instead of "dist()". We just want to convert our matrix of proximities into an object of class "dist".
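      The difference between the two calls, in brief (a sketch; `model` is assumed to have been fit with `proximity = TRUE`):

```r
# as.dist() merely RELABELS the existing matrix as a "dist" object:
distance.matrix <- as.dist(1 - model$proximity)   # correct

# dist() would COMPUTE new Euclidean distances between the rows of
# (1 - proximity), i.e. distances of distances -- the typo in the video:
# distance.matrix <- dist(1 - model$proximity)
```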

    • @SergeySkripko · 4 years ago +1

      @@statquest thank you very much! I thought I just didn't understand something

  • @lucpr4501 · 4 years ago

    Good morning. Thank you for your video and your time. May I ask why you use the random forest package for a binary response variable (Y equal to 0 or 1)? Shouldn't we use a Bernoulli loss function instead of the quadratic loss function when splits are performed in the tree?

    • @statquest · 4 years ago

      For classification, randomForest() uses Gini impurity to decide if it should create a new branch. For more information about how Gini impurity is used, see: czcams.com/video/7VeUPuFGJHk/video.html

  • @gabrielcrone6753 · 2 months ago

    Hi, Josh. Excellent video! So helpful and clear! 😄 I am using a new version of randomForest, and I cannot seem to locate the err.rate vector within my model object. When I write "model$err.rate", it returns nothing. Do you know if there are equivalent objects now inside the model for extracting the error-rate info? Thanks!

    • @statquest · 2 months ago

      What is the exact version you are using? 4.7-1.1 has err.rate. You can see it in the documentation here: cran.r-project.org/web/packages/randomForest/randomForest.pdf

  • @rubenpinnata4626 · 4 years ago

    hi Josh! Great videos as always
    a quick question: once you have declared a variable as a factor, can you use MDS?
    You said it is very similar to PCA, and from what I know, PCA needs scaling, which I am not sure will work with categorical variables unless you one-hot encode them, which I don't see here.
    Can you verify that it's okay to use an MDS plot for data with both continuous and categorical variables?
    Thanks, and stay safe

    • @statquest · 4 years ago +1

      We apply MDS to the proximity/distance matrix, which is not the same thing as applying it to the raw data. In other words, the process of creating the proximity matrix converts the factors into distances that are suitable for MDS.

    • @rubenpinnata4626 · 4 years ago +1

      @@statquest perfect! Thanks as always Josh

  • @srinivasv3268 · 5 years ago +2

    Hi, could you please upload a multi-class prediction example: say we have one training and one test data set; first we predict on the training data, then on the test data.
    Thanks

    • @statquest · 5 years ago

      I've only done multi-class prediction in Python, but the documentation for randomForest (the R package) indicates that, just like with Python, there's no difference between predicting two classes and predicting more than two classes.

  • @amitt9053 · 5 years ago

    How do we fill in missing values if they are numeric? (For classification, samples could be created using the possible classes, say Y or N.)

  • @RPDBY · 6 years ago +1

    Thank you for the great tutorial. I am confused though why do we need to impute our outcome variable, is it justified? Wouldn't it be more reasonable to treat the NAs in our outcome variable as unlabeled data and train the model on labeled data only? Imputing an outcome variable seems like a dubious practice, but maybe i am wrong.
    Also, on a technical side, how can we access the actual predicted values per id (i.e. in this case per patient)? Thanks a lot for the video once again!

    • @statquest · 6 years ago +1

      In an ideal world, you would never have to impute anything. But in practice, sometimes data isn't complete and you don't have a lot of it. So, in these situations, you may not have a choice - it's definitely not ideal, though. Your word, "dubious" is a good description!
      You can get the predicted values, which correspond to the rows in the input data, with "model$predicted".
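      Pairing those OOB predictions with row identifiers can be sketched on the built-in iris data:

```r
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

# rf$predicted holds one OOB prediction per training row, in row order
head(data.frame(id = rownames(iris), predicted = rf$predicted))
```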

    • @RPDBY · 6 years ago +1

      Thank you so much for the prompt answers!

  • @fishfeelpain7764 · a year ago

    Isn't the confusion matrix built with the predicted class as rows, and observed class as columns?

    • @statquest · a year ago

      Not always. Unfortunately there is no standard practice.

  • @random-ds · 5 years ago +1

    Hello, thank you again for your excellent video. However, I still have one question: what is the difference between what you did (rfImpute with 6 iterations) and the MissForest algorithm?
    Thank you again!

    • @statquest · 5 years ago

      That's a good question. Unfortunately, I've never used MissForest, so I can't tell you the answer.

  • @ioanastanescu6690 · 2 years ago

    Hey everyone, quick question. When you start building the model you write set.seed(42). Where does that 42 come from? Thanks for the videos, they are really great! :)

    • @statquest · 2 years ago +1

      See: en.wikipedia.org/wiki/42_(number)#The_Hitchhiker's_Guide_to_the_Galaxy

    • @drtlfletcher · 2 years ago

      @@statquest That is both funny and somewhat unhelpful! Do you mean that the set.seed value can be anything and you chose 42 because you like Douglas Adams?
      What parts of the rest of the tutorial will be affected by the 'set.seed' function? Is this just applicable to rfImpute, or will this impact our randomForest function as well?

    • @statquest · 2 years ago

      @@drtlfletcher When we both set the seed for the random number generator to the same number (any number, as long as we use the same one), then we will both get the same results, even though much of the process is "random". Setting the seed affects any "random" events that follow the call to set.seed(), so it affects rfImpute as well as randomForest.
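      The effect of set.seed() is easy to demonstrate; the value itself is arbitrary:

```r
set.seed(42)     # any integer works; 42 is just a Douglas Adams joke
a <- sample(1:100, 5)

set.seed(42)     # resetting the seed replays the same "random" draws
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same sequence
```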

  • @charangrewal6113 · 6 years ago +1

    How do we know which variables the random forest chose to use in the final model?

    • @statquest · 6 years ago +1

      If you build a random forest...
      model