StatQuest: Random Forests in R

  • Date added: 18. 08. 2024
  • Random Forests are an easy-to-understand, easy-to-use machine learning technique that is surprisingly powerful. Here I show you, step by step, how to use them in R.
    NOTE: There is an error at 13:26. I meant to call "as.dist()" instead of "dist()".
    The code that I used in this video can be found on the StatQuest GitHub:
    github.com/Sta...
    If you're new to Random Forests, here's a video that covers the basics...
    • StatQuest: Random Fore...
    ... and here's a video that covers missing data and sample clustering...
    • StatQuest: Random Fore...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/...
    If you'd like to support StatQuest, please consider...
    Support StatQuest by buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumr...
    Paperback - www.amazon.com...
    Kindle eBook - www.amazon.com...
    Patreon: / statquest
    ...or...
    CZcams Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshi...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer....
    ...or just donating to StatQuest!
    www.paypal.me/...
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    #statquest #randomforest #ML

Comments • 404

  • @statquest · 2 years ago +3

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @RaushanKumar-fq7bo · 6 months ago

      I am using this loop command for random forest,
      oob.error.data

    • @statquest · 6 months ago

      @@RaushanKumar-fq7bo Are you using my code, or did you write your own?
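      For context, the loop/plotting code from the video builds a long-format data frame from the model's error-rate matrix. A minimal sketch of that pattern (assuming `model` is a fitted randomForest classifier whose classes are "Healthy" and "Unhealthy", as in the video's heart-disease example):

```r
library(randomForest)

# Long-format data frame of OOB error rates: one row per (tree, error type).
# The class names "Healthy"/"Unhealthy" are assumed from the video's example.
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), times = 3),
  Type  = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"],
            model$err.rate[, "Healthy"],
            model$err.rate[, "Unhealthy"]))
```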

  • @BayAreaLakers · 3 years ago +10

    Can't believe I went from not knowing anything about Machine Learning to learning so much after just a few days. Thanks Josh!

  • @chrisvaccaro229 · 4 years ago +44

    Jesus Christmas this is incredibly useful. I code in R and
    A) it's almost impossible to find ML tutorials for R
    B) it's really hard to find straightforward ML tuts that are free of jargon ANYWAY
    C) it's hard to find tuts in plain english and without talking about "y-hat" and crap i don't even remember from calculus
    D) it's hard to find stat videos with such a good musical score ;)
    and E) these are just awesome.
    I'd literally given up on finding decent ML tuts for R and just said "screw it, I'll learn python" but then I found these accidentally. These are freaking epic. I literally just went through like 25% of your videos hitting "Shift + N" then liking them (next video, like button, next video, like button, next video, like button, etc.)
    These videos are the BEST. You should make a MOOC. Yours would be better and easier to follow than Andrew Ng or Jeremy Howard (who are the superstars of ML.)
    Maybe even make a course on DataCamp. You can make interactive ones that way.
    Either way, these videos are starting from AI heaven.

    • @chrisvaccaro229 · 4 years ago +4

      You know what would be really, really useful? If you made a teaching tutorial. Like, if you made a tutorial outlining your teaching philosophy and how you're able to make explainer videos so clear and concise. That way other teachers, professors, or even CZcamsrs could watch it and apply it to their OWN subjects. That would be like a full-blown meta-improvement to the educational world.

    • @statquest · 4 years ago

      Thank you very much! :)

    • @statquest · 4 years ago +6

      Wow! That is very flattering. I recently gave a talk at Duke University about my teaching style. The talk was called "The Elements of StatQuest". Maybe I'll turn that into a video.

    • @chrisvaccaro229 · 4 years ago

      @@statquest Yea - please do!

    • @chrisvaccaro229 · 4 years ago

      @@statquest Is there any chance you have a video copy of the talk in the meantime you'd be willing to send? I just looked up "The elements of StatQuest" and found a zoom link from Duke, but there was no recorded version available. You don't happen to have a recording, do you?

  • @BT-jh3dq · 3 years ago +4

    I've got so much more out of a couple of hours watching your videos than out of a couple of weeks trying to understand RFs through papers/books. Going back to the papers now, but with much more of a handle on what's going on. Thanks!

  • @cajogos · 4 years ago +11

    These videos using R are a lifesaver (quite literally!) Thanks a lot for these Josh!

  • @sudiptomitra · 3 years ago +3

    This demo is an end-to-end, complete RF-in-R walkthrough!! It can easily be regarded as the "GOAT" in this subject. Thanks, and looking forward to viewing more great demos on ML topics.

  • @Lucrezio81 · 3 years ago +1

    It's rare to find a video like this. Libraries, scripts, methodology, and processes are all so well explained and coherently organized. Even the technical language was amazingly clear for a non-native English speaker like me. I'm amazed that 12 people disliked it!

  • @dr.sangramsinha2784 · 3 years ago +5

    I have recently become a regular follower of your channel. This is awesome. I have learned a lot despite coming from neither a mathematics nor a computer science background. Even as an experimental biologist, I understood most of your videos on regression analysis and am now getting familiar with machine learning. I wonder if you could create a video on protein-protein or protein-ligand interactions using machine learning. I pay my deep respect to the effort you have made to teach us all this complex stuff in such a simple way. Furthermore, you have a beautiful voice too; I love hearing the StatQuest tunes. Lastly, I wish you good health and wealth.

    • @statquest · 3 years ago +1

      Thank you very much! I'm glad my videos are helpful.

  • @nurinurlailasetiawan2689

    Josh your channel is super awesome! I've been struggling to understand ML because I need to work with RF for my hyperspectral data. I read a lot of papers and books, but so far, your videos are the one that helps me the most! Very effectively communicated!!! Big thanks!!!

  • @lauraeli2286 · 2 years ago +1

    You really are the best here on CZcams at explaining these 'complex' topics I think - I put inverted commas because actually they're not so complex anymore after watching your videos! :)

  • @justarandomchannel5246 · a year ago +1

    I was falling asleep reading my coursework material; the ukulele touch and the fun bits you put in make this dreadfully boring subject interesting. Thanks mate!

  • @glauberbrito8685 · 4 years ago +6

    You saved my day, Josh. You did a GREAT JOB !! Congrats.

  • @BrianUrlacherPoliSci · 5 years ago +2

    This was awesome. I've been working for 2 days to wrap my head around the R implementation of this. The code I was working with now makes perfect sense.

  • @shahrizalmuhammadabdillah3127 · 11 months ago +1

    The tricks are so fancy, and they helped me. I'm cheering while watching this...

  • @shahrizalmuhammadabdillah3127 · 11 months ago +1

    I can't believe I'm only watching this now, and I love this StatQuest. Thanks Josh... you opened my mind again to another job.

  • @jasperobico1459 · 5 years ago +1

    Your tutorial video was really helpful! I am not sure if I would be able to do Random Forest without seeing this one! Great job on making a tutorial video that is easy to follow and to understand for non-R users like me. Kudos!

  • @veducatube5701 · 4 years ago +6

    Dear Sir!
    You saved a lot of my time and a lot of my energy. Thank You... God Bless You with health and Wealth.
    Please keep making videos and keep saving our lives...

  • @teetanrobotics5363 · 3 years ago +12

    I love your channel and have almost finished the entire ML playlist. Your explanations, animations and diagrams are just amazing🔥🔥 and far better than most university curricula. I have a request: just like the R tutorials, could you please make Python versions of the machine learning models?

    • @statquest · 3 years ago +2

      I'd like to do that as soon as I have time.

    • @pacificbloom1 · 3 years ago +1

      @@statquest Kindly consider this as a request from one more fan of yours....really need python videos because this is the only channel I have subscribed to learn data science/machine learning

  • @ChunLin_UoE · 5 years ago +6

    Thank you very much - very detailed explanation! It may be easier to convert the err.rate matrix to a data frame and use tidyr::gather() to transform it for ggplot2.
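    The suggested tidyr alternative could look like this (a sketch; `model` is assumed to be a fitted randomForest object):

```r
library(randomForest)
library(tidyr)
library(ggplot2)

# err.rate is a wide matrix: one row per tree, one column per error type
err.df <- as.data.frame(model$err.rate)
err.df$Trees <- 1:nrow(err.df)

# gather() reshapes wide -> long so each error type becomes its own line
long.err <- gather(err.df, key = "Type", value = "Error", -Trees)
ggplot(long.err, aes(x = Trees, y = Error, color = Type)) + geom_line()
```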

  • @kinwong6383 · 5 years ago +2

    I love the way you show both ways of doing certain things. It really helps an R beginner like me a lot!
    Thank you very much! I wish I could visit one of your performances one day.

    • @statquest · 5 years ago +1

      Thank you so much! I'm glad to hear my videos are helpful. :)

  • @adityanjsg99 · 4 years ago +1

    You are such an awesome narrator!
    I depend more on your videos than my teacher.

  • @j.jayelynnshin4289 · 3 years ago +2

    I don't understand ppl who clicked on "dislike" at all. Thank you for doing this!!

  • @MrRoshanchoudhary · 6 years ago

    Hi Joshua, your explanations are mind-blowing. I'm loving it. The way you explain each and every note is simply awesome. I'm grateful to you. Thank you so much. Keep making such videos. Waiting eagerly for Logistic Regression. Bammm!!!!! :)

  • @alexandersierraa · 5 years ago +1

    Thanks a lot Josh, your presentation is very clear and deep

  • @melaniemax6437 · a year ago +1

    thank u so much! really helpful for me as a beginner in machine learning.

  • @ffloresalfaro · 5 years ago +2

    Love your videos! Proximity matrix is excellent. Thanks so much for making these great videos!!

    • @statquest · 5 years ago

      Hooray! I'm glad you like StatQuest! :)

  • @himanshu8006 · 5 years ago +1

    It can't be explained more easily than this... great job Josh, thanks a lot

  • @Rpekeno · 6 years ago

    This video is SO good. I'm a newcomer at this, and your materials have helped me a lot! Thanks!

  • @tizhang9635 · 3 years ago +1

    Thanks very much for your channel!!!! Way easier to understand than reading papers.

  • @nathaliatf · 5 years ago +1

    Great Video! Efficient and not boring at all!!

  • @francinagoh2541 · 3 years ago +1

    Thanks, I learned a lot from your video. Have a nice day!

  • @anushkabanerjee2510 · a year ago +1

    Fantastically explained !!

  • @revenez · 4 years ago +1

    Brilliant and enjoyable!
    Thank you and please keep up the good work.

  • @jitenjaipuria · 6 months ago +1

    thank youuuuuuuuuuuu. i will acknowledge you in my scientific paper

  • @angelique3062 · 4 years ago +3

    Thank you Josh! :) You really have a gift for teaching! Could you please do a random forest regression in R?

    • @statquest · 4 years ago +3

      Possibly! I'll put it on the to-do list.

    • @imanep4902 · 4 years ago

      @@statquest nice, looking forward to it!

    • @yoyohu6522 · 4 years ago

      @@statquest Thanks! looking forward to the RF regression in R.

    • @mariyapak428 · 2 years ago

      @@statquest -- Thank you Josh!

  • @yumikowiranto4330 · 3 years ago +1

    Thank you so much!!!!! This is really helpful for my assignment

    • @statquest · 3 years ago +1

      Glad it was helpful!

    • @yumikowiranto4330 · 3 years ago

      @@statquest is there a limitation in terms of the kind of variables I can include as predictors? For example, can I include race (e.g., white, hispanic, african-american, asian, other)?

    • @statquest · 3 years ago +1

      @@yumikowiranto4330 As far as I know, there are no limitations on the types of variables you can use as predictors.

  • @DanTaninecz · 5 years ago +1

    That mtry trick is pretty slick.

  • @steliosgiannopoulos8297 · 3 years ago +1

    Change the nick to Josh R-Charmer , excellent work thank you for all of your videos !!!

  • @amirgharavi4082 · 5 years ago

    Thanks so much for making these great videos. Really appreciate it

  • @HarshKumar-zc4ox · 5 years ago +2

    Great job Starmer. You explained everything quite nicely.
    However, while explaining the confusion matrix, you got it backwards: the vertical columns are the ground truth and the horizontal rows are the predicted values. The explanation should have been that 28 healthy patients were misclassified as unhealthy, but you explained the opposite. The same goes for the false positives. I saw your confusion matrix lecture; there you explained the confusion matrix correctly.

    • @wei2674 · 4 years ago

      Harsh Kumar, I think R outputs it this way so that 0.14 is the Type I error rate / false-positive rate, which means 23 healthy patients were classified as unhealthy (false positives).
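      The orientation is easy to check directly: in randomForest's confusion matrix the rows are the observed classes and the columns are the OOB-predicted classes. A quick check on the built-in iris data:

```r
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

# rows = observed classes, columns = predicted (OOB) classes,
# plus a class.error column with the per-row error rate
rf$confusion

# the same table rebuilt by hand from the OOB predictions
table(observed = iris$Species, predicted = rf$predicted)
```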

  • @AOLFlyersNewsletters · 4 years ago +1

    Josh - you are like a god! Thanks man.

  • @alecvan7143 · 4 years ago +1

    Super helpful, thanks Josh

    • @statquest · 4 years ago

      Hooray! (by the way, you might be in the running for the most comments from a single viewer! Keep'em up!)

  • @kaam975 · 2 years ago +1

    Thanks for the code!

  • @serman5671 · a year ago +1

    so well explained

  • 4 months ago +1

    Awesome statQuest, did not know you can also impute data using random forests :) How does the analysis of parameters (ntree, mtry) change if we are doing regression instead of classification? Would also love to see a regression example.

    • @statquest · 4 months ago +1

      I've never used it for regression, but I'll keep that topic in mind.

  • @christiansetzkorn6241 · 3 years ago +1

    Great stuff! Thanks!

  • @andreaballestero7780 · 3 years ago +1

    This was very helpful, thank you!! :)

  • @mateuszjaworski2974 · 3 years ago +1

    Hi Josh! It would be great if you could show us how, after building the random forest, to get predictions on brand-new data ;)

  • @wsgsantos · 5 years ago +1

    Very good explanation! Thanks from Brazil! :-)

    • @statquest · 5 years ago +1

      Muito obrigado! :)

    • @pedrosenna100 · 5 years ago +1

      @@statquest I am a professor in an Industrial Engineering course in Brazil and just discovered your channel; I simply loved the videos! I teach logistics but wanted to bring in some data science practices, and your channel is just perfect. I can't thank you enough for the help you gave me by being so didactic!

    • @statquest · 5 years ago

      @@pedrosenna100 Hooray!!! I'm so glad to hear that my videos are helpful in Brazil. It's a beautiful country with an amazing culture. I visited once a few years ago and hope to visit again as soon as I can.

  • @vivianhu3389 · 4 years ago +1

    Super Clear! THANK YOU!

  • @PetalGamesStudios · 4 years ago +1

    Awesome video! Thanks again!

  • @ImGeneralJAckson · 10 months ago +1

    that's it. I'm buying a shirt!

  • @hikikomorihachiman7491 · a month ago +1

    Thank you

  • @balaji.r2735 · 4 years ago +1

    Thank you very much

  • @IamCaptainMan · 3 years ago +1

    Thanks man, you're awesome!

  • @Wissro · a year ago +1

    Thank you so much, could you perhaps make more R tutorials for machine learning techniques?

    • @statquest · a year ago +1

      I'll keep that in mind.

    • @Wissro · a year ago +1

      @@statquest Thanks for the quick reply!

  • @hiteshpant · 4 years ago +1

    hi Josh, I really enjoy watching your videos and like the way you have made statistical topics so easy to interpret. Do you have a video for Feature Selection(varImp) using Random Forest?

  • @PaulO-mv6ku · 5 years ago +1

    Brilliant - many thanks.

  • @andrezaluko · 6 years ago +2

    Josh Starmer, I am your fan! You are very funny =D

  • @iBenutzername · 2 years ago +1

    Awesome as always! Can I ask you to make a video about feature importance in RF models?

  • @AR_Wald · 3 years ago +1

    Hooray!

  • @afcc777f · 6 years ago +14

    Can you make a video about random forests for regression in R?
    Thanks

    • @statquest · 6 years ago +8

      I've added it to the to-do list, but it might be a while before I get to it.

    • @afcc777f · 6 years ago +2

      thanks

    • @baherazzam8863 · 6 years ago +3

      Thank you! I am also looking forward to that

    • @cynical_dd · 6 years ago +3

      Hi, Im hoping for this too! Pretty pleaseeee, thank you!

    • @rajatbhosale8188 · 5 years ago +2

      Even I would like to get that.

  • @benben0814 · 6 years ago

    Hey Josh this is very helpful and thanks for all the work! Does your code include cross validation for the random forest?

  • @charliepierce6218 · 4 years ago +1

    Amazing!

  • @mamahotel1308 · 5 years ago

    Love this, thank you!

  • @lifeboston853 · 6 years ago +1

    Hello Joshua, I watched all your videos and they are so awesome! Will you be able to teach us Shrinkage Method (Ridge, Lasso and PCR), Neural Network, Deep leaning, Image analysis, and video analysis?

    • @lifeboston853 · 6 years ago

      Thanks so much! I am looking forward to all your future videos :)

  • @kaam975 · 2 years ago +1

    and for the video of course :)

  • @moniquebrogan7206 · 2 years ago

    Thanks so much for your great videos. Do you cover Variable Importance in any of your videos?

    • @statquest · 2 years ago +1

      Yes. The most conventional approach is with regularization: czcams.com/video/Q81RR3yKn30/video.html

  • @Pavijace · 6 years ago

    Ukulele... lol... a serious concept explained with fun. Thank you, keep going! :-) "Am going home to you"... nice song, btw.

  • @waasdelcolenwtn · 2 years ago +1

    goat

  • @JoelAgarwal-yl2kw · a year ago

    Hi Josh! Amazing video - has been super helpful in my understanding. Quick question, how would I find the AUC and ROC curve for the random forest model based on the code that you made? I'm trying to compare different models to see which is best (as well as compare to logistic regression).

    • @statquest · a year ago

      I show how to do that exact thing (AUC and ROC for random forest) in this video: czcams.com/video/qcvAqAH60Yw/video.html

  • @imanep4902 · 4 years ago +2

    BAM haha thank you!

  • @fritz3555 · 5 years ago +1

    Thanks for the great video series. What about randomForestSRC package? If we have data with missing values, is it better to use the randomForestSRC package? Or should we use the randomForest package?

    • @statquest · 5 years ago

      Unfortunately, I’ve only used the randomForest library, so I can’t tell you which one is better.

  • @user-bz8nm6eb6g · 4 years ago +1

    Thanks!!

  • @abohisham3088 · 4 years ago +1

    helpful and funny, continue

  • @TheEyeofJun · 5 years ago +1

    Hooray!!!

  • @thuanpin · 5 years ago

    Hi, thanks so much for your great lecture. May I ask two questions?
    1) Why did you relabel sex and hd but not the other categorical variables? The levels of ca and thal change after conversion; do they influence the model?
    2) Do we need to normalize continuous variables before running a random forest?
    Many thanks!

  • @zainabkhan2475 · 4 years ago +1

    Thanks for this video, but I have a question:
    can you add code or an example for RF regression?
    Please...

  • @user-uz1wz4gp9d · 5 years ago

    Fantastic video! Very clear!
    Just one more question: does randomForest work with multiple columns of missing values?

  • @reimiranda3213 · 4 years ago +1

    If you have any ecology examples for these stat quests that would be really useful!

  • @SergeySkripko · 5 years ago +1

    Josh, you used cmdscale() on a default dist(method="euclidean") matrix. Does that mean you did PCA, according to your MDS and PCoA video?

    • @statquest · 5 years ago +1

      Great question! Technically you could say that we did PCA on the distance matrix - but PCA is generally thought of as being applied to the raw data and MDS is applied to a distance matrix. So the difference is sort of in the spirit of how the data is processed, which is relatively minor.
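      The MDS step being discussed looks roughly like this (a sketch; `rf` is assumed to have been fit with `proximity = TRUE`, and `as.dist()` is used per the correction in the video description):

```r
# Convert random-forest proximities to distances and run classical MDS (PCoA)
distance.matrix <- as.dist(1 - rf$proximity)
mds <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)

# percentage of variation captured by each MDS axis
mds.var.per <- round(mds$eig / sum(mds$eig) * 100, 1)
```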

  • @MahdiSafarpour · 3 years ago

    I have two questions about optimization of RF hyperparameters (mtry and ntree):
    1) Should we first find the optimal number of trees and then the optimal number of variables, or must we consider the effect of these two parameters simultaneously?
    2) In this video, we examined the pattern of the OOB error as the number of trees increases. Is it a good decision rule to choose the optimal number of trees based on the OOB error alone, or is it better to use other methods such as cross-validation? (I am looking for the best bias-variance tradeoff.)

    • @statquest · 3 years ago

      1) Ideally you would find them simultaneously.
      2) Depending on who you talk to, you'll probably hear both methods as optimal.
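      Tuning both at once can be sketched as a small grid search scored by the final OOB error (illustrative only, on the built-in iris data; the grid values are arbitrary):

```r
library(randomForest)
set.seed(42)

grid <- expand.grid(mtry = 1:4, ntree = c(500, 1000))
grid$oob <- apply(grid, 1, function(p) {
  rf <- randomForest(Species ~ ., data = iris,
                     mtry = p["mtry"], ntree = p["ntree"])
  rf$err.rate[nrow(rf$err.rate), "OOB"]  # OOB error after the last tree
})
grid[which.min(grid$oob), ]  # best (mtry, ntree) pair by OOB error
```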

    • @MahdiSafarpour · 3 years ago

      @@statquest Thank you so much for your reply. Could you please point me to an article or book that covers this topic (optimization of RF hyperparameters) in more detail?

    • @statquest · 3 years ago

      @@MahdiSafarpour Here's a great place to start: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm In it Breiman says that the only important parameter is the number of variables selected for each tree.

    • @MahdiSafarpour · 3 years ago +1

      @@statquest Thanks a lot for your help.

  • @bryanparis7779 · 2 years ago +1

    When converted to a categorical variable, hd should have 4 levels: "0", "1", "2", "3". Instead we used the ifelse() function to produce only the 2 levels "0" and "1"... why is that?

    • @statquest · 2 years ago

      Because I wanted to simplify the problem to only identify whether or not someone had heart disease.

    • @bryanparis7779 · 2 years ago +1

      @@statquest Τhank you for answering:)
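      The conversion in question is just an ifelse() that collapses the original severity codes into two labels (a sketch; the data frame name `data` and the "Healthy"/"Unhealthy" labels are assumed from the video's heart-disease example):

```r
# 0 means no heart disease; any non-zero severity code becomes "Unhealthy"
data$hd <- ifelse(data$hd == 0, "Healthy", "Unhealthy")
data$hd <- as.factor(data$hd)  # randomForest needs a factor for classification
```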

  • @jacquelinmontoyahidalgo6714

    great video! do u have any tutorial with regression random forest?

  • @olivermcneice8440 · a year ago

    I had to add 'as.factor(myOutputVariable)' because otherwise it was treated as numeric.

  • @rrrprogram8667 · 6 years ago

    Visiting again and again

  • @sam_AI_Dr · 6 years ago +1

    Hello Joshua, at the point where you were determining the optimal number of variables at each internal node, is there a reason why you selected the empty vector length to be 10?

    • @Rpekeno · 6 years ago

      I'm new to this and have been wondering, this is the thing they call "curse of dimensionality" isn't it? You wanted to make sure you didn't try out too many variables (increasing dimension, and thus overfitting) or too few variables, did I get it right?
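      For reference, the loop in question tries mtry = 1..10 and stores one OOB error per value, so the vector length simply matches the number of candidate mtry values tried (a sketch; `data.imputed` stands in for the video's imputed data frame, and `hd` is the heart-disease outcome):

```r
oob.values <- vector(length = 10)  # one slot per candidate mtry value
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed,
                             mtry = i, ntree = 1000)
  # OOB error rate after the final tree
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), "OOB"]
}
which.min(oob.values)  # mtry value with the lowest OOB error
```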

  • @monicasteffimatchado1780 · 4 years ago +1

    Thank you so much for the clear explanation. I have a microbiome dataset with 133 samples and 431 features. I would like to try RF. How do I decide the range of mtry values?

    • @statquest · 4 years ago +2

      I talk about this in the original Random Forest video: czcams.com/video/J4Wdy0Wc_xQ/video.html You start with the default, which is the square root of number of variables, but can use cross validation to try other values.

  • @joshstat8114 · 6 months ago

    Fellow "Josh" here. Thanks for this video. Could you do a part 2 about random forests in R that uses the `ranger` package? It's basically the same, but faster.

    • @statquest · 6 months ago

      I'll keep that in mind.

    • @joshstat8114 · 6 months ago +1

      @@statquest thanks. I am looking forward to it

  • @SergeySkripko · 4 years ago

    Maybe a stupid question, but I can't understand why we use the dist() function. In your video about imputing missing values, you said that "1 - proximity" is a distance between samples. I understand that. So why do we need to compute a distance over a distance? What's the point? As I see it, every column, say column "i", of "1 - proximity" holds the distances between the "i"th sample and all other samples. And then we calculate the distance(?) between these distances for "i" and another sample, "j". That's weird :)
    On the other side,
    1. dim(1 - proximity) == n_samples * n_samples.
    2. dim(dist(1 - proximity)) == n_samples * n_samples (as well).
    This blows my mind. I see redundancy, like a recursive call: dist(dist(dist(...(1 - proximity)))

    • @statquest · 4 years ago +1

      You found a typo. I meant to call "as.dist()" instead of "dist()". We just want to convert our matrix of proximities into an object of class "dist".
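      The difference between the two calls, in brief (a sketch; `model` is assumed to have been fit with `proximity = TRUE`):

```r
# as.dist() merely RELABELS the existing matrix as a "dist" object:
distance.matrix <- as.dist(1 - model$proximity)   # correct

# dist() would COMPUTE new Euclidean distances between the rows of
# (1 - proximity), i.e. distances of distances -- the typo in the video:
# distance.matrix <- dist(1 - model$proximity)
```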

    • @SergeySkripko · 4 years ago +1

      @@statquest thank you very much! I thought I just didn't understand something

  • @lucpr4501 · 4 years ago

    Good morning. Thank you for your video and your time. May I ask why you use the random forest package for a binary response variable (Y equal to 0 or 1)? Shouldn't we use a Bernoulli loss function instead of the quadratic loss function when splits are performed in the tree?

    • @statquest · 4 years ago

      For classification, randomForest() uses Gini impurity to decide if it should create a new branch. For more information about how Gini impurity is used, see: czcams.com/video/7VeUPuFGJHk/video.html

  • @gabrielcrone6753 · 2 months ago

    Hi, Josh. Excellent video! So helpful and clear! 😄 I am using a new version of randomForest, and I cannot seem to locate the err.rate vector within my model object. When I write "model$err.rate", it returns nothing. Do you know if there are equivalent objects now inside the model for extracting the error-rate info? Thanks!

    • @statquest · 2 months ago

      What is the exact version you are using? 4.7-1.1 has err.rate. You can see it in the documentation here: cran.r-project.org/web/packages/randomForest/randomForest.pdf

  • @rubenpinnata4626 · 4 years ago

    hi Josh! Great videos as always
    a quick question: once you have declared a variable as a factor, can you use MDS?
    You said it is very similar to PCA, and from what I know, PCA needs scaling, which I am not sure will work with categorical variables unless you one-hot encode them, which I don't see here.
    Can you verify that it's okay to use an MDS plot for data with both continuous and categorical variables?
    Thanks, and stay safe

    • @statquest · 4 years ago +1

      We apply MDS to the proximity/distance matrix, which is not the same thing as applying it to the raw data. In other words, the process of creating the proximity matrix converts the factors into distances that are suitable for MDS.

    • @rubenpinnata4626 · 4 years ago +1

      @@statquest perfect! Thanks as always Josh

  • @srinivasv3268 · 5 years ago +2

    Hi, could you please upload a multi-class prediction example: say we have one training and one test data set; first we predict on the training data, then on the test data.
    Thanks

    • @statquest · 5 years ago

      I've only done multi-class prediction in Python, but the documentation for randomForest (the R package) indicates that, just like with Python, there's no difference between predicting two classes and predicting more than two classes.

  • @amitt9053 · 5 years ago

    How do we fill in missing values if they are numeric? (For classification, samples could be created using the possible classes, say Y or N.)

  • @RPDBY · 6 years ago +1

    Thank you for the great tutorial. I am confused though why do we need to impute our outcome variable, is it justified? Wouldn't it be more reasonable to treat the NAs in our outcome variable as unlabeled data and train the model on labeled data only? Imputing an outcome variable seems like a dubious practice, but maybe i am wrong.
    Also, on a technical side, how can we access the actual predicted values per id (i.e. in this case per patient)? Thanks a lot for the video once again!

    • @statquest · 6 years ago +1

      In an ideal world, you would never have to impute anything. But in practice, sometimes data isn't complete and you don't have a lot of it. So, in these situations, you may not have a choice - it's definitely not ideal, though. Your word, "dubious" is a good description!
      You can get the predicted values, which correspond to the rows in the input data, with "model$predicted".
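      Pairing those OOB predictions with row identifiers can be sketched on the built-in iris data:

```r
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

# rf$predicted holds one OOB prediction per training row, in row order
head(data.frame(id = rownames(iris), predicted = rf$predicted))
```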

    • @RPDBY · 6 years ago +1

      Thank you so much for the prompt answers!

  • @fishfeelpain7764 · a year ago

    Isn't the confusion matrix built with the predicted class as rows, and observed class as columns?

    • @statquest · a year ago

      Not always. Unfortunately there is no standard practice.

  • @random-ds · 5 years ago +1

    Hello, thank you again for your excellent video. However, I still have one question: what is the difference between what you did (rfImpute with 6 iterations) and the MissForest algorithm?
    Thank you again!

    • @statquest · 5 years ago

      That's a good question. Unfortunately, I've never used MissForest, so I can't tell you the answer.

  • @ioanastanescu6690 · 2 years ago

    Hey everyone, quick question. When you start building the model you write set.seed(42). Where does that 42 come from? Thanks for the videos, they are really great! :)

    • @statquest · 2 years ago +1

      See: en.wikipedia.org/wiki/42_(number)#The_Hitchhiker's_Guide_to_the_Galaxy

    • @drtlfletcher · 2 years ago

      @@statquest That is both funny and somewhat unhelpful! Do you mean that the set.seed value can be anything and you chose 42 because you like Douglas Adams?
      What parts of the rest of the tutorial will be affected by the 'set.seed' function? Is this just applicable to rfImpute, or will this impact our randomForest function as well?

    • @statquest · 2 years ago

      @@drtlfletcher When we both set the seed for the random number generator to the same number (any number, as long as we use the same one), then we will both get the same results, even though much of the process is "random". Setting the seed affects any "random" events that follow the call to set.seed(), so it affects rfImpute as well as randomForest.
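      The effect of set.seed() is easy to demonstrate; the value itself is arbitrary:

```r
set.seed(42)     # any integer works; 42 is just a Douglas Adams joke
a <- sample(1:100, 5)

set.seed(42)     # resetting the seed replays the same "random" draws
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same sequence
```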

  • @charangrewal6113 · 6 years ago +1

    How do we know which variables the random forest chose to use in the final model?

    • @statquest · 6 years ago +1

      If you build a random forest...
      model