StatQuest: Random Forests Part 2: Missing data and clustering

  • Published Jan 14, 2020
  • NOTE: This StatQuest is the updated version of the original Random Forests Part 2 and includes two minor corrections.
    Last time we talked about how to create, use and evaluate random forests. Now it's time to see how they can deal with missing data and how they can be used to cluster samples, even when the data comes from all kinds of crazy sources.
    NOTE: This StatQuest is based on Leo Breiman's (one of the creators of Random Forests) website: www.stat.berkeley.edu/~breima...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    #statquest #randomforest

Comments • 428

  • @statquest
    @statquest  4 years ago +52

    NOTE: This StatQuest is the updated version of the original Random Forests Part 2 and includes two minor corrections.
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @jacksmith870
      @jacksmith870 4 years ago

      Hey Josh, how about a video on parameter estimation, and another video on how to translate complex math equations into more of an intuition? Thanks a lot.

    • @statquest
      @statquest  3 years ago +1

      @@ytpub01 Both corrections start at 1:15 and are related to how the classification (having heart disease) is used to select which other samples are used for the initial guesses. Originally I omitted this detail.

  • @syle3668
    @syle3668 a year ago +17

    Initially I felt the ukulele sound was awkward; now, after 3 days of aggressive learning, that sound is such a relief.
    Thank you Josh.

  • @qusaibasem9022
    @qusaibasem9022 a year ago +44

    Thank you so much for the amazing videos Josh, I don't know what I would have done without those videos in my Data Science journey!

    • @statquest
      @statquest  a year ago +19

      Wow! Thank you so much for your support!!! It means a lot to me that you care enough to contribute. BAM!!!

  • @williamrinauto1498
    @williamrinauto1498 2 years ago +11

    I love these videos because when you get to a concept I don't fully understand, you follow up with.. "check out the statquest for it"... I started with gradient boosting but paused the video and have been detouring for an hour now covering your pre-req videos, and including a couple pre-req's to pre-req's.
    An hour in, sipping on a beer, and I can feel myself getting smarter. A huge advantage to the way you do your videos is that I don't have to pace myself to learn only a concept, or part of a concept, a day. I can binge and stay engaged. Great stuff.

    • @statquest
      @statquest  2 years ago

      Awesome!!! I'm glad you like the videos! :)

    • @3bwhabsodium
      @3bwhabsodium 2 months ago

      Same, I started from gradient boosting -> AdaBoost -> random forest

  • @shannatheragamuffin..743
    @shannatheragamuffin..743 2 years ago +4

    I just loved the video. The way you explained what a proximity matrix is and how we can calculate a distance matrix from it was the best part. None of the websites explained that part. Thanks for making this useful video. You totally nailed it!
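
    To make that concrete, here is a minimal Python sketch of the proximity-to-distance step (an illustration only; the proximity values are made up, and it assumes scikit-learn is installed):

    import numpy as np
    from sklearn.manifold import MDS

    # A made-up 4x4 proximity matrix: symmetric, with 1s on the diagonal.
    # Samples 3 and 4 share a leaf in 80% of the trees.
    prox = np.array([[1.0, 0.1, 0.1, 0.1],
                     [0.1, 1.0, 0.1, 0.1],
                     [0.1, 0.1, 1.0, 0.8],
                     [0.1, 0.1, 0.8, 1.0]])

    # Distance = 1 - proximity: high proximity means small distance.
    dist = 1.0 - prox

    # MDS projects the distance matrix into 2-D for plotting.
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coords = mds.fit_transform(dist)
    print(coords)  # samples 3 and 4 should land near each other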

  • @nicolegroene2053
    @nicolegroene2053 4 years ago +4

    I just came across your videos and love them! They explain stats in such an intuitive way. They provide a perfect overview that makes it so much easier to digest formulas and code later on. Triple Bam , thumbs up and a biiig thank you!

  • @ahanadrall5661
    @ahanadrall5661 3 years ago +14

    These StatQuests are building my life! I would really like to advise our college professors to learn from here and then teach in college!! But I can't do that XD, so instead, I advised all the students to learn from here! All of them love it!
    Thanks Josh!!

  • @rajarajeshwaripremkumar3078

    This idea is amazing!! Never thought of RF being used for clustering.. just amazing!!

    • @statquest
      @statquest  4 years ago +4

      Isn't that cool? Yes, I love that.

    • @rajarajeshwaripremkumar3078
      @rajarajeshwaripremkumar3078 4 years ago

      @@statquest Yea!! Strange that we never realized that it is "eventually" doing a similar job to KNN or K-means to predict.

    • @statquest
      @statquest  4 years ago +17

      @@rajarajeshwaripremkumar3078 Except random forests can do something KNN can not - cluster with categorical features, or with a combination of categorical and continuous features.

    • @davidesouzafernandes6345
      @davidesouzafernandes6345 11 months ago

      @@statquest How could I use the Random Forest algorithm to calculate genetic distance? I really love that idea. I would like to try it on my DNA barcode samples and then cluster the observations. Just for fun! Thank you.

    • @statquest
      @statquest  11 months ago

      @@davidesouzafernandes6345 You could build trees based on whether or not there was a SNP at a certain locus. Or something like that.

  • @Hien6611
    @Hien6611 4 years ago +7

    I really appreciate all of your videos.. I am surviving this semester with your awesome, kind, amazing videos. :)

    • @statquest
      @statquest  4 years ago

      Awesome! Good luck with your classes. :)

  • @RaviPrakash-dz9fm
    @RaviPrakash-dz9fm a year ago +7

    Thanks a lot for these gems! Have an interview coming up and needed a refresher!

    • @statquest
      @statquest  a year ago +3

      BAM!!! Thank you so much for supporting StatQuest!!! :)

  • @tymothylim6550
    @tymothylim6550 3 years ago +2

    Thank you very much for this video! It was fun to watch and I learnt a lot from the step-by-step process of adding in missing data!

  • @mohitgu123456
    @mohitgu123456 4 years ago

    Hello Sir, you made a brave attempt at explaining this topic in a simple manner, but honestly speaking it has gone over my head. I need to practice a lot before I enter this area.

    • @statquest
      @statquest  4 years ago +1

      Did you watch Part 1 of the video? If not check out: czcams.com/video/J4Wdy0Wc_xQ/video.html

  • @adamdeuxieme
    @adamdeuxieme 3 years ago +1

    Your video hypes me up! I'll try all those tricks this spring break 👍

  • @murilopalomosebilla2999
    @murilopalomosebilla2999 2 years ago +1

    RFs are simply amazing. They can predict well, are a valuable variable selection tool, and now you are telling me they can produce a similarity matrix too!

  • @farshadmashhadi4931
    @farshadmashhadi4931 a year ago +2

    Great material!

  • @saadci53
    @saadci53 3 years ago +2

    Laughing and learning? That's how it's supposed to be. Cheers to you, man!

  • @yoyomemory6825
    @yoyomemory6825 4 years ago +2

    Thanks for the plain explanation!!! Great job

  • @davemartin9350
    @davemartin9350 4 years ago +3

    Thanks for sharing; you have a real talent for explaining.

  • @jubaal91
    @jubaal91 4 years ago +1

    I really love your videos!!
    Thanks!
    Greetings from Mexico

  • @szco9814
    @szco9814 2 years ago +1

    Dude! You are better than my professor who taught me RF three years ago!

  • @ksiddarthadshetty3334
    @ksiddarthadshetty3334 a year ago +1

    Did not know that random forests can help in missing value imputation. Thank you 👍

    • @statquest
      @statquest  a year ago +1

      They really are quite cool. However, I've only found that the R version implements these features.

  • @eyalbaum1254
    @eyalbaum1254 2 years ago +1

    your videos are the BEST.

  • @anubhavsoni7620
    @anubhavsoni7620 a year ago +2

    Your teaching is amazing

  • @soyjbm
    @soyjbm 4 years ago +1

    Very good explanation. Thanks.

  • @Coffee_is_all_you_need
    @Coffee_is_all_you_need 2 years ago +1

    DUDE YOU ARE AMAZING :"D

  • @pawanpant9707
    @pawanpant9707 3 years ago +1

    Hurray!!!!!!! A hundred BAMs!!!!!!!! for your video!! You are awesome. In India people consider teachers to be gods, and you've become a god to millions of people. Thanks a lot Josh

  • @kristenli1161
    @kristenli1161 3 years ago +2

    When Josh said this is clearly explained, it's no joke.

  • @pandharpurkar_
    @pandharpurkar_ 3 years ago +1

    You are a really great person!

  • @assaadmrad8767
    @assaadmrad8767 2 years ago +1

    This is the video that made me want to become a BAM! StatQuest member :)

    • @statquest
      @statquest  2 years ago

      HOORAY!!! Thank you so much for supporting StatQuest!

  • @koolmo
    @koolmo 4 years ago +1

    Who does not love StatQuest songs? Can't resist...

  • @julissacotillo
    @julissacotillo 3 years ago +1

    the tree parsing noises are the best

  • @magtazeum4071
    @magtazeum4071 4 years ago +1

    I need a collection of these intro songs, which help me relax from exam stress

    • @statquest
      @statquest  4 years ago +1

      I need to make one. :) Good luck with your exams.

    • @magtazeum4071
      @magtazeum4071 4 years ago +1

      @@statquest Aww..Thank you so much Josh

  • @Raven-bi3xn
    @Raven-bi3xn 3 years ago +2

    Super great video. So much info in a concise and effective manner. Just FYI, no big deal: at 1:55, is 167.5 the mean as opposed to the median?

    • @statquest
      @statquest  3 years ago +4

      It's actually both the mean and the median. Both the mean and median of the two weights associated with people that did not have heart disease, 125 and 210, are 167.5. That said, in general, the mean and median are not equal and in that case you should use the median.
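
      (As a quick check: (125 + 210) / 2 = 167.5, and with exactly two values the median is that same midpoint.)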

  • @Gautam1108
    @Gautam1108 a year ago +1

    Dear Josh, I just purchased your 'Illustrated Guide to Machine Learning' and I wanted to tell you 2 things:
    1 - It really is amazing and the content is explained very well, visually - which is essential. Thank you
    2 - I was, though, a tad bit disappointed to not find Random Forests, XGBoost, PCA, LDA, etc. in it. But I get it - it's only a $20 book. However, I wanted to ask if you would/could release another book containing these slightly advanced aspects which you did not cover in the Illustrated Guide? Please let me know. Looking forward to hearing from you.

    • @statquest
      @statquest  a year ago +1

      Yes, I want to write a book about those topics. First I'm working on a deep learning book, but a book on tree-based methods will follow soon.

  • @maodian9779
    @maodian9779 4 years ago +2

    Thanks for the great video. Can you please make a video on linear mixed models and talk about random effects?

    • @statquest
      @statquest  4 years ago +1

      That's on the to-do list. I'm working on XGBoost right now. Next is Neural Networks, then time series. After that I can work on mixed models.

  • @dansolpa
    @dansolpa 2 years ago +1

    Hello Josh! First, I want to thank you for your awesome videos! I really enjoy watching them and I learn a lot! You put a lot of magic into them. Thank you; you deserve nirvana, heaven, everything!
    Second, I have two questions.
    First one: why are you so great? jajaja
    Second one: the sample's null value is replaced with the most repeated value among the samples that have the same target variable as the one with null values - in this case, the value "NO" in the "Blocked Arteries" column. If the Random Forest is then created with this replacement, won't there be bias when building the model?
    And won't that bias mean that, when refining the guess, the final value will always be the one that was used as the replacement?
    (In this case, the final value was "NO" for the "Blocked Arteries" column.)
    Thanks for reading me! Hope you can help me with this doubt. Have a wonderful day!

    • @statquest
      @statquest  2 years ago

      Say we put "no" in the blocked arteries column for a sample, but all of the values in all of the other columns are similar to samples that have "yes"; then we will probably end up changing the guess to "yes".

  • @khaikit1232
    @khaikit1232 a year ago

    Hey Josh, thank you for the amazing video!
    In your example, you demonstrated how a random forest can deal with categorical missing data when classifying a new sample. How about the following scenarios:
    1) Continuous missing data for a classification problem
    2) Categorical missing data for a regression problem
    3) Continuous missing data for a regression problem

    • @statquest
      @statquest  a year ago

      First, remember that regression trees still have discrete output values - they bin the possible output values. So, in this case, we can just treat them as classification trees with more options. Thus, for question #1 and #3, we can just plug in the median value for each possible output value and follow the steps shown earlier in the video. For #2 we plug in the categorical options for each possible output value.

    • @khaikit1232
      @khaikit1232 a year ago

      @@statquest Is there a typical way that the output values for regression trees are binned? How would I know how many bins are there? Or is the number of bins typically a hyperparameter?

    • @statquest
      @statquest  a year ago +1

      @@khaikit1232 To be honest, I'm not sure off the top of my head how you'd do it.

  • @hadisharifi79
    @hadisharifi79 4 years ago

    Josh, it would be great to provide some reference books for topics too. There are so many books that one can hardly know which one is the best.

    • @statquest
      @statquest  4 years ago +2

      I wish I could, but the only book I ever use is the Introduction to Statistical Learning (which is a free download). Other than that I read the original manuscripts (and, if I remember, I provide links to them in the video's description....but sometimes I forget to add the links).

  • @ashwinpjoby6378
    @ashwinpjoby6378 3 years ago +1

    Damn! You still reply to everyone, even after a year. Quadruple BAM!!

  • @bernardodagostino9590
    @bernardodagostino9590 3 years ago +1

    Thank you

  • @IgnatPenshin
    @IgnatPenshin 3 years ago +1

    Hey Josh!
    Really cool lesson. Thank you!
    But I have a little question:
    The situation at 10:40 has 4 samples.
    Is that enough to make a good guess using an iterative method to predict BA for every HD outcome?
    Don't we need to run all 4 samples down the trees in the forest?
    Thank you

    • @statquest
      @statquest  3 years ago +1

      No, we just use the exact same iterative method described before. We use the most common value for BA given the HD status and then create the similarity matrix and adjust. For more details, see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

  • @paxjitjaruek2
    @paxjitjaruek2 10 months ago

    Hi! An insightful yet simple video. I was wondering how there can be multiple random forests? In the Part 1 video there seems to be one "best" random forest that we use. Would love to know more!

    • @statquest
      @statquest  10 months ago

      What time point, minutes and seconds, are you asking about?

  • @mandycxxxxxx
    @mandycxxxxxx a year ago

    Great stuff!!!! Can you talk about clustering using random forest please? Thanks!!!!

    • @statquest
      @statquest  a year ago

      Umm.... That's what I talk about in this video...

  • @ahaha731
    @ahaha731 2 months ago +1

    Nice humming. So alive

  • @widgetlad5157
    @widgetlad5157 3 years ago +1

    I love this at double speed

  • @veducatube5701
    @veducatube5701 4 years ago +1

    Thanks for covering everything

  • @camilogarcial11
    @camilogarcial11 3 years ago

    Thanks for all your videos. Have you ever considered doing a remake of all the topics you've taught us, but using R in your examples? I would pay for it.

    • @statquest
      @statquest  3 years ago +1

      Are you asking about the webinars or the individual videos (like this one)? If you are asking about Random Forests, I have a video (and code) that shows you how to do it in R: czcams.com/video/6EXPYzbfLCE/video.html

  • @RAZ122333
    @RAZ122333 4 years ago +12

    The creepy triple BAM was the best BAM ever!!!!

  • @athulsrinivas661
    @athulsrinivas661 a year ago

    Josh, great explanation, thanks for the video. I had a question though: the method you discussed for handling missing values seems to make sense for classification, but I don't understand how it would work for regression. Is this entire discussion about classification using random forests?

    • @statquest
      @statquest  a year ago

      Unfortunately I couldn't find documentation on how this would work for regression. :(

  • @gauravsirola5064
    @gauravsirola5064 2 years ago

    Hello Josh! Thanks for the video. Really helpful. I need a clarification and have one follow-up question, about the 2nd type of missing value, i.e., a missing value in Blocked Arteries and in Heart Disease (as we have to predict it).
    One clarification: at this stage, we have completely built the random forest model and the training data is no longer being used. Is this understanding correct?
    If not, then I misunderstood the 2nd type of missing value. If yes, below is the follow-up question.
    The steps are:
    1. Make copies of the sample with both possible outcomes (Yes and No for heart disease). - Understood.
    2. Use the iterative method for estimating the good guess. - At this stage, I have the following question:
    Question: We have built the random forest, so we won't have the training data, just the random forest model. So how do we get the good guess based on just the model?

    • @statquest
      @statquest  2 years ago

      Yes, the second type of missing values only works if we have already trained the random forest. However, hold onto that training data, because you'll need it for this method.

  • @user-mf1xr5ki9j
    @user-mf1xr5ki9j a year ago +1

    Great video indeed, thanks!
    So once you are done with the first type of missing data (i.e., during training), you retrain a random forest, but now using the full dataset, right?

  • @7kiwieee
    @7kiwieee a year ago

    Thanks for the intuitive video. Is "missing at random" the assumption for RF proximity imputation?

    • @statquest
      @statquest  a year ago

      What time point, minutes and seconds, are you asking about?

  • @maciejodziemczyk5249
    @maciejodziemczyk5249 3 years ago +1

    Hi Josh,
    first: thanks for the video, wonderful as always;
    second: I'm trying to implement the Random Forest data imputation algorithm (presented in your video) in Python and I have some questions for you:
    1. Can we say what the RF hyperparameters for every "refine the guess" step should be? Especially max_depth; I think the other hyperparams aren't as important and we can leave them at their defaults, but the default max_depth is None => very deep trees, a kind of overfitting.
    2. Following up on the above: what about the random state of our RFs - should it be fixed? If yes, then "the same" RFs are trained on better and better data at each iteration (our data may converge); if not, then training is more random but impossible to repeat (our data may not converge - my observation).
    3. What should our stopping criterion be? With percentage change (x_{it} - x_{it-1})/x_{it-1} there is a risk of dividing by zero.
    4. Please tell me if I'm right: we can write our weighted-average formula as the dot product of the proximity matrix (each row divided by its sum) and the dataframe values, but we can't include our imputed values, so we have to set them to zero before the dot product?

    • @statquest
      @statquest  3 years ago +1

      I'm glad you're implementing the algorithm, but, I'll be honest, I don't understand your questions about your work. That being said, the R implementation has these features, and you can look at that implementation for guidance. If you want to get a quick overview in how to get random forests working in R, see: czcams.com/video/6EXPYzbfLCE/video.html
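
      For anyone attempting a Python version, here is one minimal sketch of the iterative loop (an illustration only, not the R package's exact algorithm; it assumes a numeric feature matrix with initial guesses already filled in, and for simplicity the weighted average runs over all samples rather than only the non-missing ones):

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      def impute_iterative(X, y, missing_mask, n_iters=6):
          # missing_mask[i, j] is True where X[i, j] was originally missing.
          X = X.copy()
          for _ in range(n_iters):
              rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
              # apply() returns the leaf index of every sample in every tree;
              # proximity = fraction of trees in which two samples share a leaf.
              leaves = rf.apply(X)  # shape (n_samples, n_trees)
              prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
              np.fill_diagonal(prox, 0.0)  # a sample doesn't vote for itself
              for i, j in zip(*np.where(missing_mask)):
                  w = prox[i] / prox[i].sum()   # proximity weights
                  X[i, j] = np.dot(w, X[:, j])  # proximity-weighted average
          return X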

    • @leowhynot
      @leowhynot a year ago

      Did you succeed in what you wanted to do?

  • @doyelmukherjee2769
    @doyelmukherjee2769 3 years ago

    Thanks for this wonderful video, you are excellent. I have a question: when we classify a new patient, what do we do if our missing variable is numeric (say weight is missing)?

    • @statquest
      @statquest  3 years ago

      Presumably use the median values for the two categories to impute the missing value.

  • @vivekjain2542
    @vivekjain2542 2 years ago +1

    Hey Josh, the explanation for filling in missing data in a new sample for a categorical variable such as Blocked Arteries was simply awesome. But what about a continuous variable such as Weight? How do we calculate missing weight data for a new sample?

    • @statquest
      @statquest  2 years ago +2

      It's the exact same as when we did it at 1:45, however, just like we did for Blocked Arteries, we do it for both categories.

    • @vivekjain2542
      @vivekjain2542 2 years ago +1

      @@statquest Thank you for the confirmation 😄

  • @ASoumakie
    @ASoumakie 4 years ago +1

    Thanks, Josh. Apparently, the missing data issue in Random Forests is computationally very expensive when you have a massive number of samples and many missing variable values. Is this fully automated in software packages like SAS and Python?

    • @statquest
      @statquest  4 years ago

      As far as I can tell, it is not implemented for Python; however, it is for R, and I have a demonstration of how to use it here: czcams.com/video/6EXPYzbfLCE/video.html

  • @fawadnazir6978
    @fawadnazir6978 4 years ago

    Hi, great video as always. BTW, can there be a scenario where, for both of the classes (YES, NO) in the testing data, the tree traversal count is the same, i.e., when the missing column is the most important value in deciding whether the target is True or False?

    • @statquest
      @statquest  4 years ago +1

      Possibly. If that is the case, it probably just goes to the left (assumes true). That's what XGBoost does by default.

  • @suhasiyer1613
    @suhasiyer1613 4 years ago +1

    Hi, I have a question: for the missing data #2 scenario (a new sample presented to us), how will we guess the entry if it is a continuous variable? Maybe weight in the above dataset. BTW, awesome content, totally enjoying it, BAM!!!

    • @statquest
      @statquest  4 years ago +3

      Are you asking about what would happen at 10:08 if we had to guess "weight"? We would create two copies, one for Has Heart Disease and one for Does not have heart disease, then we would put the median weight for people that have heart disease and the median value for people that do not have heart disease, and then we would do the proximity thing to refine the guess.

    • @Malikk-em6ix
      @Malikk-em6ix 3 years ago

      @@statquest Firstly, thank you for your efforts. I have a question related to this: you mentioned putting in the median weight for HD and No HD. Is the median weight calculated from the training dataset or the test dataset? Also, we repeat the process until the values converge, but which value: the "missing value" or the "proximity value"?

    • @statquest
      @statquest  3 years ago

      @@Malikk-em6ix 1) Training dataset 2) Until the missing value no longer changes very much

  • @roy6378
    @roy6378 3 years ago +2

    Don't know if someone can answer this question, but for dealing with missing info in the new data, when Josh mentions running the 2 copies of the data through the iterative process, he just means running them through all the trees in the existing Random Forest we have already built, right? (i.e., we're not building new trees as a part of that process)

    • @statquest
      @statquest  3 years ago +1

      That is correct. We do not build new trees to determine the missing values.

    • @roy6378
      @roy6378 3 years ago +1

      @@statquest Awesome. Thanks for the response! And just mirroring what everyone else is saying, love your videos. Your explanations are always amazing! Will definitely make a contribution on Patreon once I get out of this financial rut I'm in.

  • @samuelws1996
    @samuelws1996 4 years ago +2

    I just started learning random forests. Do you know if the imputation for missing data you explained is automatically done by the sklearn random forest regressor? Or is this something that we have to code ourselves?
    Thanks

    • @statquest
      @statquest  4 years ago +1

      I'm not 100% sure, but I do not think the sklearn random forest automatically imputes missing values. Instead, it must be done separately.

    • @ASoumakie
      @ASoumakie 4 years ago +1

      @@statquest Thanks Josh, are you aware of any software package that would impute missing data for random forests in the same way you described?

    • @Mohanrambalaji
      @Mohanrambalaji 4 years ago

      Y'all can use this. Works for me.
      import numpy as np  # needed for np.nan
      from sklearn.impute import SimpleImputer
      # replaces each NaN with the median of its column
      imputer = SimpleImputer(missing_values=np.nan, strategy='median')

    • @kevingao6279
      @kevingao6279 4 years ago

      @Mohanram balaji Hey, does this method impute the median of all the data, or just the median for the "no" class?
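
      For what it's worth, SimpleImputer with strategy='median' uses the median of the whole column. A class-conditional version, closer to what the video describes, could be sketched with pandas (the column names and data here are hypothetical):

      import numpy as np
      import pandas as pd

      df = pd.DataFrame({'heart_disease': ['no', 'no', 'yes', 'no'],
                         'weight':        [125,  210,  180,  np.nan]})

      # Fill each missing weight with the median weight of the rows that
      # share the same heart_disease label (here: the median of 125 and 210).
      df['weight'] = (df.groupby('heart_disease')['weight']
                        .transform(lambda s: s.fillna(s.median())))
      print(df)  # the missing value becomes 167.5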

  • @LQNam
    @LQNam 2 years ago +1

    Thanks, Josh. How about using the random forest in regression to impute and cluster data?

    • @statquest
      @statquest  2 years ago

      Sure, you can do that. Just replace the classification trees with regression trees.

  • @knightedpanther
    @knightedpanther a year ago

    Thanks Josh for the amazing video. I have two questions, and also my opinions on them; if you can, please help guide my thinking:
    1. How do we deal with outliers in the training data before fitting Random Forests?
    Here is what I think:
    I think outliers can mess with the performance of Random Forests and we should remove them. This is because Random Forests will have a tendency to split and isolate those outlier regions whenever the feature with outlier values is selected. If the y values contain outliers, the problem is even more severe.
    2. Does scaling the data before fitting a random forest increase the performance of Random Forests? Or, for that matter, of Gradient Boosting?
    Here is what I think:
    I think that scaling will not have an impact on performance, as when we are splitting we are only looking at the output values for all variables. The final metric, the sum of squared residuals, does not contain any X. We only iterate over a range of possible values of X. So the scale of the Xs should not have any impact on which X is selected.
    Please let me know if I got it right!

    • @statquest
      @statquest  a year ago

      1) Outliers, in general, are problematic.
      2) For trees, in general, there is no need to scale the data.

  • @saranrajk6282
    @saranrajk6282 4 years ago

    Hi, great video man... I have one doubt: we fill the missing value with a guess, then we classify it based on the most frequent correct label (target variable), right?
    How can we make a guess for a single sample?

    • @statquest
      @statquest  4 years ago

      What do you mean by "single sample"?

  • @mahadmohamed2748
    @mahadmohamed2748 2 years ago

    Hi, great video! 10:35 For new data, can't we use the same method for filling in null values as we did with the training data?

    • @statquest
      @statquest  2 years ago +2

      Yes, that's what we do, but we do it for both classes, since we don't know which class the new data belongs to at first.

  • @jc_777
    @jc_777 4 years ago

    3B1B, Welchlabs, and you, StatQuest.... If there were Avengers for teaching, you'd be one.

  • @sunaxes
    @sunaxes a year ago +1

    When you've got missing data and a missing label, plus an existing forest:
    do you use the existing forest to fill in the missing values, and then use that same existing forest to predict the label?
    I feel a different forest should be used to estimate the missing values, as the new sample influences the tree creation process… If someone knows… thank you!

    • @statquest
      @statquest  a year ago

      In practice, we create separate forests. For details, see: czcams.com/video/6EXPYzbfLCE/video.html

  • @megatitchify
    @megatitchify 2 years ago

    Thank you Josh for creating such a clear and easy-to-understand video. How could you make an initial guess if the Heart Disease column was numeric as well? When you are taking your initial guess you wouldn't be able to pick the patients with the same entry as in this example. I hope that makes sense; thanks again.

    • @statquest
      @statquest  2 years ago

      To be honest, I don't know how this works for regression off the top of my head.

    • @megatitchify
      @megatitchify 2 years ago

      @@statquest Thanks for replying. Perhaps a weighted average of the "Weight" entries where the weights are the difference between the known numeric "Heart Disease" values?

    • @carloquinto9736
      @carloquinto9736 5 months ago

      @@megatitchify Hey! I have the same doubt. Have you ever discovered the answer? Thanks!

  • @87everlasting
    @87everlasting a year ago +1

    Thank you a lot! Clearly explained. I just wonder if scikit-learn in Python also deals with the missing values automatically, or if we have to call other commands.

    • @statquest
      @statquest  a year ago

      Unfortunately, the scikit-learn implementation of random forests does not implement any of this cool stuff.

  • @ketiz18
    @ketiz18 a year ago

    Just stumbled on your video; it has been very helpful! I wanted to verify something regarding the heatmap and MDS plot you showed in this specific video. Aren't these technically respective to the first ProxMat in this video (0.9 prox value b/w 3 & 4)? It threw me off for a second because they were shown next to the hypothetical ProxMat you created that showed samples 3 and 4 as being as close as can be (1.0 prox value b/w 3 & 4). 9:18

    • @statquest
      @statquest  a year ago

      To be honest, it's been so long since I created this video that I can't give you a certain answer. However, I hope you can understand the main ideas.

    • @ketiz18
      @ketiz18 a year ago +1

      @@statquest very understandable, all good!

  • @oshinpatwa9794
    @oshinpatwa9794 3 years ago

    Hi, this concept is really cool :) I had a doubt though: can we handle the proximity-based missing value concept using the current sklearn package? If not, how can we do it?

    • @statquest
      @statquest  3 years ago +1

      Unfortunately the sklearn implementation of Random Forests is terrible and does not include this feature. However, the R package does, so I recommend using that instead. Here's the code: czcams.com/video/6EXPYzbfLCE/video.html

  • @abhishekdnyate8508
    @abhishekdnyate8508 4 years ago

    Thanks for the explanation, very precise.
    One question about the last example, the case of missing data in the testing sample: if, instead of blocked arteries, weight was missing, and the prediction column is also numerical instead of binary, what steps are recommended?

    • @statquest
      @statquest  4 years ago

      You would create new "pseudo" observations for each output value in the trees.

    • @abhishekdnyate8508
      @abhishekdnyate8508 4 years ago

      @@statquest Thanks for the reply. As I understand it, instead of 2 copies (yes/no), we will need to create N copies and then try to guess the missing data from the features using the proximity matrix.
      How should this proximity matrix be calculated for numerical columns?

    • @statquest
      @statquest  4 years ago

      @@abhishekdnyate8508 In this case you'll be using regression trees, which still bin their output values into discrete bins. For more details, check out the 'Quest on Regression Trees: czcams.com/video/g9c66TUylZ4/video.html

  • @verandahx2857
    @verandahx2857 3 years ago

    10:44 "Then we use the iterative method we just talked about to..." I am a bit confused about which iterative method is being referred to. Does it include the proximity matrix or not (i.e., do we just guess based on other samples)? Thank you

    • @statquest
      @statquest  3 years ago

      It is the exact method we just talked about. We calculate the proximity matrix over and over again until it converges.

  • @mehmetb5132
    @mehmetb5132 4 years ago +1

    Great videos, thanks a lot! Quick question: at minute 6:17, it feels like it should be: No = 0.1 * 1/3 + 0.8 * 1/3 and Yes = 0.1 * 1/3, accumulating each row's influence onto the 4th row (the unknown row). Otherwise, aren't we double counting?

    • @statquest
      @statquest  4 years ago

      Why do you think we are double counting? The weight function ensures that the sum of the weights for all of the categories (in this case, we only have 2 categories, but we could have more) is equal to 1. Thus, the weights are normalized.

    • @mehmetb5132
      @mehmetb5132 4 years ago +1

      Thanks for the reply. My reasoning was:
      We are trying to predict the unknown value of Blocked Arteries for the 4th row. Each other row (1, 2, and 3) will help us to predict it, based on its Blocked Arteries value and how close (in proximity) it is to the 4th row.
      The 2nd row has the value 'yes' and its proximity is 0.1. So it is saying the 4th row should be 'yes',
      and the weight for Yes should be = 1/3 * 0.1.
      The 1st and 3rd rows have the value 'no'. In other words, they are voting 'no' for the 4th row's unknown value.
      The 1st row's proximity is 0.1. The weight for 'No' from the 1st row is: 1/3 * 0.1.
      The 3rd row's proximity is 0.8. The weight for 'No' from the 3rd row is: 1/3 * 0.8.
      Total weight for 'No' = (1/3 * 0.1) + (1/3 * 0.8).
      (I think this approach is also aligned with the following section on how to calculate the numeric unknown values.) (Sorry, my post became a bit lengthy; I just wanted to make sure I am not missing the point.) Thanks again!

    • @statquest
      @statquest  4 years ago

      @@mehmetb5132 Leo Breiman, who created random forests, says "[For] a missing categorical variable, replace it by the most frequent non-missing value where frequency is weighted by proximity." ( see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 )
      I'm pretty sure that the frequencies of the "non-missing values" are 1/3 and 2/3, and not 1/3, 1/3 and 1/3, because there are only 2 "non-missing values" and using 3 frequencies suggests that there are more than 2 "non-missing values". So, in this example we are creating weights for the frequencies of the non-missing values, 1/3 and 2/3. Does that make sense, or am I misinterpreting Leo Breiman's text?

    • @mehmetb5132
      @mehmetb5132 4 years ago +1

      @@statquest Thanks for the comment and the link to the original inventor's page, awesome! To be honest, I am having a hard time interpreting the original text from Leo Breiman's sentence. I feel like it could go either way. It's great that he included the code for his implementation. It is in Fortran!!! I guess curiosity killed the cat; I tried to decipher it.
      (www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm)
      He has an 'impute' function, yay! There are two sections (loops), first one is for numerical values (if(cat(m).eq.1) then). Second, for categorical values (if(cat(m).gt.1) then)
      (www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm#c2)
      Inside the second loop,
      1 do m=1,mdim
      2 if(cat(m).gt.1) then
      3 do n=1,near
      4 if(missing(m,n).eq.1) then
      5 call zervr(votecat,maxcat)
      6 do k=1,nrnn
      7 if (missing(m,loz(n,k)).ne.1) then
      8 j=nint(x(m,loz(n,k)))
      9 votecat(j)=votecat(j)+real(prox(n,k))
      10 endif
      11 enddo !k
      12 rmax=-1
      13 do i=1,cat(m)
      14 if(votecat(i).gt.rmax) then
      15 rmax=votecat(i)
      16 jmax=i
      17 endif
      18 enddo
      19 x(m,n)=real(jmax)
      20 endif
      21 enddo !n
      22 endif
      23 enddo !m
      He has an array called 'votecat' that he fills with zeros first (line 5).
      Then, in an inner loop, he accumulates (just adds) the proximity values. It seems he assumes that each row without a missing value participates equally (line 9).
      Finally, he picks the maximum as the winner (line 15).
      I can't see that he is doing any additional work in terms of frequencies, unless I am missing something?
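
      A rough Python rendering of that categorical branch (my own translation, so treat it as a sketch) makes the accumulation easier to see:

      import numpy as np

      def impute_categorical(x, missing, prox, n_categories):
          # x[m][n] is the integer category code of feature m for sample n;
          # missing[m][n] marks the entries to fill in.
          # (Breiman restricts the inner loop to the nearest neighbors;
          # here every sample participates, for simplicity.)
          for m in range(len(x)):                  # loop over features
              for n in range(len(x[m])):           # loop over samples
                  if missing[m][n]:
                      votes = np.zeros(n_categories)        # 'votecat', zeroed
                      for k in range(len(x[m])):            # loop over voters
                          if not missing[m][k]:
                              votes[x[m][k]] += prox[n][k]  # add raw proximity
                      x[m][n] = int(np.argmax(votes))       # the winner, 'jmax'
          return x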

    • @xixi1796
      @xixi1796 3 years ago +1

      @@statquest Hi Josh, awesome video! But I think @Mehmet B is correct. Say we have n rows without missing values. Each row just makes an accumulating contribution of 1/n * proximity. You can't calculate the aggregate frequencies first and then apply the proximity scores to the aggregate frequencies, where a proximity score would affect other, non-corresponding rows. Leo Breiman's wording might be a little confusing, but I believe by saying "where *frequency* is weighted by proximity", he means it in a row-by-row manner rather than calculating frequencies overall. This is also reflected in the source code that @Mehmet B has shown.

  • @farooq36901
    @farooq36901 2 years ago

    Excellent video; however, I did not fully understand the denominator used in calculating the weighted values from the proximity matrix.

    • @statquest
      @statquest  2 years ago

      It's basically the same technique used to calculate a weighted mean. www.statisticshowto.com/probability-and-statistics/statistics-definitions/weighted-mean/
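
      As a quick worked example with made-up numbers: with proximities (0.2, 0.6) to two samples whose weights are 100 and 200, the weighted value is (0.2 * 100 + 0.6 * 200) / (0.2 + 0.6) = 140 / 0.8 = 175. The denominator just rescales the proximities so they behave like weights that sum to 1; without it, the result (140) wouldn't be on the scale of the data.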

  • @YousefRoshdy
    @YousefRoshdy 2 years ago

    Thanks for the amazing videos. I have a question about the missing values in the example at 11:25. The two copies were:
    the first one, Blocked Arteries: Yes and Heart Disease: Yes;
    the second one, Blocked Arteries: No and Heart Disease: No.
    Do we have to make four of these, like below, just to cover all the possibilities?
    the first one, Blocked Arteries: Yes and Heart Disease: Yes
    the second one, Blocked Arteries: No and Heart Disease: Yes
    the third one, Blocked Arteries: Yes and Heart Disease: No
    the fourth one, Blocked Arteries: No and Heart Disease: No

    • @statquest
      @statquest  2 years ago

      No. At 10:40 I explain how we find the optimal values for "blocked arteries" without having to try all possible combinations.

  • @ratchainantthammasudjarit9917

    I like your video; however, at 10:51, I could not imagine how to come up with such imputed values for blocked arteries for those 2 samples. Since the iterative guessing looks at other samples (e.g., at 5:02) and uses the proximity matrix, how can just one unseen sample (at 10:15) do that?

    • @statquest
      @statquest  2 years ago

      I'm not sure I fully understand your question, but at 5:02 we see that the one person that has heart disease also has blocked arteries, so it makes sense that, when we impute "blocked arteries" for someone new that has heart disease, we would impute "yes" for blocked arteries. Likewise, the other people without heart disease at 5:02 also did not have blocked arteries. So it makes sense to impute "no" for someone new that does not have heart disease.

  • @20060802Lin
    @20060802Lin 4 years ago

    I have a question. For the missing data #2 scenario, if I have multiple labels, does that mean I need to make multiple copies, one with each potential label? Thank you so much!!

  • @sandeepganage9717
    @sandeepganage9717 2 years ago

    I am confused between entropy and the Gini index. Which method is used in libraries like sklearn to calculate the impurity of a decision tree?? Or which method should be used to split the nodes in a decision tree?

    • @statquest
      @statquest  2 years ago

      I believe both methods are available, but Gini is the default.

  • @huipingchen766
    @huipingchen766 7 months ago

    It's an awesome video. May I ask: for a missing value in a new sample, if it is a regression problem, how do we do it?

    • @statquest
      @statquest  7 months ago

      I'm not 100% certain, but you could probably just plug in a bunch of values and test each one.

  • @amanoswal7391
    @amanoswal7391 3 years ago

    Random Forests build multiple trees initially and then impute missing data using those trees (the proximity table method). Now, these trees will be used for predicting the final output.
    Hence, filling in the missing values in the training set has no effect on the prediction for our test dataset. Then why do we bother filling in the missing values in RF?

    • @statquest
      @statquest  3 years ago +1

      Once you impute the data, you can build a new random forest with the full dataset.

  • @kshitijpemmaraju4177
    @kshitijpemmaraju4177 4 years ago

    Timestamp 9:24 - what do the black and blue colors in the heatmap represent, and how are you correlating the heatmap colors with the distance matrix numbers?

    • @statquest
      @statquest  4 years ago

      To understand heatmaps, see: czcams.com/video/oMtDyOn2TCc/video.html

  • @aggelosdidachos3073
    @aggelosdidachos3073 4 years ago

    Regarding the last part, would it make sense to make two more assumptions and see how often they are predicted correctly using random forests? That is, 1) Blocked Arteries = Yes, Heart Disease = No; 2) Blocked Arteries = No, Heart Disease = Yes.

    • @statquest
      @statquest  4 years ago +1

      You could try that, but it makes more sense to create two samples, one that has heart disease and one that doesn't, and then use the iterative method described earlier to impute the best value.

  • @bimalkharel7231
    @bimalkharel7231 3 years ago

    Quick question - at 1:55 you say median, but it's also equal to the mean. Are they the same when only two values are sampled (and would they be different in a larger dataset, so use the median)?

    • @statquest
      @statquest  3 years ago

      Yes, in this example, the mean and median are the same because there are only two samples. If there were more, then, chances are, they would not be the same and we should use the median.

  • @hellochii1675
    @hellochii1675 4 years ago +1

    I'm wondering, for the second part of this video, how should we guess the initial missing numerical values? What guessing options do we have in this case? Thank you~~~~ BamBam

    • @hellochii1675
      @hellochii1675 4 years ago

      In general, how should I make the initial guess if we have multiple labels, or even a numerical continuous output, instead of a binary classification problem? Should we create multiple samples?

    • @statquest
      @statquest  4 years ago +1

      If a feature or variable is binary or categorical (or some other discrete value), we choose the option that is most common among observations that have the same outcome. For continuous values, we take the median value among observations with the same outcome. This is described starting at 1:12

    • @hellochii1675
      @hellochii1675 4 years ago +3

      @@statquest Thanks for the answer~~ I think my words were confusing. I understand the above method will work for binary or multilabel classification trees. What if we want to solve a random forest regression problem, where the outcomes are continuous values? How can we fill in the missing values for the training samples (samples used to create our forest) and the testing samples (samples for which we want to predict the outcome using our forest)? Thank you BAM BAM BAM~

    • @statquest
      @statquest  4 years ago +4

      @@hellochii1675 That's a good question and I don't know the answer to it. The good news is that XGBoost does regression and has a way of dealing with missing values in that context - so we have options that are similar that we can use if we need to.

  • @phuonglehuy6257
    @phuonglehuy6257 9 months ago

    Great video. But there is something in type 2 where I wonder what your solution is: what if the missing data is a numeric type, such as weight? How do we handle it?

    • @statquest
      @statquest  9 months ago

      To be honest, I'm not sure I would use a random forest for a regression problem to begin with.

  • @preet111
    @preet111 a year ago

    What do we mean by the iterative method when finding missing data in a sample to classify? Is that the one we used to impute the training data?

    • @statquest
      @statquest  a year ago

      Yes

    • @preet111
      @preet111 a year ago +2

      @@statquest Thanks for the clarification. Love your work, it's so interesting; I can binge-watch the entire playlist.

  • @kangxiluo5821
    @kangxiluo5821 3 years ago +1

    Much thanks from a sophomore!

  • @20060802Lin
    @20060802Lin 4 years ago

    I am really sorry to bombard you with questions.... I was just thinking that after we turn a decision tree into a random forest, we lose some interpretability of the data, for example, which variable contributed the most to the final decision. After looking at random forest clustering (I love it! I haven't seen another resource that actually expands on this), I was thinking that since we can turn it into a heat map or an MDS plot, we can still get some idea of which variable matters the most, right? I want to make sure I am getting the right idea.

    • @statquest
      @statquest  4 years ago +1

      Actually, Random Forests have a specific way to calculate variable importance. To quote from the source (the source is here: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm ) "In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m."
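
      That procedure is straightforward to approximate with scikit-learn's generic permutation importance (a sketch; note that sklearn permutes features on a held-out set you supply rather than on the out-of-bag cases Breiman describes):

      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.inspection import permutation_importance
      from sklearn.model_selection import train_test_split

      X, y = load_breast_cancer(return_X_y=True)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

      # Permute each feature in turn and measure how much accuracy drops.
      result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
      for i in np.argsort(result.importances_mean)[::-1][:5]:
          print(f"feature {i}: importance {result.importances_mean[i]:.4f}")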

    • @20060802Lin
      @20060802Lin 4 years ago +1

      @@statquest Thank you so much! You pointed me to the right keywords/content to Google (don't know why I didn't think of this myself before....) and now I have a lot more info about this. This is so cool!

    • @statquest
      @statquest  4 years ago +1

      @@20060802Lin Hooray!!! And thank you so much for supporting StatQuest!!!! It really means a lot to me when someone cares enough to contribute.

    • @20060802Lin
      @20060802Lin 4 years ago +1

      @@statquest Thank you for caring enough to answer all mine and others' questions :)

  • @teenradio1759
    @teenradio1759 a year ago

    Thanks for this video! I have a question: you explained how to fill in missing data for a new object we want to predict, but you only explained how to do it when the missing data is binary. How would I do it for data that isn't binary, like height or weight?

    • @statquest
      @statquest  a year ago

      You start by using the median value.

  • @beckswu7355
    @beckswu7355 3 years ago

    At 2:21, step 1, when building the tree: should the tree be built with the initial guessed values, or should only the data with no missing values be used to build the tree? Thank you

    • @statquest
      @statquest  3 years ago +1

      I believe we build the trees with the guessed values filled in. For all of the details, see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

    • @beckswu7355
      @beckswu7355 3 years ago +1

      @@statquest Thank you for the quick reply. You are the best explainer of ML I have ever found. Thank you

  • @alex_zetsu
    @alex_zetsu 4 years ago

    Hmmm, I'm wondering: suppose you had a random sample with data from 11130 patients about their chest pain; 120 about their blood circulation; 40 about whether they had blocked arteries; 7 about their chest pain and blood circulation; 8 about their chest pain and whether they had blocked arteries; 3 about their chest pain and weight; 31 about their chest pain, blood circulation, and whether they had blocked arteries; 30 about their chest pain, weight, and blood circulation; 10 about their chest pain, whether they had blocked arteries, and their weight; and 371 with all 4 inputs. And of these 11175 patients we knew if they had heart disease or not. Would a random forest made with all the data be much better than one that dropped all the "one input" patients?

    • @statquest
      @statquest  4 years ago +1

      Try it! Interestingly, another method called XGBoost is designed to work with missing data like you have. I've put out 2 videos on it so far and I've got at least 2 more to go. I won't get to the part about dealing with missing data until part 4. Here's the link to part 1: czcams.com/video/OtD8wVaFm6E/video.html

  • @tudorpricop5434
    @tudorpricop5434 10 months ago

    At 10:33, after you created the 2 copies of the data, one with heart disease and one without: I don't understand what you mean by "the iterative method we just talked about" and how those guesses were selected.
    In the method you mentioned at 1:28 we chose the "most common value found in other samples", but this is a new sample that we want to categorize, not training data, which confuses me a bit.

    • @statquest
      @statquest  10 months ago

      At 10:33 we are trying to impute a value for Blocked Arteries; however, we don't know whether to use the training data associated with people who have heart disease or the training data associated with people who do not. So we try both and use them to impute the value for Blocked Arteries. We then have to decide which one is better, and we select the one that is labeled correctly the most times.
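
      One way to sketch that selection step (purely illustrative; it assumes a trained scikit-learn forest rf and one already-imputed feature vector per assumed label, and it relies on the fact that sklearn's sub-trees return class indices):

      import numpy as np

      def classify_with_missing(rf, candidates):
          # candidates maps an assumed label -> the new sample's features,
          # with the missing entries imputed under that assumption.
          # Keep the copy whose assumed label wins the most tree votes.
          best_label, best_votes = None, -1
          for assumed_label, x in candidates.items():
              x = np.asarray(x, dtype=float).reshape(1, -1)
              votes = 0
              for tree in rf.estimators_:
                  # sub-trees predict class *indices*; map back to labels
                  pred = rf.classes_[int(tree.predict(x)[0])]
                  votes += (pred == assumed_label)
              if votes > best_votes:
                  best_label, best_votes = assumed_label, votes
          return best_label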

  • @williamshyu9032
    @williamshyu9032 a year ago +1

    Are there any statistical justifications for these methods, or are they just engineering fixes?

    • @statquest
      @statquest  a year ago +1

      It depends on the method - often statistics is playing catch up to ML, with the ML method created and the statistical justification for why it works so well coming later. But I believe this method started out with justification as the creator, Leo Breiman, is in the stats department at Berkeley.

  • @himanshuparida8813
    @himanshuparida8813 7 months ago

    @statquest After we find the missing values the first time using the proximity matrix, should we retrain the random forest for the next iteration, or just go ahead with constructing the new proximity matrix? Whichever it is, please clarify whether the same should be done when predicting the missing data in the test set, where a value is missing and we want to predict the output, y.

    • @statquest
      @statquest  7 months ago

      I believe your question is answered at 7:40

    • @himanshuparida8813
      @himanshuparida8813 7 months ago

      @@statquest What about the 2nd case, when we have missing data in a new sample that we want to categorise? Do we then also need to retrain the random forests after we estimate the missing values in each iteration?

    • @statquest
      @statquest  7 months ago

      @@himanshuparida8813 You do it the same way for both.

    • @himanshuparida8813
      @himanshuparida8813 6 months ago +1

      Thanks

  • @ammarkhan2611
    @ammarkhan2611 3 years ago

    At 10:52, in the final step, shouldn't we also use the combinations in which Blocked Arteries is "No" and Heart Disease is "Yes", and Blocked Arteries is "Yes" and Heart Disease is "No", and then run the data through the decision trees in order to figure out which is the best combination out of the 4?

    • @statquest
      @statquest  3 years ago

      You could do that, but random forests use the value for Blocked Arteries that is most commonly associated with "YES Heart Disease" and the value for Blocked Arteries that is most commonly associated with "No Heart Disease".

    • @ammarkhan2611
      @ammarkhan2611 3 years ago

      Can it be considered a drawback of Random Forests that it picks the mode without taking the other possibilities into consideration?

    • @statquest
      @statquest  3 years ago

      No, because it follows the exact same procedure as before - it sees how the sample clusters and adjusts as needed.

  • @samymostefai7644
    @samymostefai7644 a year ago

    I wonder if this is implemented in Python in the predefined sklearn estimators, or if we have to implement it ourselves.

    • @statquest
      @statquest  a year ago +1

      Unfortunately I don't believe it is in the Python implementation. However, it is part of the R implementation, and I show how to do it here: czcams.com/video/6EXPYzbfLCE/video.html

  • @rajatshrivastav
    @rajatshrivastav 4 years ago

    Hi Josh!!
    It was a very nice Quest. I have a question regarding a missing value in the data to be categorized, i.e., the second type of missing value: how do we start off when a numerical value is missing at the very first step and we are also predicting a continuous numerical variable (e.g., Sales in Millions)?
    Q2) How do we start off for both categorical and numerical missing values when the target column is continuous?
    Thank you

    • @statquest
      @statquest  4 years ago

      I don't know the answer to that.

    • @rajatshrivastav
      @rajatshrivastav 4 years ago +1

      @@statquest I somehow found the answer to that: when the target is continuous, it starts by taking the median value for missing continuous variables and the mode for missing categorical variables, and then computes the proximity-weighted average of the missing values. This process is repeated several times. Then the model is trained a final time using the RF-imputed data set.

    • @statquest
      @statquest  4 years ago

      @@rajatshrivastav Cool! Thanks for looking that up!

    • @rajatshrivastav
      @rajatshrivastav 4 years ago +1

      Source: Stats Exchange website

  • @greece4surf
    @greece4surf 4 years ago

    @ 1:58 you have calculated the mean value, not the median value, which is just the (N//2)th element in a sorted array of N elements

    • @statquest
      @statquest  4 years ago +1

      When you only have two numbers, 125 and 210, the mean = the median = 167.5

  • @alexisward6435
    @alexisward6435 2 years ago

    3:44 How would a decision tree like this be made? That is, why does the leaf node end right there without further expansion?

    • @statquest
      @statquest  2 years ago

      Again, these are just normal decision trees, so to learn more about them, check out: czcams.com/video/_L39rN6gz7Y/video.html

  • @jmarcos1003
    @jmarcos1003 2 years ago

    What if the missing value in the new sample (10:40) was "Weight"? How should we guess it?

    • @statquest
      @statquest  2 years ago

      Just like I said at 1:45, we use the median value for the guess.