StatQuest: Random Forests Part 2: Missing data and clustering
- Added on 14. 01. 2020
- NOTE: This StatQuest is the updated version of the original Random Forests Part 2 and includes two minor corrections.
Last time we talked about how to create, use and evaluate random forests. Now it's time to see how they can deal with missing data and how they can be used to cluster samples, even when the data comes from all kinds of crazy sources.
NOTE: This StatQuest is based on the website of Leo Breiman (one of the creators of Random Forests): www.stat.berkeley.edu/~breima...
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - statquest.gumroad.com/l/wvtmc
Paperback - www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - www.amazon.com/dp/B09ZG79HXC
Patreon: / statquest
...or...
CZcams Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshirt.com/statques...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
#statquest #randomforest
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
hey josh, how about a video on parameter estimation, and another video on how to translate complex math equations into more of an intuition? thanks a lot.
@@ytpub01 Both corrections start at 1:15 and are related to how the classification (having heart disease) is used to select which other samples are used for the initial guesses. Originally I omitted this detail.
Initially I felt the ukulele sound was awkward; now, after 3 days of intensive learning, that sound is such a relief.
Thank you Josh.
bam! :)
Thank you so much for the amazing videos Josh, I don't know what I would have done without those videos in my Data Science journey!
Wow! Thank you so much for your support!!! It means a lot to me that you care enough to contribute. BAM!!!
I love these videos because when you get to a concept I don't fully understand, you follow up with.. "check out the statquest for it"... I started with gradient boosting but paused the video and have been detouring for an hour now covering your pre-req videos, and including a couple pre-req's to pre-req's.
An hour in, sipping on a beer, and I can feel myself getting smarter. A huge advantage to the way you do your videos is that I don't have to pace myself to learn only a concept, or part of a concept, a day. I can binge and stay engaged. Great stuff.
Awesome!!! I'm glad you like the videos! :)
same, I started from gradient boosting -> ada boosting -> random forest
I just loved the video. The way you explained what is proximity matrix and how we can calculate distance matrix from it was the best part. None of the websites explained that part. Thanks for making this useful video. You totally nailed it!
Glad it was helpful!
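For anyone who wants to try that proximity-to-distance step themselves, here's a minimal Python sketch (the proximity values below are made up for illustration; in a real analysis they come from counting how often pairs of samples end up in the same leaf):

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical 4x4 proximity matrix: entry [i, j] is the fraction of trees in
# which samples i and j end up in the same leaf (values made up for illustration).
proximity = np.array([
    [1.0, 0.1, 0.1, 0.1],
    [0.1, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

# Distance = 1 - proximity: samples that often share a leaf end up close together.
distance = 1.0 - proximity

# MDS on the precomputed distance matrix gives 2-D coordinates for plotting.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distance)
```

Samples 3 and 4 (high proximity) should then land near each other in the resulting plot, just as in the video's MDS example.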
I just came across your videos and love them! They explain stats in such an intuitive way. They provide a perfect overview that makes it so much easier to digest formulas and code later on. Triple Bam , thumbs up and a biiig thank you!
Thank you very much! :)
These StatQuests are building my life! I would really like to advise our college professors to learn from here and then teach in college!! But I can't do that XD, so instead I advised all the students to learn from here! All of them love it!
Thanks Josh!!
Thank you very much!!! :)
This idea is amazing!! Never thought of RF being used for clustering.. just amazing!!
Isn't that cool? Yes, I love that.
@@statquest Yeah!! Strange that we never realized it is "eventually" doing a similar job to KNN or K-means to predict.
@@rajarajeshwaripremkumar3078 Except random forests can do something KNN cannot - cluster with categorical features, or with a combination of categorical and continuous features.
@@statquest How could I use the Random Forest algorithm to calculate genetic distance? I really love that idea. I would like to try it on my DNA barcode samples and then cluster the observations. Just for fun! Thank you.
@@davidesouzafernandes6345 You could build trees based on whether or not there was a SNP at a certain locus. Or something like that.
I really appreciate all of your videos.. I am surviving this semester with your awesome, kind, amazing videos. :)
Awesome! Good luck with your classes. :)
Thanks a lot for these gems! Have an interview coming up and needed a refresher!
BAM!!! Thank you so much for supporting StatQuest!!! :)
Thank you very much for this video! It was fun to watch and I learnt a lot from the step-by-step process of adding in missing data!
You are so welcome!
Hello Sir, you made a brave attempt at explaining this topic in a simple manner, but honestly speaking it has gone over my head. I need to practice a lot before I enter this area.
Did you watch Part 1 of the video? If not check out: czcams.com/video/J4Wdy0Wc_xQ/video.html
Your video hypes me up ! I'll try all those tricks this spring break 👍
Bam! :)
RFs are simply amazing. They can predict well, are a valuable variable selection tool, and now you are telling me they can produce a similarity matrix too!
bam! :)
Great material!
Thank you!
Laughing and learning ? That's how it's supposed to be, cheers to you man !
Bam! :)
thanks for the plain explanation!!! great job
Thank you! :)
Thanks for sharing, with a real talent for explaining
Thank you! :)
I really love your videos!!
Thanks!
Greetings from Mexico
Muchas gracias!!! :)
Dude! You are better than my professor who taught me the RF three years ago!
Thanks!
Did not know that random forests can help in missing value imputation. Thank you 👍
They really are quite cool. However, I've only found that the R version implements these features.
your videos are the BEST.
Wow, thanks!
your teaching is amazing
Thank you!
Very good explanation. Thanks.
Bam! :)
DUDE YOU ARE AMAZING :"D
Thanks 😆!
Hurrey!!!!!!! Hundred Bam!!!!!!!! for your video!! You are awesome. In India people consider teacher as a god and you've become a god of millions of people. Thanks a lot Josh
BAM! Thank you very much! :)
When Josh said this is clearly explained, it's no joke.
:)
You are a really great person!
Thank you! :)
This is the video that made me want to become a BAM! StatQuest member :)
HOORAY!!! Thank you so much for supporting StatQuest!
who does not love StatQuest Songs? can't resist...
Hooray! :)
the tree parsing noises are the best
:)
I need a collection of these intro songs, which helped me relieve exam stress
I need to make one. :) Good luck with your exams.
@@statquest Aww..Thank you so much Josh
Super great video. So much info in a concise and effective manner. Just FYI, no big deal, but at 1:55, is 167.5 the mean as opposed to the median?
It's actually both the mean and the median. Both the mean and median of the two weights associated with people that did not have heart disease, 125 and 210, are 167.5. That said, in general, the mean and median are not equal and in that case you should use the median.
Dear Josh, I just purchased your 'Illustrated Guide to Machine Learning' and I wanted to tell you 2 things:
1 - It really is amazing and the content is explained very well, visually - which is essential. Thank you
2 - I was, though, a tad bit disappointed to not find Random Forests, XGBoost, PCA, LDA, etc. in it. But I get it - it's only a $20 book. However, I wanted to ask if you would/could release another book containing these slightly advanced aspects which you did not cover in the Illustrated Guide? Please let me know. Looking forward to hearing from you.
Yes, I want to write a book about those topics. First I'm working on a deep learning book, but a book on tree-based methods will follow soon.
Thanks for the great video. Can you please make a video on linear mixed models and talk about random effects?
That's on the to-do list. I'm working on XGBoost right now. Next is Neural Networks, then time series. After that I can work on mixed models.
Hello Josh! First, I want to thank you for your awesome videos! I really enjoy watching them and I learn a lot! You put a lot of magic into them. Thank you a lot! You deserve nirvana, heaven, everything!
Second, I have two questions,
first one: why are you so great? haha
and second one: if the sample with null values is filled in with the most common value among the samples that have the same target variable - in this case, the value "NO" in the "Blocked Arteries" column - and the Random Forest is then built with this replacement, won't that bias the model?
And won't that bias mean that, when refining the guess, the final value will always be the one that was used as the replacement?
(In this case, the final value was "NO" for the "Blocked Arteries" column.)
Thanks for reading me! Hope you can help me with this doubt. Have a wonderful day!
Say we put "no" in the blocked arteries column for a sample, but all of the other values in all of the other columns are similar to samples that have "yes" - then we will probably end up changing the guess to "yes".
Hey Josh thank you for the amazing video!
In your example, you demonstrated how random forest can deal with categorical missing data when classifying a new sample, how about the following scenarios:
1) Continuous missing data for a classification problem
2) Categorical missing data for a regression problem
3) Continuous missing data for a regression problem
First, remember that regression trees still have discrete output values - they bin the possible output values. So, in this case, we can just treat them as classification trees with more options. Thus, for questions #1 and #3, we can just plug in the median value for each possible output value and follow the steps shown earlier in the video. For #2 we plug in the categorical options for each possible output value.
@@statquest Is there a typical way that the output values for regression trees are binned? How would I know how many bins are there? Or is the number of bins typically a hyperparameter?
@@khaikit1232 To be honest, I'm not sure off the top of my head how you'd do it.
Josh, it would be great to provide some reference books for topics too. There are so many books that one can hardly know which one is the best.
I wish I could, but the only book I ever use is An Introduction to Statistical Learning (which is a free download). Other than that, I read the original manuscripts (and, if I remember, I provide links to them in the video's description... but sometimes I forget to add the links).
Dam! You still reply to everyone, even after a year. Quadruple BAM!!
bam! :)
thank you
:)
Hey Josh!
Really cool lesson. Thank you!
But I have a little question:
The situation at 10:40 has 4 samples.
Is that enough to make a good guess using an iterative method to predict BA for every HD outcome?
Don't we need to run all 4 samples down the trees in the forest?
Thank you
No, we just use the exact same iterative method described before. We use the most common value for BA given the HD status and then create the similarity matrix and adjust. For more details, see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
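As a rough Python sketch of that iterative refinement (the function name `rf_impute` and the choice to let only originally non-missing neighbors vote are my own assumptions, not part of scikit-learn or of Breiman's exact code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_impute(X, y, missing_mask, n_iter=5, random_state=0):
    """Sketch of proximity-based imputation for numeric features.
    X: float array with initial guesses already filled in (e.g., the
    per-class median); missing_mask marks which entries were missing."""
    X = X.copy()
    for _ in range(n_iter):
        rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
        rf.fit(X, y)
        # leaves[i, t] = index of the leaf that sample i lands in for tree t
        leaves = rf.apply(X)
        # proximity[i, j] = fraction of trees where samples i and j share a leaf
        prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)
        np.fill_diagonal(prox, 0.0)  # a sample doesn't vote for itself
        # refine each missing entry with a proximity-weighted average of the
        # other samples' originally non-missing values in that column
        for i, j in zip(*np.where(missing_mask)):
            w = prox[i] * ~missing_mask[:, j]
            if w.sum() > 0:
                X[i, j] = np.average(X[:, j], weights=w)
    return X
```

For a categorical feature, the weighted average would become a proximity-weighted vote; the R randomForest package's `rfImpute()` implements the real thing.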
Hi! An insightful yet simple video. I was wondering how there can be multiple random forests? In the part 1 video there seem to be the “best” random forest that we only use. Would love to know more!
What time point, minutes and seconds, are you asking about?
Great stuff!!!! Can you talk about clustering using random forest please? Thanks!!!!
Umm.... That's what I talk about in this video...
Nice humming. So alive
Thanks!
I love this at double speed
:)
thanks for covering everything!!! great job
Thanks!
Thanks for all your videos. Have you ever considered a remake of all the topics you've taught us, but using R in your examples? I would pay for it.
Are you asking about the webinars or the individual videos (like this one). If you are asking about Random Forests, I have a video (and code) that shows you how to do it in R: czcams.com/video/6EXPYzbfLCE/video.html
creepy triple BAM was the best BAM ever!!!!
Hooray!!! :)
@@statquest You are great
@@michaelethanlevinger2935 Thanks! :)
@@statquest Getting better at being a Data Scientist because of you
@@michaelethanlevinger2935 That's awesome! Good luck on your 'Quest! :)
Josh, great explanation, thanks for the video. I had a question though: the method you discussed for handling missing values seems to make sense for classification, but I don't understand how it would work for regression. Is this entire discussion about classification using random forests?
Unfortunately I couldn't find documentation on how this would work for regression. :(
Hello Josh! Thanks for the video, really helpful. I need one clarification and have one follow-up question about the 2nd type of missing value, i.e., a missing value in both blocked arteries and heart disease (since we have to predict it).
One clarification: At this stage, we have completely built the random forest model and training data is no longer being used. Is this understanding correct?
If not, then I misunderstood 2nd type of missing value. If yes, below is the follow-up question:
Following are the steps:
1. Make copies for the sample with both possible outcome (Yes and No for heart disease). - Understood.
2. Use iterative method for estimating the good guess. - At this stage, I have following question:
Question: We have built the random forest so we won't have training data but just the random forest model. So how do we get the good guess based on just the model?
Yes, the second type of missing values only works if we have already trained the random forest. However, hold onto that training data, because you'll need it for this method.
Great video indeed thanks!
So once you are done with the first type of missing data (i.e. during training), you retrain a random forest, but now using the full dataset, right?
Yep
Thanks for the intuitive video. Is 'missing at random' an assumption of the RF proximity imputation?
What time point, minutes and seconds, are you asking about?
Hi Josh,
first: thanks for the video, wonderful as always,
second: I'm trying to implement Random Forest data imputation algorithm (presented in your video) in python and I have some questions to you:
1. Can we say what the RF hyperparameters for every "refine guess" step should be? Especially max_depth - I think the other hyperparameters aren't as important and we can leave them at their defaults, but the default max_depth is None => very deep trees, a kind of overfitting.
2. Follow up to the above: What about the random state of our RF's -> should it be fixed? If yes, then "the same" RF's are trained on better and better data at each iteration (our data may converge), if not then training is more random, but impossible to repeat (our data may not converge - my observations).
3. What our stop criterion should be? With percentage change (x_{it} - x_{it-1})/x_{it-1} there is a risk of dividing by zero.
4. Please tell me if I'm right: we can write our weighted average formula as the dot product of the proximity matrix (each row divided by its sum) and the dataframe values, but we can't include our imputed values, so we have to set them to zero before the dot product?
I'm glad you're implementing the algorithm, but, I'll be honest, I don't understand your questions about your work. That being said, the R implementation has these features, and you can look at that implementation for guidance. If you want to get a quick overview in how to get random forests working in R, see: czcams.com/video/6EXPYzbfLCE/video.html
did you succeed in what you wanted to do ?
Thanks for this wonderful video... you are excellent. I have a question: when we classify a new patient, how do we deal with it if our missing variable is numeric (say weight is missing)?
Presumably use the median values for the two categories to impute the missing value.
Hey Josh, explanation for filling the missing data in a new sample for a categorical variable such as Blocked Arteries was simply awesome. But what about a continuous variable such as Weight? How do we calculate missing weight data for a new sample?
It's the exact same as when we did it at 1:45, however, just like we did for Blocked Arteries, we do it for both categories.
@@statquest Thank you for the confirmation 😄
Thanks, Josh. Apparently, the missing data issue in Random Forests is computationally very expensive when you have a massive number of samples and many missing variable values. Is this fully automated in software packages like SAS and Python?
As far as I can tell, it is not implemented for Python; however, it is for R, and I have a demonstration of how to use it here: czcams.com/video/6EXPYzbfLCE/video.html
Hi, great video as always. Btw, can there be a scenario where, for both classes (YES, NO) in the testing data, the tree traversal count is the same - i.e. when the missing column is the most important value in deciding whether the target is True or False?
Possibly. If that is the case, it probably just goes to the left (assumes true). That's what XGBoost does by default.
Hi, I have a question: for the missing data #2 scenario (a new sample presented to us), how will we guess the entry if it is a continuous variable? Maybe weight in the above dataset. Btw, awesome content, totally enjoying it. BAM!!!
Are you asking about what would happen at 10:08 if we had to guess "weight"? We would create two copies, one for Has Heart Disease and one for Does not have heart disease, then we would put the median weight for people that have heart disease and the median value for people that do not have heart disease, and then we would do the proximity thing to refine the guess.
@@statquest Firstly, thank you for your efforts. I have a question related to this: you mentioned putting in the median weight for HD and No HD - is the median weight calculated from the training dataset or the test dataset? Also, we repeat the process until the value converges, but which value: the 'missing value' or the 'proximity value'?
@@Malikk-em6ix 1) Training dataset 2) Until the missing value no longer changes very much
Don't know if someone can answer this question, but for dealing with missing info in the new data, when Josh mentions running the 2 copies of the data through the iterative process, he just means running them through all the trees in the existing Random Forest we have already built, right? (i.e. we're not building new trees as part of that process)
That is correct. We do not build new trees to determine the missing values.
@@statquest Awesome. Thanks for the response! And just mirroring what everyone else is saying, love your videos. Your explanations are always amazing! Will definitely make a contribution on Patreon once I get out of this financial rut I'm in.
I just started learning random forests. Do you know if the imputation for missing data you explained is done automatically by the sklearn random forest regressor? Or is this something that we have to code ourselves?
Thanks
I'm not 100% sure, but I do not think the sklearn random forest automatically imputes missing values. Instead, it must be done separately.
@@statquest Thanks Josh, are you aware of any software package that would impute missing data for random forest the same way you described?
Y'all can use this. Works for me.
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_imputed = imputer.fit_transform(X)  # X is your feature matrix
Mohanram balaji hey, does this method impute the median of all the data, or just the median for the 'no' class?
Thanks, Josh. How about using the random forest in regression to impute and cluster data?
Sure, you can do that. Just replace the classification trees with regression trees.
Thanks Josh for the amazing video. I have two questions and also my opinions on them, if you can please help guide my thinking:
1. How do we deal with outliers in the training data before fitting Random Forests?
Here is what I think:
I think outliers can mess with the performance of Random Forests and we should remove them. This is because Random Forests will have a tendency to split off and isolate those outlier regions whenever a feature with outlier values is selected. If the y values contain outliers, the problem is even more severe.
2. Does scaling the data before fitting random forest increase the performance of Random Forests? Or for that matter Gradient Boosting?
Here is what I think:
I think that scaling will not have an impact on performance, since when we are splitting we only compare values within each variable. The final metric, the sum of squared residuals, does not contain any X; we only iterate over a range of possible split values of X. So the scale of the Xs should not have any impact on which split is selected.
Please let me know if I got it right!
1) Outliers, in general, are problematic.
2) For trees, in general, there is no need to scale the data.
hi, great video man... I have one doubt: we fill in the missing value with a guess, then we'll classify it based on the most frequent correct label (target variable), right?
How can we make a guess for a single sample?
What do you mean by "single sample"?
Hi, great video! 10:35 For new data can't we use the same method for filling in null values as we did with the training data?
Yes, that's what we do, but we do it for both classes, since we don't know which class the new data belongs to at first.
3B1B, Welchlabs, and you, StatQuest.... If there were Avengers for teaching, you'd be one.
Bam! :)
When you have missing data and a missing label + an existing forest,
do you use the existing forest to fill in the missing values? Then use that same existing forest to predict the label?
I feel a different forest should be used to estimate the missing values, as the new sample influences the tree creation process… if someone knows… thank you!
In practice, we create separate forests. For details, see: czcams.com/video/6EXPYzbfLCE/video.html
Thank you Josh for creating such a clear and easy to understand video. How could you make an initial guess if the Heart Disease column was numeric as well? When making your initial guess, you wouldn't be able to pick the patients with the same entry as in this example. I hope that makes sense, thanks again.
To be honest, I don't know how this works for regression off the top of my head.
@@statquest Thanks for replying. Perhaps a weighted average of the "Weight" entries where the weights are the difference between the known numeric "Heart Disease" values?
@@megatitchify Hey! I have the same doubt. Have you ever discovered the answer? Thanks!
Thank you a lot! Clearly explained. I just wonder if scikit-learn in Python also deals with the missing values automatically, or do we have to call other commands?
Unfortunately, the sci-kit learn implementation of random forests does not implement any of this cool stuff.
Just stumbled on your video; it has been very helpful! I wanted to verify something regarding the heatmap and MDS plot you showed in this specific video. Aren't these technically respective to the first ProxMat in this video (0.9 prox value b/w 3&4)? It threw me off for a second because they were shown next to the hypothetical ProxMat you created that showed sample 3 and 4 as being close as close can be (1.0 prox value b/w 3&4). 9:18
To be honest, it's been so long since I created this video that I can't give you a certain answer. However, I hope you can understand the main ideas.
@@statquest very understandable, all good!
Hi. This concept is really cool :) I had a doubt though: can we handle the proximity-based missing value approach using the current sklearn package? If not, how can we do it?
Unfortunately the sklearn implementation of Random Forests is terrible and does not include this feature. However, the R package does, so I recommend using that instead. Here's the code: czcams.com/video/6EXPYzbfLCE/video.html
Thanks for the explanation, very precise.
One question about the last example, in the case of missing data in the testing sample: if, instead of blocked arteries, weight was missing, and the prediction column is also numerical instead of binary, what steps are recommended?
You would create new "pseudo" observations for each output value in the trees.
@@statquest Thanks for the reply. As I understand it, instead of 2 copies (yes/no), we will need to create N copies and then try to guess the missing data from the features using the proximity matrix.
How should this proximity matrix be calculated for numerical columns?
@@abhishekdnyate8508 In this case you'll be using regression trees, which still bin their output values in to discrete bins. For more details, check out the 'Quest on Regression Trees: czcams.com/video/g9c66TUylZ4/video.html
10:44 "Then we use the iterative method we just talked about to..." I am a bit confused about which iterative method is being referred to. Does it include the proximity matrix or not (i.e. do we just guess based on other samples)? Thank you
It is the exact method we just talked about. We calculate the proximity matrix over and over again until it converges.
Great videos, thanks a lot! Quick question: at minute 6:17, it feels like it should be: No = 0.1 * 1/3 + 0.8 * 1/3 and Yes = 0.1 * 1/3, accumulating each row's influence on the 4th (unknown) row. Otherwise, aren't we double counting?
Why do you think we are double counting? The weight function ensures that the sum of the weights for all of the categories (in this case we only have 2 categories, but we could have more) is equal to 1. Thus, the weights are normalized.
Thanks for the reply. My reasoning was:
We are trying to predict the unknown value of Blocked Arteries for the 4th row. Each other row (1, 2, and 3) will help us predict it, based on its Blocked Arteries value and how close (in proximity) it is to the 4th row.
The 2nd row has the value 'yes' and its proximity is 0.1. So it is saying the 4th row should be 'yes',
and the weight for Yes should be = 1/3 * 0.1
The 1st and 3rd rows have the value 'no'. In other words, they are voting 'no' for the 4th row's unknown value.
The 1st row's proximity is 0.1. The weight for 'No' from the 1st row is: 1/3 * 0.1
The 3rd row's proximity is 0.8. The weight for 'No' from the 3rd row is: 1/3 * 0.8
Total weight for 'No' = (1/3 * 0.1) + (1/3 * 0.8)
(I think this approach is also aligned with the following section on how to calculate numeric unknown values.) (Sorry, my post became a bit lengthy; I just wanted to make sure I am not missing the point.) Thanks again!
@@mehmetb5132 Leo Breiman, who created random forests, says "[For] a missing categorical variable, replace it by the most frequent non-missing value where frequency is weighted by proximity." ( see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1 )
I'm pretty sure that the frequencies of the "non-missing values" are 1/3 and 2/3, and not 1/3, 1/3 and 1/3, because there are only 2 "non-missing values", and using 3 frequencies suggests that there are more than 2 "non-missing values". So, in this example we are creating weights for the frequencies of the non-missing values, 1/3 and 2/3. Does that make sense, or am I misinterpreting Leo Breiman's text?
@@statquest Thanks for the comment and the link to the original inventor's page, awesome! To be honest, I am having a hard time interpreting the original text from Leo Breiman's sentence. I feel like it could go either way. It's great that he included the code for his implementation. It is in Fortran!!! I guess curiosity killed the cat; I tried to decipher it.
(www.stat.berkeley.edu/~breiman/RandomForests/cc_software.htm)
He has an 'impute' function, yay! There are two sections (loops), first one is for numerical values (if(cat(m).eq.1) then). Second, for categorical values (if(cat(m).gt.1) then)
(www.stat.berkeley.edu/~breiman/RandomForests/cc_manual.htm#c2)
Inside the second loop,
1 do m=1,mdim
2 if(cat(m).gt.1) then
3 do n=1,near
4 if(missing(m,n).eq.1) then
5 call zervr(votecat,maxcat)
6 do k=1,nrnn
7 if (missing(m,loz(n,k)).ne.1) then
8 j=nint(x(m,loz(n,k)))
9 votecat(j)=votecat(j)+real(prox(n,k))
10 endif
11 enddo !k
12 rmax=-1
13 do i=1,cat(m)
14 if(votecat(i).gt.rmax) then
15 rmax=votecat(i)
16 jmax=i
17 endif
18 enddo
19 x(m,n)=real(jmax)
20 endif
21 enddo !n
22 endif
23 enddo !m
He has an array called 'votecat' that he fills with zeros first (line 5)
then in an inner loop,
he accumulates (just adds) the proximity values. It seems like he is assuming that each row without a missing value participates equally (line 9)
Finally, picking the maximum as the winner (line 15)
I can't see that he is doing any additional work in terms of frequencies, unless I am missing something?
@@statquest Hi Josh, awesome video! But I think @Mehmet B is correct. Say we have n rows without missing values. Each row just makes an accumulating contribution of 1/n * proximity. You can't calculate the aggregate frequencies first and then apply the proximity scores to the aggregate frequencies - that would let a proximity score affect other, non-corresponding rows. Leo Breiman's wording might be a little confusing, but I believe by saying "where *frequency* is weighted by proximity", he means a row-by-row accumulation rather than calculating overall frequencies first. This is also reflected in the source code that @Mehmet B has shown.
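For what it's worth, the accumulate-and-argmax loop in that Fortran snippet translates to just a few lines of Python (the function name is mine, and the row-by-row accumulation follows the thread's reading of the code, so treat this as a sketch):

```python
def impute_categorical(values, prox_row, missing):
    """Proximity-weighted vote for one missing categorical entry, mirroring
    Breiman's Fortran 'impute' loop: each non-missing neighbor adds its
    proximity to its category's vote, and the largest total wins."""
    votes = {}
    for v, p, m in zip(values, prox_row, missing):
        if not m:                               # line 7: skip missing neighbors
            votes[v] = votes.get(v, 0.0) + p    # line 9: votecat(j) += prox
    return max(votes, key=votes.get)            # lines 12-18: pick the max
```

With the video's example - values ('no', 'yes', 'no') and proximities (0.1, 0.1, 0.8) - 'no' wins with a summed proximity of 0.9 versus 0.1 for 'yes'.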
Excellent video, however I did not fully understand the denominator used in calculating the weighted values from the proximity matrix.
It's basically the same technique used to calculate a weighted mean. www.statisticshowto.com/probability-and-statistics/statistics-definitions/weighted-mean/
Thanks for the amazing videos. I have a question about the missing values in the example at 11:25. It was:
the first one: blocked arteries: Yes and Heart Disease: Yes
the second one: blocked arteries: No and Heart Disease: No
Do we have to make four of these, like below, just to cover all the possibilities?
the first one blocked arteries: Yes and Heart Disease: Yes
the second one blocked arteries: No and Heart Disease: Yes
the third one blocked arteries: Yes and Heart Disease: No
the fourth one blocked arteries: No and Heart Disease: No
No. At 10:40 I explain how we find the optimal values for "blocked arteries" without having to try all possible combinations.
I like your video; however, at 10:51, I could not see how to come up with such imputed values for blocked arteries for those 2 samples. Since the iterative guessing looks at other samples (e.g. at 5:02) and uses the proximity matrix, how can just one unseen sample (at 10:15) do that?
I'm not sure I fully understand your question, but at 5:02 we see that the one person that has heart disease also has blocked arteries, so it makes sense that, when we impute "blocked arteries" for someone new that has heart disease, we would impute "yes" for blocked arteries. Likewise, the other people without heart disease at 5:02 also did not have blocked arteries. So it makes sense to impute "no" for someone new that does not have heart disease.
I have a question. For missing data #2 scenario, if I have multiple labels, does that mean I need to make multiple copies with each potential label? Thank you so much!!
Yes
I am confused between entropy and the Gini index. Which method is used in libraries like sklearn to calculate the impurity in a decision tree? Or which method should be used to split the nodes in a decision tree?
I believe both methods are available, but Gini is the default.
It's an awesome video. May I ask, for a missing value in a new sample, if it is a regression problem, how do we do it?
I'm not 100% certain, but you could probably just plug in a bunch of values and test each one.
Random Forests build multiple trees initially and then impute missing data using those trees (the proximity-matrix method). Now, these trees will be used for predicting the final output.
Hence, filling in the missing values in the training set had no effect on the prediction for our test dataset. So why do we bother filling in the missing values in RF at all?
Once you impute the data, you can build a new random forest with the full dataset.
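For readers who want to see the refinement step concretely, here is a rough sketch of the proximity-weighted vote Breiman describes for a categorical column (the function and variable names are mine for illustration, not from any library):

```python
def impute_categorical(values, prox_row, missing_idx):
    """Refine the guess for sample `missing_idx` in one categorical column.

    `values` holds the column with initial guesses already filled in, and
    `prox_row` is that sample's row of the proximity matrix. Each category's
    vote is weighted by how often the samples that have it ended up in the
    same leaves as the missing sample."""
    weights = {}
    for i, v in enumerate(values):
        if i != missing_idx:
            weights[v] = weights.get(v, 0.0) + prox_row[i]
    # return the category with the largest proximity-weighted vote
    return max(weights, key=weights.get)
```

For a numeric column, Breiman's method instead takes a proximity-weighted average of the other samples' values; either way, the forest is rebuilt and the proximities recomputed for a few iterations.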
Timestamp 9:24: what do the black and blue colors in the heatmap represent, and how do the heatmap colors correspond to the numbers in the distance matrix?
To understand heatmaps, see: czcams.com/video/oMtDyOn2TCc/video.html
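The numbers behind that heatmap are just 1 minus the proximity values; here is a one-line sketch of that conversion (my own helper, not a library function):

```python
def proximity_to_distance(prox):
    """Samples that often end up in the same leaf (proximity near 1)
    get a distance near 0, which draws as a dark cell in the heatmap."""
    return [[1.0 - p for p in row] for row in prox]

print(proximity_to_distance([[1.0, 0.75], [0.75, 1.0]]))  # [[0.0, 0.25], [0.25, 0.0]]
```

The same distance matrix feeds both the heatmap and the MDS plot shown in the video.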
Regarding the last part, would it make sense to make two more assumptions and see how often they are predicted correctly using random forests? That is 1) Blocked Arteries = Yes , Heart Disease = No 2) Blocked Arteries = No, Heart Disease = Yes.
You could try that, but it makes more sense to create two samples, one that has heart disease and one that doesn't, and then use the iterative method described earlier to impute the best value.
Quick question - at 1:55 you say median, but it's also equal to the mean. Are they the same only because just two values are sampled (and would differ in a larger dataset, so use the median)?
Yes, in this example, the mean and median are the same because there are only two samples. If there were more, then, chances are, they would not be the same and we should use the median.
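A quick check with Python's statistics module, using the two weight values from the video:

```python
from statistics import mean, median

weights = [125, 210]  # the two non-missing Weight values in the example
print(mean(weights))    # 167.5
print(median(weights))  # 167.5 -- identical with only two values
```

With three or more values they generally differ (e.g. the median of [125, 180, 210] is 180, while the mean is about 171.7), and the median is the more outlier-robust initial guess.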
I’m wondering, for the second part of this video, how should we guess the initial missing numerical values? What guessing options do we have in this case? Thank you~~~~ BamBam
In general, how should I make the initial guess if we have multiple labels, or even a numerical continuous output, instead of a binary classification problem? Should we create multiple samples?
If a feature or variable is binary or categorical (or some other discrete value), we choose the option that is most common among observations that have the same outcome. For continuous values, we take the median value among observations with the same outcome. This is described starting at 1:12
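A minimal sketch of that initial guess, assuming a toy dataset (the function name and data are mine, purely for illustration):

```python
from statistics import median, mode

def initial_guess(column, outcomes, target_outcome, is_numeric):
    """Initial guess for a missing value: look only at samples with the
    same outcome (e.g. Heart Disease = Yes), then take the median for a
    numeric column or the most common value for a categorical one."""
    same = [v for v, o in zip(column, outcomes)
            if o == target_outcome and v is not None]
    return median(same) if is_numeric else mode(same)

# Hypothetical toy data: impute Blocked Arteries for a patient with heart disease
blocked = ["yes", "yes", "no", None]
disease = ["yes", "yes", "no", "yes"]
print(initial_guess(blocked, disease, "yes", is_numeric=False))  # yes
```

This is only the starting point; the guesses are then refined iteratively with the proximity matrix, as described in the video.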
@@statquest Thanks for the answer~~ I think my wording was confusing. I understand the above method works for binary or multi-label classification trees. What if we want to solve a random forest regression problem, where the outcomes are continuous values? How can we fill in the missing values for the training samples (used to create our forest) and the testing samples (whose outcomes we want to predict using our forest)? Thank you BAM BAM BAM~
@@hellochii1675 That's a good question and I don't know the answer to it. The good news is that XGBoost does regression and has a way of dealing with missing values in that context - so we have options that are similar that we can use if we need to.
Great video. But there is something about type 2 that I wonder about: what if the missing data is a numeric type, such as weight? How do we handle it?
To be honest, I'm not sure I would use a random forest for a regression problem to begin with.
What do we mean by the iterative method when filling in missing data in a sample to classify? Is it the same one we used to impute the training data?
Yes
@@statquest thanks, for the clarification. Love your work it's so interesting, i can binge watch the entire playlist.
Much thanks from a sophomore!
Thank you! :)
I am really sorry to bombard you with questions.... I was just thinking that after we turn a decision tree into a random forest, we lose some interpretability of the data, for example, which variable contributed the most to the final decision. After looking at random forest clustering (I love it! I haven't seen another resource that actually expands on this), I was thinking that since we can turn it into a heat map or MDS plot, we can still get some idea about which variables matter the most, right? I want to make sure I am getting the right idea.
Actually, Random Forests have a specific way to calculate variable importance. To quote from the source (the source is here: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm ) "In every tree grown in the forest, put down the oob cases and count the number of votes cast for the correct class. Now randomly permute the values of variable m in the oob cases and put these cases down the tree. Subtract the number of votes for the correct class in the variable-m-permuted oob data from the number of votes for the correct class in the untouched oob data. The average of this number over all trees in the forest is the raw importance score for variable m."
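Breiman's procedure quoted above can be sketched in a few lines of Python. The `(predict, oob_indices)` pair interface below is something I made up for illustration, not a real library API:

```python
import random

def raw_importance(trees_oob, X, y, m, seed=0):
    """Raw importance of variable m: for each tree, count correct votes on
    its out-of-bag (OOB) cases, permute column m among those cases, recount,
    and average the drop in correct votes over all trees."""
    rng = random.Random(seed)
    drops = []
    for predict, oob in trees_oob:
        correct = sum(predict(X[i]) == y[i] for i in oob)
        permuted = [X[i][m] for i in oob]
        rng.shuffle(permuted)  # break the link between column m and the labels
        permuted_correct = sum(
            predict(X[i][:m] + [v] + X[i][m + 1:]) == y[i]
            for i, v in zip(oob, permuted)
        )
        drops.append(correct - permuted_correct)
    return sum(drops) / len(drops)
```

The intuition: if shuffling a variable's values barely hurts the OOB accuracy, the forest was not really using that variable.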
@@statquest Thank you so much! And you pointed to me the right keywords/content to Google (don't know why I didn't think of this myself before....) and now I have a lot more info. about this. This is so cool!
@@20060802Lin Hooray!!! And thank you so much for supporting StatQuest!!!! It really means a lot to me when someone cares enough to contribute.
@@statquest Thank you for caring enough to answer all mine and others' questions :)
Thanks for this video! I have a question: you explained how to fill in missing data in a new sample we want to predict, but only for binary missing data. How would I do it for data that isn't binary, like height or weight?
You start by using the median value.
At 2:21, step 1, when building the tree: should the tree be built with the initial guessed value, or should we only use the data with no missing values to build the tree? Thank you
I believe we build the tree with the guessed values. For all of the details, see: www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
@@statquest Thank you for the quick reply, You are the best explainer of ML I found ever. Thank you
Hmmm, I'm wondering suppose you had a random sample and had data of 11130 patients about their chest pain; 120 about their blood circulation; 40 about if they had blocked arteries; 7 about their chest pain and blood circulation; 8 about their chest pain and if they had blocked arteries; 3 about their chest pain and weight; 31 about their chest pain and blood circulation and if they had blocked arteries; 30 about their chest pain, weight, and blood circulation; 10 about their chest pain, if they had blocked arteries, and their weight; and 371 with all 4 inputs. And of these 11175 patients we knew if they had heart disease or not. Would a random forest made with all the data be much better than one that dropped all the "one input" patients?
Try it! Interestingly, another method called XGBoost is designed to work with missing data like you have. I've put out 2 videos on it so far and I've got at least 2 more to go. I won't get to the part about dealing with missing data until part 4. Here's the link to part 1: czcams.com/video/OtD8wVaFm6E/video.html
At 10:33, after you created the 2 copies of the data, one with heart disease and one without: I don't understand what you mean by "the iterative method we just talked about" and how those guesses were selected.
In the method you mentioned at 1:28, we chose the "most common value found in other samples", but this is a new sample that we want to categorize, not training data, which confuses me a bit.
At 10:33 we are trying to impute a value for Blocked Arteries; however, we don't know whether to use the training data associated with people who have heart disease or the training data associated with people who do not. So we try both, and use each to impute a value for Blocked Arteries. We then have to decide which one is better, and we select the one that is labeled correctly the most times.
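As a rough sketch of that "try both copies" step (the `impute` and `trees` interfaces below are hypothetical stand-ins, not a real API):

```python
from collections import Counter

def classify_with_missing(sample, impute, trees):
    """Make two copies of the sample, one per assumed label, fill in the
    missing value under each assumption, and keep the label whose copy the
    forest agrees with most often."""
    best_label, best_votes = None, -1
    for label in ("yes", "no"):
        copy = impute(dict(sample), label)          # fill missing value under this assumption
        votes = Counter(tree(copy) for tree in trees)[label]
        if votes > best_votes:
            best_label, best_votes = label, votes
    return best_label
```

In other words, the copy that is "correctly labeled" most often by the forest wins, and its label becomes the prediction.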
Are there any statistical justifications for these methods, or are they just engineering fixes?
It depends on the method - often statistics is playing catch up to ML, with the ML method created and the statistical justification for why it works so well coming later. But I believe this method started out with justification as the creator, Leo Breiman, is in the stats department at Berkeley.
@statquest After we find the missing values the first time using the proximity matrix, should we retrain the random forest for the next iteration, or just go ahead and construct the new proximity matrix? Either way, please clarify whether the same should be done when predicting missing data in the test set, where a value is missing and we want to predict the output, y.
I believe your question is answered at 7:40
@@statquest What about the 2nd case, when we have missing data in a new sample that we want to categorize? Do we then also need to retrain the random forest after we estimate the missing values in each iteration?
@@himanshuparida8813 You do it the same way for both.
Thanks
At 10:52, in the final step, shouldn't we also use the combination in which Blocked Arteries is "No" and Heart Disease is "Yes", and another where Blocked Arteries is "Yes" and Heart Disease is "No", and then run the data through the decision trees in order to figure out which of the 4 combinations is best?
You could do that, but random forests use the value for Blocked Arteries that is most commonly associated with "YES Heart Disease" and the value for Blocked Arteries that is most commonly associated with "No Heart Disease".
Can it be considered a drawback of Random Forests that they pick the mode without taking the other possibilities into consideration?
No, because it follows the exact same procedure as before - it sees how the sample clusters and adjusts as needed.
I wonder if this is implemented in Python with a predefined sklearn class, or if we have to implement it ourselves.
Unfortunately, I don't believe it's in the Python implementation. However, it is part of the R implementation, and I show how to do it here: czcams.com/video/6EXPYzbfLCE/video.html
Hi Josh !!
It was a very nice Quest. I have a question regarding missing values in the data to be categorized, i.e. the second type of missing values: how do we start off when a numerical value is missing at the very first step, and we are also predicting a continuous numerical variable (e.g. Sales in Millions)?
Q2) How do we start off for both categorical and numerical missing values when the target column is continuous?
Thank You
I don't know the answer to that.
@@statquest I somehow found the answer: when the target is continuous, it starts by taking the median value for missing continuous variables and the mode for missing categorical variables, and then computes the proximity-weighted average of the missing values. This process is repeated several times, and then the model is trained a final time using the RF-imputed dataset.
@@rajatshrivastav Cool! Thanks for looking that up!
Source- Stats Exchange website
@ 1:58 you have calculated the mean value, not the median value, which is just the (N//2)th element in a sorted array of N elements
When you only have two numbers, 125 and 210, the mean = the median = 167.5
3:44 How would a decision tree like this be made? That is, why does the leaf node end right there without further expansion?
Again, these are just normal decision trees, so to learn more about them, check out: czcams.com/video/_L39rN6gz7Y/video.html
What if the missing value was "Weight" at the new sample (10:40)? How should we guess it?
Just like I said at 1:45, we use the median value for the guess.