Classification Trees in Python from Start to Finish

  • Date added: 20. 07. 2024
  • NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    This webinar was recorded 20200528 at 11:00am (New York time).
    NOTE: This StatQuest assumes you are already familiar with:
    Decision Trees: • StatQuest: Decision Trees
    Cross Validation: • Machine Learning Funda...
    Confusion Matrices: • Machine Learning Funda...
    Cost Complexity Pruning: • How to Prune Regressio...
    Bias and Variance and Overfitting: • Machine Learning Funda...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying my book, The StatQuest Illustrated Guide to Machine Learning:
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    5:23 Import Modules
    7:40 Import Data
    11:18 Missing Data Part 1: Identifying
    15:57 Missing Data Part 2: Dealing with it
    21:16 Format Data Part 1: X and y
    23:33 Format Data Part 2: One-Hot Encoding
    37:29 Build Preliminary Tree
    46:31 Pruning Part 1: Visualize Alpha
    51:22 Pruning Part 2: Cross Validation
    56:46 Build and Draw Final Tree
    #StatQuest #ML #ClassificationTrees

Comments • 582

  • @statquest  4 years ago +26

    NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @funnyclipsutd  4 years ago +67

    BAM! My best decision this year was to follow your channel.

  • @renekokoschka707  3 years ago +7

    I just started my bachelor's thesis and I really wanted to thank you!
    Your videos are helping me so much.
    You are a LEGEND!!!!!

    • @statquest  3 years ago +1

      Thank you and good luck! :)

  • @ccuny1  4 years ago +2

    I have already commented, but I watched the video again and I have to say I am even more impressed than before. Truly fantastic tutorial: not too verbose, but with every action clarified and commented in the code, and beautifully presented (I have to work on my markdown; there are quite a few markdown formats you use that I cannot replicate... to study when I get the notebook). So all in all, one of the very top ML tutorials I have ever watched (including paid-for training courses). Can't wait for today's or tomorrow's webinars. Can't join in real time as I'm based in Europe, but will definitely pick it up here and get the accompanying study guides/code.

    • @statquest  4 years ago

      Hooray!!! Thank you very much!!!

  • @1988soumya  4 years ago +3

    Hey Josh, it’s so good to see you are doing this, I am preparing for some interviews, it will help a lot

  • @montserratramirez4824  4 years ago +7

    I love your content! Definitely my favorite channel this year
    Regards from Mexico!

    • @statquest  4 years ago +2

      Wow, thanks! Muchas gracias! :)

  • @jahanvi9429  a year ago +5

    You are so so helpful!! I am a data science major and your videos saved my academics. Thank you!!

  • @liranzaidman1610  4 years ago +10

    Josh,
    this is really great.
    Can you upload videos with some insights on your personal research and which methods you used?
    And some examples of why you prefer to use one method instead of another? I mean, not only because you get a better result in ROC/AUC, but is there a "biological" reason for using a specific method?

  • @ozzyfromspace  3 years ago +2

    I dunno how I stumbled on your channel a few videos ago, but you've really got me interested in statistics. Nice work, sir 😃

  • @kaimueric9390  4 years ago +6

    I actually think it would be great if you created more videos for other ML algorithms. After teaching us almost every aspect of machine learning algorithms, as far as the mechanics and the related fundamentals are concerned, I feel it is high time to see them in action, and Python is, of course, the best way to go.

  • @dhruvishah9077  3 years ago +2

    I'm an absolute beginner and this is what I was looking for. Thank you so much for this. Much appreciated, sir!!

  • @ccuny1  4 years ago +1

    Another hit for me. I will be getting the Jupyter notebook and some, if not all, of your study guides (I only just realised they existed).

    • @statquest  4 years ago

      BAM! :) Thank you very much! :)

  • @ravi_krishna_reddy  3 years ago +4

    I was searching for a tutorial related to statistics and landed here. At first, I thought this was just one among many low-quality tutorials out there, but I was wrong. This is one of the best statistics and data science related channels I have seen so far, with wonderful explanations by Josh. Addicted to this channel and subscribed. Thank you, Josh, for sharing your knowledge and making us learn in a constructive way.

  • @beebee_0136  2 years ago

    I'd like to thank you so much for making this stream cast available!

  • @nataliatenoriomaia1635  3 years ago +1

    Great video, Josh! Thanks for sharing it with us. And I have to say: the Brazilian shirt looks great on you! ;-)

  • @robertmitru7234  3 years ago +1

    Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!

  • @3ombieautopilot  4 years ago +2

    Thank you very much for this one! Your channel is incredible! Hats off to you

  • @rhn122  3 years ago +6

    Great tutorial! One question: by looking at the features included in the final tree, does it mean that only those 4 features are considered for prediction, i.e., we don't need the rest, so we could drop those columns for further usage?

  • @xiolee7597  4 years ago +4

    Really enjoy all the videos! Can you do a series about mixed models as well, random effects, choosing models, interpretation etc. ?

  • @aryamohan7533  3 years ago +1

    This entire video is a triple bam! Thank you for all your content, I would be lost without it :)

  • @anishchhabra5313  2 years ago +1

    This is legen..... wait for it
    ....dary!! 😎
    This detailed coding explanation of Decision Trees is hard to find, but Josh, you are brilliant. Thank you for such a great video.

  • @fuckooo  3 years ago +1

    Love your videos Josh, the notebook on missing values sounds like a great one to do!

  • @bayesian7404  4 months ago +1

    You are fantastic! I'm hooked on your videos. Thank you for all your work.

  • @jefferyg3504  3 years ago +1

    You explain things in a way that is easy to understand. Bravo!

  • @DANstudiosable  4 years ago +5

    OMG... I thought you'd ignore me when I asked you to post this webinar on YouTube. I'm glad you posted it. Thank you!

  • @JoRoCaRa  a year ago +1

    brooo... this is insane!! thanks so much! this is amazing saving me so many headaches

  • @juniotomas8563  4 months ago +1

    Come on, Buddy! I just saw a recommendation for your channel, and in the first video I see you in a Brazilian t-shirt. Nice surprise!

  • @jonastrex05  2 years ago +1

    Amazing video! One of the best out there for this Education! Thank you Josh

  • @Mohamm-ed  3 years ago +2

    This voice reminds me of listening to the radio in the UK. Love that. I want to go again

  • @rajatjain7465  a year ago +1

    wowowowwo, the best course ever, even better than all those paid courses. Thank you, Josh Starmer, for these materials

  • @naveenagrawal_nice  6 months ago +1

    Love this channel, Thank you Josh

  • @magtazeum4071  4 years ago +2

    BAM...!!! I'm getting notifications from your channel again

  • @utkarshsingh2675  a year ago +1

    this is what I have been looking for on YouTube... thanks a lot, sir!!

  • @ericwr4965  4 years ago +1

    I absolutely love your videos and I love your channel. Thanks for this.

  • @pratyushmisra2516  4 years ago +4

    My intro song for this channel:
    " It's like Josh has got his hands on python right,
    He teaches Ml and AI really Well and tight ---- STAT QUEST"
    btw thanks Brother for so much wonderful content for free.....

  • @sameepshah3835  a month ago +1

    I love you so much Josh. Thank you so much for everything.

  • @creativeo91  3 years ago +4

    This video helped me a lot for my Data Mining assignment.. Thank you..

  • @bessa0  2 years ago +1

    Kind Regards from Brazil. Loved your book!

  • @joaomanoellins2219  4 years ago +25

    I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!

  • @alexyuan1622  3 years ago +1

    Hi Josh, thank you so much for this awesome posting! Quick question: when doing the cross validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? I'm wondering, if the point of doing cross validation is to let each chunk of the data set be the testing data, shouldn't we then use the full data set, X_encoded and y, for the cross validation? Thank you!!

    • @statquest  3 years ago +1

      There are different ideas about how to do this, and they depend on how much data you have. If you have a lot of data, it is common to hold out a portion of the data to only be used for the final evaluation of the model (after optimizing and cross validation) as demonstrated here. When you have less data, it might make sense to use all of the data for cross validation.

    • @alexyuan1622  3 years ago +1

      @@statquest Thanks for the quick response. That makes perfect sense.
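
      To make the pattern above concrete, here is a minimal sketch of holding out a final test set and cross validating only on the training portion. X_encoded and y are the one-hot encoded features and labels from the video; the cv and ccp_alpha values are illustrative assumptions.

      from sklearn.model_selection import train_test_split, cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      # Hold out a final test set that cross validation never sees...
      X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42)

      # ...then cross validate on the training portion only.
      clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.016)  # alpha is illustrative
      scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
      print(scores.mean(), scores.std())

      # The held-out test set is used once, for the final evaluation of the model.
      print(clf_dt.fit(X_train, y_train).score(X_test, y_test))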

  • @user-lc8gc6vb3j  10 months ago +2

    Thank you, this video helped me a lot! For anyone else following along in 2023, the way the confusion matrix is drawn here didn't work for me anymore. I replaced it with the following code:
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt

    # plot_confusion_matrix was removed from scikit-learn, so build and draw the matrix directly:
    cm = confusion_matrix(y_test, clf_dt_pruned.predict(X_test), labels=clf_dt_pruned.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Does not have HD', 'Has HD'])
    disp.plot()
    plt.show()

    • @statquest  10 months ago

      BAM! Thank you. Also, I updated the jupyter notebook.

  • @filosofiadetalhista  2 years ago +1

    Loved it. I am working on Decision Trees on my job this week.

  • @sharmakartikeya  3 years ago +1

    Hurray! I saw your face for the first time! Nice to see one of the people I have subscribed to

  • @gbchrs  2 years ago +1

    your channel is the best at explaining complex machine learning algorithms step by step. please make more videos

    • @statquest  2 years ago

      Thank you very much!!! Hooray! :)

  • @liranzaidman1610  4 years ago +2

    Fantastic, this is exactly what I needed

  • @josephgan1262  2 years ago

    Hi Josh, thanks for the video again!! I have some questions I hope you don't mind clarifying, regarding pruning and hyperparameter tuning in general. I see that the video does the following to find the best alpha:
    1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32
    2) Recheck the best alpha by doing CV @52:33. This shows that there is huge variation in the accuracy, which implies that alpha is sensitive to the choice of training set.
    3) Redo the CV to find the best alpha by taking the mean accuracy for each alpha.
    a) At step 2, do we still need to plot the training set accuracy to check for overfitting? (It is always mentioned that we should compare training & testing accuracy to check for overfitting.) But there is a debate on this as well, where some argue that for a model A with 99%/90% training/test accuracy vs. a model B with 85%/85%, we should pick model A because its 90% testing accuracy is higher than 85%, even though model B has no gap (overfitting) between train & test. What's your thought on this?
    b) What if I don't do steps 1) and 2) and go straight to step 3)? Is this bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2? Thanks.
    c) I always see that the final hyperparameter is decided by the highest mean accuracy over all K folds. Do we need to consider the impact of variance across the folds? Surely we don't want our accuracy to jump all over the place in production. If yes, what is a general rule of thumb for when the variance in accuracy is considered bad?
    Sorry for the long post. Thanks!

    • @statquest  2 years ago

      a) Ultimately the optimal model depends on a lot of things - and often domain knowledge is one of those things - so there are no hard rules and you have to be flexible about the model you pick.
      b) You can skip the first two steps - those were just there to illustrate the need for using cross validation.
      c) It's probably a good idea to also look at the variation.
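
      A minimal sketch of point (c), looking at the spread of the cross-validation scores and not just their mean. It assumes X_train and y_train from the video; everything else is illustrative.

      from sklearn.tree import DecisionTreeClassifier
      from sklearn.model_selection import cross_val_score

      # Candidate alphas from cost complexity pruning (the last alpha prunes
      # the tree down to a single node, so it is skipped).
      path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

      for alpha in path.ccp_alphas[:-1]:
          clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
          scores = cross_val_score(clf, X_train, y_train, cv=5)
          print(f"alpha={alpha:.4f}  mean={scores.mean():.3f}  std={scores.std():.3f}")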

  • @junaidmalik9593  3 years ago

    Hi Josh, one amazing thing about the playlist is the song you sing before starting the video; it refreshes me. You know how to keep the listener awake for the next video, hehe. And really, thanks for the amazing explanation.

  • @Kenwei02  2 years ago +1

    Thank you so much for this tutorial! This has helped me out a lot!

  • @douglasaraujo9763  3 years ago +1

    Your videos are always very good. But today I’ll have to commend you on your fashion choice as well. Great-looking shirt! I hope you have had the opportunity to visit Brazil.

    • @statquest  3 years ago

      Muito obrigado! Eu amo do Brasil! :)

  • @ramendrachaudhary9784  3 years ago +2

    We need to see you play some tabla to one of your songs. Double BAM!! Great content btw :)

  • @teetanrobotics5363  3 years ago +1

    Amazing man. I love your channel. Could you please reorder this video , SVMs and Xgboost in the correct order in the playlist ?

  • @simaykazc1508  3 years ago +1

    Josh is the best. I learned a lot from him!

  • @jihowoo9667  4 years ago +1

    I really love your video, it helps me a lot!! Regards from China.

  • @umairkazi5537  4 years ago +1

    Thank you very much. This video is very helpful and clarifies a lot of concepts for me

  • @estebannantes8567  4 years ago +1

    Hi Josh. Loved this video. I have two questions: 1) Is there any way to save our final decision tree model to use later on unseen data without having to train it all again? 2) Once you have decided on your final alpha, why not train your tree on the full, unsplit dataset? I know you will not be able to generate a confusion matrix, but wouldn't your final tree be better if it is trained with all the examples?

    • @statquest  4 years ago +1

      Yes and yes. You can write the decision tree to a file if you don't want to keep it in memory (or want to back it up). See: scikit-learn.org/stable/modules/model_persistence.html
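
      Following the model-persistence link above, a minimal sketch of saving and reloading the fitted tree with joblib (clf_dt_pruned is the pruned classifier from the video; the filename is illustrative):

      import joblib

      # Save the fitted tree to disk...
      joblib.dump(clf_dt_pruned, "clf_dt_pruned.joblib")

      # ...and load it later to predict on unseen data without retraining.
      clf_loaded = joblib.load("clf_dt_pruned.joblib")
      predictions = clf_loaded.predict(X_test)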

  • @Nico.75  3 years ago

    Hi Josh, such an awesome, helpful video, again! May I ask you a basic question? When I do an initial decision tree model build using a train/test split and evaluate the training and test accuracy scores, and then start over doing k-fold cross validation on the same training set and evaluate it on the same test set as in the initial step -> is that a proper method? Because I used the same test set for evaluation twice: first with the initial train/test split method and second with the cross-validation method. I read that you should use your test (or hold-out) set only once… Last question: should you use the exact same training/test sets for comparing different algorithms (decision trees, random forests, logistic regression, kNN, etc.)? Thanks so much for a short feedback, and quest on! Thanks and BAM!!!

    • @statquest  3 years ago

      Yes, I think it's OK to use the same testing set to compare the model before optimization and after optimization.
      Ideally, if you are comparing different algorithms, you will use cross validation and pick the one that has the best, on average, score. Think of picking an algorithm like picking a hyperparameter.
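
      As a sketch of "picking an algorithm like picking a hyperparameter", cross validate each candidate on the training data and compare the average scores. The candidate models and cv=5 are illustrative assumptions, not from the video.

      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.linear_model import LogisticRegression

      candidates = {
          "decision tree": DecisionTreeClassifier(random_state=42, ccp_alpha=0.016),
          "logistic regression": LogisticRegression(max_iter=1000),
      }

      # Pick the algorithm with the best average cross-validation score.
      for name, model in candidates.items():
          scores = cross_val_score(model, X_train, y_train, cv=5)
          print(f"{name}: mean accuracy = {scores.mean():.3f}")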

  • @floral7448  3 years ago +1

    Finally have the honor to see Josh :)

  • @danielw7626  3 years ago

    Hi Josh, thanks for your clear explanation; it's very helpful. One quick question: do we need to delete one column after performing one-hot encoding to avoid the dummy variable trap? Thank you in advance if you could clarify this for me, as I only started learning ML a month ago. Cheers

  • @bjornlarsson1037  4 years ago

    Absolutely amazing work Josh! You are definitely the best guy on the internet teaching this stuff! Just a question on reproducibility when using get_dummies vs. other methods of encoding. I used make_column_transformer together with make_pipeline. My pruned tree was different in that the node "variables" were different, but the numbers (cutoffs, ginis, samples, values, class) were identical. I also got small differences in other places compared with your result. Given that I have followed along with your code (and used the same random states as you did), should I get exactly the same results as you did (under the assumption that I haven't made any errors, of course), or is it possible that the results may differ between methods? Thanks again Josh!

    • @statquest  4 years ago

      It should be the same. Hmm... This is an interesting problem.

    • @bjornlarsson1037  4 years ago

      @@statquest Okay, I have now at least figured out why the pruned tree is different: the column names were out of order, because apparently make_column_transformer puts the dummy columns at the beginning of the dataset instead of at the end, as get_dummies does. But there are still differences, in that the last confusion matrix is identical to yours but the first confusion matrix is slightly different, even though I called the methods in exactly the same way on both of them. But since you said in your reply that we should get identical results, it must be something I have done differently from you on the first one; I can't really see what right now.

  • @bardhrushiti184  4 years ago

    Great video - thanks for sharing such valuable content.
    I have a question regarding the alpha/accuracy graph: in my dataset, the training and testing accuracies are relatively close (~100% and ~98%, respectively), and after plotting accuracy vs. alpha for training and testing, it seems that as alpha increases, the accuracy decreases as well. At alpha = 0 the accuracy is (train = ~100%, test = ~98%); at alpha = 0.011 it is (train = ~92.5%, test = ~92.1%), and it keeps decreasing. Should I still consider pruning with alpha, even though it seems that the model is doing okay?
    Thank you in advance!
    Keep posting awesome videos!

    • @statquest  4 years ago

      If the full sized tree performs best on your testing data, then you don't need to prune.
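
      For anyone who wants to reproduce that kind of graph, a minimal sketch of plotting training and testing accuracy across the candidate alphas (variable names follow the video; the plotting details are illustrative):

      import matplotlib.pyplot as plt
      from sklearn.tree import DecisionTreeClassifier

      # Candidate alphas from cost complexity pruning (the last alpha is
      # skipped because it prunes the tree down to a single node).
      path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
      ccp_alphas = path.ccp_alphas[:-1]

      # Fit one tree per alpha and record train/test accuracy.
      trees = [DecisionTreeClassifier(random_state=42, ccp_alpha=a).fit(X_train, y_train)
               for a in ccp_alphas]
      train_scores = [t.score(X_train, y_train) for t in trees]
      test_scores = [t.score(X_test, y_test) for t in trees]

      plt.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle='steps-post')
      plt.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle='steps-post')
      plt.xlabel('alpha')
      plt.ylabel('accuracy')
      plt.legend()
      plt.show()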

  • @amc9520  a year ago +1

    Thanks for making my life easy.

  • @abdelrhmansayed5436  2 years ago +1

    thank you for your great effort and simple explanation. I have only one question: why did you split the data into X_train and y_train and then give it to cross_val_score? Shouldn't cross validation work on all of X?

    • @statquest  2 years ago

      In theory we are trying to save some data for a final validation of the model.

  • @fernandosicos  2 years ago +1

    greetings from Brazil!

  • @_ahahahahaha9326  2 years ago +1

    Really learn a lot from you

  • @beibeima524  2 years ago

    Hi Josh, thanks so much for the video! My question is: should we do one-hot encoding before or after splitting the data into training and testing sets? Thanks!

    • @statquest  2 years ago

      As long as all categories are in both sets, it doesn't matter.

  • @sabyasachidas142  2 years ago

    Thanks Josh for the awesome tutorial. I have one question: while one-hot encoding, we also pass drop_first=True as an argument to avoid multicollinearity when performing regression. But we didn't do it for this classification problem. Is it not required?

    • @statquest  2 years ago

      Multicollinearity is a problem with regression, but not with tree based methods.
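
      To make the two one-hot encoding threads above concrete, a minimal sketch with pd.get_dummies. The column names are an assumption based on the heart disease dataset used in the video; drop_first is left at its default of False because, as noted above, multicollinearity is not a concern for trees.

      import pandas as pd

      # One-hot encode the categorical columns; tree-based methods don't
      # suffer from multicollinearity, so drop_first=True is not needed.
      X_encoded = pd.get_dummies(X, columns=['cp', 'restecg', 'slope', 'thal'])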

  • @mcmiloy3322  3 years ago

    Really nice video. I thought you were actually going to implement the tree classifier itself, which would have been a real bonus but I guess that would have taken a lot longer.

  • @amalsakr1381  5 months ago +1

    Thank you for your powerful tutorial

  • @avramdagoat  9 months ago +1

    great insight and refresher, thank you for documenting

  • @chaitanyasharma6270  3 years ago +1

    i loved your video support vector machines in python from start to finish and this one too!!! can you make more on different algorithms?

  • @TalesLimaFonseca  2 years ago +1

    Man, you are awesome! Vai BRASIL!!!

  • @ksheerabdhisamantaray7410

    Very good tutorial, and your channel is a blessing. I have one doubt: in your video on pruning (clearly explained), you mention that to find the alpha values we first build a tree on both the training and testing data and then use those values on the training dataset. But here you did it directly on the training set. Is there a reason for this? Or could you mention which of these two is the better way?

    • @statquest  a year ago

      In the pruning video I should have said "all data used for training".

  • @shindepratibha31  4 years ago

    I have almost completed the machine learning playlist and it was really helpful. One request: can you please make a short video on handling imbalanced datasets?

    • @statquest  4 years ago

      I've got a rough draft on that topic here: czcams.com/video/iTxzRVLoTQ0/video.html

  • @abdelrazzaqabuhejleh6625  6 months ago

    Thank you for this valuable explanation :D
    I have a question though: what do we learn from the graph at 51:48?

    • @statquest  6 months ago

      This shows how different trees trained with different subsets of data have different accuracies.

  • @patite3103  3 years ago

    thank you for this video! Would it be possible to do a similar video with random forest and regression trees?

    • @statquest  3 years ago

      I don't like the random forest implementation in Python. Instead, if you're going to use random forests, you should do it in R. And I have a video for that: czcams.com/video/6EXPYzbfLCE/video.html

  • @michelchaghoury870  2 years ago +1

    MANNNN so useful, please keep going

  • @hafiznadirshah3253  3 years ago

    Hey Josh, thanks for another awesome video. Had a couple of questions:
    1) At 40:20, when we initialise the classifier, what will happen if we choose the parameter (splitter = 'random')? In which situations would we want the split to occur randomly at each node, rather than by the default of the best (lowest) Gini impurity?
    2) In the final tree at 58:36, for the bottom-left leaf node, does value = [78, 9] mean the leaf node contains 78 observations with no heart disease and 9 with heart disease?
    3) At 42:30, to assess whether the tree has overfit the training data, can't we also retrieve the accuracy on both the training and test data using clf_dt.score()? If there is overfitting, training accuracy should be significantly higher than test accuracy?

    • @statquest  3 years ago

      1) To be honest, I don't really know. The documentation says you will get the "best random split", but I don't know what that means.
      2) Yes.
      3) Presumably.

    • @hafiznadirshah3253  3 years ago +1

      @@statquest awesome, Thanks for the quick revert again. Ordering a BAM tee as we speak!

    • @statquest  3 years ago

      @@hafiznadirshah3253 Hooray!

  • @InternatoMiguel  7 months ago

    Hello Josh, thank you so much for another great video! Did you end up doing a webinar on imputing values? If so, where can I find it? :)

    • @statquest  7 months ago

      Maybe. What time point in the video, minutes and seconds, are you asking about?

    • @InternatoMiguel  7 months ago

      @@statquest 18:29

    • @statquest  7 months ago

      @@InternatoMiguel Unfortunately, except for imputing data for Random Forests, I haven't covered that topic very much. However, if you are interested in how Random Forests do it... czcams.com/video/sQ870aTKqiM/video.html

  • @anbusatheshkumarpalanisamy8798

    Hi Josh, how are we getting 132 samples in the left node of the final tree? Shouldn't it be 118 from the root node?

    • @statquest  4 years ago

      132 of the samples (both those with heart disease and those without heart disease) have ca values < 0.5, and the samples with ca > 0.5 go to the right. For more details, see: czcams.com/video/7VeUPuFGJHk/video.html

  • @srmsagargupta  3 years ago +1

    Thank you Sir for this wonderful webinar

  • @paulovinicius5833  3 years ago +1

    I know I'll love all the content, but I started liking the video immediately because of the music! haha

  • @Theviswanath57  3 years ago

    In the accompanying theory videos, you mentioned that to compute ccp_alphas we are supposed to use the full data?

    • @statquest  3 years ago

      We use the full testing dataset.

  • @SaurabhKumar-mr7lx  4 years ago +1

    Hi Josh, I see that in sklearn all the tree-based ensemble algorithms have ccp_alpha as a tuning parameter. Is it advisable to do so? Rather, is it feasible to do so for hundreds of trees (especially when the trees are randomly created), or should we tune standard parameters like the learning rate, number of trees, loss function, etc.?

    • @statquest  4 years ago +1

      In this video I tune ccp_alpha (starting at 46:31 ). It spares us the agony of tuning a lot of separate parameters.

    • @SaurabhKumar-mr7lx  4 years ago

      @@statquest Just wondering, is it possible to tune this for a random forest, since we are creating hundreds of trees with randomly selected features for every tree? As far as I understand, ccp is a tree-specific parameter. Please give some insight on this in your next session. Hope my query is relevant 🙂

    • @statquest  4 years ago +2

      @@SaurabhKumar-mr7lx With Random Forests, the goal for each tree is different than when we just want a single decision tree. For Random Forest trees, we actually do not want an optimal tree. We only want something that gets it correct a little more than 50% of the time. So in this case, we just limit the tree depth to 3 or 4 or something like that, rather than optimize each tree with cost complexity pruning.

    • @SaurabhKumar-mr7lx  4 years ago +1

      @@statquest got it ....... Thanks for explaining this Josh.
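
      A minimal sketch of the depth-limiting approach described above (the hyperparameter values are illustrative assumptions, not from the video):

      from sklearn.ensemble import RandomForestClassifier

      # For a random forest we don't prune each tree to be optimal; instead
      # we keep the individual trees shallow by capping their depth.
      rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
      rf.fit(X_train, y_train)
      print(rf.score(X_test, y_test))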

  • @zhihaoxu8119  2 years ago

    Hi Josh! Thanks for the content. I wonder, where is the webinar on handling missing data that you mentioned in this video? Thanks!

    • @statquest  2 years ago

      See: czcams.com/video/wpNl-JwwplA/video.html

    • @zhihaoxu8119  2 years ago +1

      @@statquest Thank you so much!

  • @aalaptube  a year ago

    You mentioned sklearn is not great for a lot of data. In terms of the size of the data, how much is a lot? 1, 10, 100 GB? In those cases, what are the options?
    Also, what does the function cost_complexity_pruning_path do? How does it build the array of alpha values? In the other StatQuest video on using α, we just checked some specific values...

    • @statquest  a year ago

      The answer to the first question depends on how much time you have on your hands. As for the second question, see: scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

  • @prnv5  2 years ago

    Hi Josh! I'm a HS student trying to learn ML algorithms and your videos are genuinely my saving grace. They're so concise, information heavy and educational. I understand concepts perfectly through your statquests, and I'm really grateful for that.
    One quick question: The algorithm used in this case to build a decision tree: is it the CART algorithm? I'm writing a paper on the CART algorithm and would hence like to confirm the same. Thanks again!

    • @statquest  2 years ago +1

      Yes, this is the "classification tree" in CART.

    • @prnv5  2 years ago +1

      @@statquest Thank you so much 🥰

  • @lucillewiid5476  5 months ago

    Hi, Josh, I recommend your videos to all my students and love watching and learning from them 👍. Can we still download this notebook? Or do we need to buy it? Regards from South Africa!

    • @statquest  5 months ago +1

      This notebook has always been for sale and is still for sale if you would like it.

  • @mahdimj6594  4 years ago +1

    Neural Network Pleaseee, Bayesian and LARS as well. And Thank you. You actually make things much easier to understand.

  • @cageman301  2 years ago

    I understand that logistic regression requires continuous variables to be separated into bins and coarse-classed, to ensure that the final model is created with binary variables only. Does this apply to decision trees as well?

    • @statquest  2 years ago

      If you want to learn more about logistic regression, and how it works, see: czcams.com/play/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe.html and if you'd like to learn more about decision trees, see: czcams.com/video/_L39rN6gz7Y/video.html

  • @GokulSKumar-uz9dy  4 years ago +1

    Great video, sir. :)
    I just have a doubt about one part: at 52:14, instead of using X_train and y_train, aren't we supposed to use the entire dataset (i.e., X_encoded and y) while implementing cross-validation?
    Also, later in the video at 52:54, the value for alpha was found by using only the X_train and y_train data in the cross-validation.

    • @statquest  4 years ago +1

      There are different schools of thought about what datasets you should use for cross validation. The most common one, however, is to do it as presented in this video.

    • @GokulSKumar-uz9dy  4 years ago

      @@statquest Thanks a lot!
      Just in case it might be useful: I tried using the entire dataset in cross-validation before splitting it into train and test. I could see from the corresponding confusion matrix that the model correctly predicted 90% of the people not having heart disease, whereas there was no increase in the percentage for people having heart disease.
      Again, loved the video a lot. Waiting for the next webinar. :)

    • @KeigoEdits  2 years ago

      @@statquest Hey Josh sir, after reading this comment I actually went and did cross-validation with the whole dataset. In the above comment I also read that you mentioned we should use the whole dataset in the case of small datasets, and I personally think a 297-datapoint dataset can be called small. This gave me better results at alpha = 0.021097, with accuracy varying between 0.74 and 0.88. What are your views on this?

    • @statquest  2 years ago

      @@GokulSKumar-uz9dy It really depends on how noisy your data is and what you hope to do with it.

  • @khashayarsalehi6779  2 years ago

    Thanks for this great tutorial! I have a question though: I tried a decision tree regressor, but in the end the pruned tree returns the same high value, far out of range, for all inputs! Also, the accuracy for the train and test sets decreases as alpha increases! Can you help me understand how the tree could return the same unreasonable value for all inputs?

    • @statquest  2 years ago

      Unfortunately I don't have time to help you with your code... :(

    • @khashayarsalehi6779  2 years ago +1

      @@statquest You've already helped me with your awesome tutorial! It's OK :) Triple BAM!

  • @dineshmuniandy9519  4 years ago

    Hi Josh, is it possible to apply cost complexity pruning to regression problems (where the predicted target is continuous)? What modifications to the code are required?

    • @statquest  4 years ago +1

      Cost complexity pruning works great with regression trees. Here's a video that describes it: czcams.com/video/D0efHEJsfHo/video.html

    • @dineshmuniandy9519  4 years ago

      @@statquest Just a quick question. What should this line be changed to:
      df = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
      if the target is not a class of 5 different ordinal values, but rather continuous values?
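
      For the regression analogue, a hedged sketch: with a DecisionTreeRegressor, cross_val_score returns R-squared values by default rather than accuracy, so only the column name really needs to change (the range(5) just numbers the 5 cross-validation folds). The alpha value below is illustrative.

      import pandas as pd
      from sklearn.tree import DecisionTreeRegressor
      from sklearn.model_selection import cross_val_score

      # With a continuous target, use a regressor; the default CV score is R-squared.
      reg = DecisionTreeRegressor(random_state=42, ccp_alpha=0.01)
      scores = cross_val_score(reg, X_train, y_train, cv=5)
      df = pd.DataFrame(data={'tree': range(5), 'score': scores})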

  • @pfunknoondawg  3 years ago +1

    Wow, this is super helpful!

  • @krishanudebnath1959  2 years ago +1

    love the tabla and your content

    • @statquest  2 years ago

      Thanks! My father used to teach at IIT-Madras so I spent a lot of time there when I was young.

  • @pfever  a year ago

    Thank you, this video is so helpful! :)
    I have a question: categorical data is transformed using one-hot encoding. What about ordinal data?
    For example, student year: 1, 2, 3, 4. In this case the order is meaningful. I guess we should keep those ordinal features as float64?

    • @statquest  a year ago +1

      Yep.

    • @pfever  a year ago +1

      @@statquest Thank you for always replying to my comments! StatQuest! 😁👍👍👍

  • @6223086  3 years ago

    Hi Josh, I have a question, at 1:01:03 , if we interpret the tree, on the right split from the root node, we first went from a node with Gini Score of 0.346 (cp_4.0

    • @statquest  3 years ago +1

      For each split we calculate the Weighted Average of the individual Gini scores for each leaf and we pick the one with the lowest weighted average. In this case, although the leaf on the left has a higher Gini score than the node above it, it has fewer samples, 31, than the leaf on the right, which has a much lower Gini score, 0.126, and more samples, 59. If we calculate the weighted average of the Gini scores for these two leaves it will be lower than the node above them.
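
      The arithmetic behind that answer, as a short sketch. The sample counts (31 and 59) and the right-leaf Gini (0.126) come from the reply above; the left-leaf Gini of 0.3 is an assumed value for illustration.

      # Weighted average of the leaf Gini scores after the split.
      n_left, gini_left = 31, 0.3      # gini_left is assumed, for illustration
      n_right, gini_right = 59, 0.126  # from the reply above

      weighted_gini = (n_left * gini_left + n_right * gini_right) / (n_left + n_right)
      print(weighted_gini)  # ~0.186, lower than the parent node's 0.346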

  • @aleksandartta  2 years ago

    How do you implement a pipeline with cost complexity pruning? Consider the part which starts just before 49:00... Thanks in advance! You are the best teacher...

  • @vipanpatial2243  2 years ago +2

    BAM!! You are the best.

  • @haoranzhang3993  2 years ago

    Thank you Josh for the nice videos! Questions: 1) What is accuracy? Is there a relationship between Gini impurity / sum of squared residuals and accuracy (i.e., does lower Gini impurity mean higher accuracy)? 2) Once we create a tree classifier with a certain alpha, will different training data sets give different fitted trees? And how will they differ?

    • @statquest  2 years ago

      Accuracy is the percentage of the data that are correctly classified. The lower the gini index, the higher accuracy. Different training datasets will probably give a different value for alpha, so it's good to use cross validation to find the best value.

    • @haoranzhang3993  2 years ago

      @@statquest Thank you Josh for the quick reply. This is very helpful! But AFTER the optimal alpha value is identified via cross validation, will different training datasets give different final pruned trees from the call below? clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)

    • @statquest  2 years ago +1

      @@haoranzhang3993 Yes
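
      A one-line illustration of the accuracy definition given above, assuming the fitted clf_dt_pruned and the test data from the video:

      from sklearn.metrics import accuracy_score

      # Fraction of test samples whose predicted class matches the true class.
      print(accuracy_score(y_test, clf_dt_pruned.predict(X_test)))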

  • @alpatul  2 years ago +2

    This is great. Do you have any more Python webinars related to machine learning? I would love to go through them.

    • @statquest  2 years ago

      I'm working on more right now.

    • @oliveryoule11  2 years ago

      @@statquest At 19 minutes you say you have plans for a whole webinar on missing data! This is what I need. Where can I find it or is it still in production? :D

    • @statquest  2 years ago

      @@oliveryoule11 Dang! I'd forgotten about that. I guess you could say it's still in production. :)

    • @oliveryoule11  2 years ago

      @@statquest Thanks for replying! I can see how easy it is to forget! You have so much content it's unreal! Very impressive! I just purchased your notebook through the link, but it doesn't appear to have arrived in my inbox. Can you advise? I am also strongly considering paying for your Patreon. I currently pay for Datacamp, but your material is so much better!

    • @statquest  2 years ago

      @@oliveryoule11 Wow! Thanks for supporting me, and I'm sorry you had trouble purchasing the notebook. If you contact me through my website, I can send it to you directly: statquest.org/contact/