
Data Splitting using Cross Validation and Bootstrap in R

  • Published 18. 08. 2024
  • ☕ If you would like to support, consider buying me a coffee ☕: buymeacoffee.c...
    For one-on-one tutoring/consultation services: guide-tree-sta...
    I offer one-on-one tutoring/consultation services for many topics related to statistics/machine learning. You can also email me at statsguidetree@gmail.com
    For the R code and dataset: gist.github.co...
    This video is an R tutorial on various data-splitting (i.e., model validation, data partitioning) methods using the caret package to estimate accuracy and error. I go over the following methods: train/test hold-out, leave-one-out cross-validation, k-fold cross-validation, repeated k-fold cross-validation, and the 632 bootstrap. The dataset I use is the heart failure dataset. For a review of logistic regression models, please check out the video:
    • Logistic Regression wi...
    For the formulas used to calculate the metrics provided in the confusion matrix output:
    rdrr.io/cran/c...
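    To make these concrete, below is a minimal sketch of the caret resampling setup each method corresponds to. The trainControl() method names are caret's own; the commented train() call and the DEATH_EVENT outcome are illustrative:

    library(caret)
    # Each scheme is a trainControl object passed to train() via trControl
    ctrl_holdout <- trainControl(method = "LGOCV", p = 0.8, number = 1)            # single train/test hold-out
    ctrl_loocv   <- trainControl(method = "LOOCV")                                 # leave-one-out CV
    ctrl_kfold   <- trainControl(method = "cv", number = 10)                       # 10-fold CV
    ctrl_rkfold  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)  # repeated 10-fold CV
    ctrl_b632    <- trainControl(method = "boot632", number = 100)                 # bootstrap with the 632 correction
    # Illustrative use with logistic regression (outcome must be a factor):
    # fit <- train(DEATH_EVENT ~ ., data = heart, method = "glm",
    #              family = binomial, trControl = ctrl_kfold)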

Comments • 6

  • @statsguidetree
    @statsguidetree  2 years ago +2

    Here is the R code:
    ##### Data splitting methods for model validation
    ##### Some methods reviewed:
    # Test/train hold-out
    # Leave-one-out cross-validation (LOO CV)
    # k-fold cross-validation (k-fold CV)
    # Repeated k-fold cross-validation (repeated k-fold CV)
    # Bootstrap resampling with the 632 method
    ##### When should these be used and why are they important?
    # Overfitting adversely impacts the
    # generalizability of the model.
    ##### Differences between linear vs. logistic regression
    #####################################################################################
    # Load dataset for example
    #####################################################################################
    # Dataset of patients with heart failure
    # Find and load the dataset downloaded from
    # www.kaggle.com/andrewmvd/heart-failure-clinical-data
    heart <- read.csv("heart_failure_clinical_records_dataset.csv")  # file name assumed from the Kaggle download
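    The gist above is truncated, so here is a hedged sketch of how the hold-out portion might continue; the DEATH_EVENT column name comes from the Kaggle file, while the 80/20 split, seed, and cutoff are assumptions:

    # --- Sketch (assumed continuation): test/train hold-out ---
    library(caret)
    set.seed(123)
    heart$DEATH_EVENT <- factor(heart$DEATH_EVENT, labels = c("No", "Yes"))
    idx       <- createDataPartition(heart$DEATH_EVENT, p = 0.8, list = FALSE)
    train_set <- heart[idx, ]
    test_set  <- heart[-idx, ]
    # Fit logistic regression on the training set only
    fit_glm <- glm(DEATH_EVENT ~ ., data = train_set, family = binomial)
    # Evaluate on the held-out set; confusionMatrix() reports accuracy,
    # sensitivity, specificity, kappa, and related metrics
    probs <- predict(fit_glm, newdata = test_set, type = "response")
    preds <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
    confusionMatrix(preds, test_set$DEATH_EVENT)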

  • @meme31382
    @meme31382 2 years ago +2

    Thanks for your time; the video gives very complete information on model validation. I just have a doubt: I have seen that some people split the data into training and test sets and then apply k-fold CV to the training set, while in your code k-fold CV is applied to the complete dataset. Which is more correct, and why?

    • @statsguidetree
      @statsguidetree  2 years ago +2

      That is a really good question. Generally speaking, you do not need to test/train split your data before using k-fold CV. If your goal is to validate your model (i.e., evaluate its generalizability), you need to test it against data the model has not already seen -- and since k-fold CV already does that, you do not need to start with a test/train split. Both workflows are sketched below.
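      A rough sketch of the two workflows with caret (assuming heart is loaded and DEATH_EVENT is already a factor, as in the code above):

      library(caret)
      set.seed(1)
      ctrl <- trainControl(method = "cv", number = 10)
      # (a) k-fold CV on the full data: the folds themselves supply unseen data
      fit_a <- train(DEATH_EVENT ~ ., data = heart, method = "glm",
                     family = binomial, trControl = ctrl)
      fit_a$results   # CV accuracy/kappa estimate the generalization error
      # (b) hold-out + CV: CV on the training set validates the model,
      #     and the untouched test set gives one final check
      idx   <- createDataPartition(heart$DEATH_EVENT, p = 0.8, list = FALSE)
      fit_b <- train(DEATH_EVENT ~ ., data = heart[idx, ], method = "glm",
                     family = binomial, trControl = ctrl)
      confusionMatrix(predict(fit_b, heart[-idx, ]), heart[-idx, ]$DEATH_EVENT)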

  • @user-cl6qi3om8x
    @user-cl6qi3om8x 6 months ago

    I noticed that method = "glm" was used for LOOCV, but what if you have a nominal dependent variable (outcome of 0/1/2)? How can we run LOOCV on that? Any help is appreciated.
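    One possible approach (a sketch, not from the video): caret's train() also wraps multinomial logistic regression as method = "multinom" from the nnet package, and the same LOOCV control applies; the toy data below is purely illustrative:

    library(caret)
    library(nnet)   # provides the model behind method = "multinom"
    # Toy data with a three-level nominal outcome (illustrative only)
    set.seed(42)
    df <- data.frame(x1 = rnorm(90), x2 = rnorm(90),
                     y  = factor(sample(0:2, 90, replace = TRUE),
                                 labels = c("class0", "class1", "class2")))
    ctrl <- trainControl(method = "LOOCV")
    fit  <- train(y ~ x1 + x2, data = df, method = "multinom",
                  trControl = ctrl, trace = FALSE)
    fit   # LOOCV accuracy across the decay tuning grid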

  • @user-dm2xg1ue2m
    @user-dm2xg1ue2m 7 months ago

    Very informative video.
    I am trying to train an RF model with 40+ independent variables. I am currently using k-fold CV with 3 repeats, and it is taking a lot of time. How can I reduce the model training time? I am afraid that if I use the bootstrap method, it may take even longer, 2-3 days!
    Any suggestions?
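    One common remedy (a sketch under assumptions about the setup; dat and y are placeholder names for the data and outcome) is to register a parallel backend, since caret's train() distributes resampling iterations via foreach when one is available; trimming folds, repeats, and the tuning grid also cuts the number of model fits:

    library(caret)
    library(doParallel)   # also loads foreach and parallel
    # Register a parallel backend; train() then runs resampling across workers
    cl <- parallel::makeCluster(parallel::detectCores() - 1)
    registerDoParallel(cl)
    ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,  # fewer folds/repeats = fewer fits
                         allowParallel = TRUE)
    fit  <- train(y ~ ., data = dat, method = "rf",   # requires the randomForest package
                  trControl = ctrl,
                  tuneLength = 3,   # small mtry grid instead of a wide search
                  ntree = 300)      # passed through to randomForest; fewer trees train faster
    stopCluster(cl)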

  • @user-cl6qi3om8x
    @user-cl6qi3om8x 6 months ago

    I've got nominal data (dependent variable outcome of 0/1/2); how do you run LOOCV on a multinom model? Any help is appreciated.