eXtreme Gradient Boosting XGBoost Algorithm with R - Example in Easy Steps with One-Hot Encoding
- Added 19. 05. 2024
- Provides an easy-to-apply example of the eXtreme Gradient Boosting (XGBoost) algorithm with R.
Data file and R code: github.com/bkrai/Top-10-Machi...
Machine Learning videos: goo.gl/WHHqWP
Timestamps:
00:00 eXtreme Gradient Boosting XGBoost with R
00:04 Why eXtreme Gradient Boosting
00:34 Packages and Data
02:02 Partition Data
03:25 Create Matrix & One Hot Encoding
07:35 Parameters
09:59 eXtreme Gradient Boosting Model
11:51 Error Plot
16:50 Feature Importance
18:00 Prediction and Confusion Matrix - Test Data
24:03 More XGBoost Parameters
Includes:
- Packages needed and data
- Partition data
- Creating matrix and One-Hot Encoding for Factor variables
- Parameters
- eXtreme Gradient Boosting Model
- Training & test error plot
- Feature importance plot
- Prediction & confusion matrix for test data
- Booster parameters
R is a free software environment for statistical computing and graphics, and is widely used in both academia and industry. R works on both Windows and macOS. It was ranked no. 1 in a KDnuggets poll on top languages for analytics, data mining, and data science. RStudio is a user-friendly environment for R that has become popular.
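The steps listed above can be sketched in R roughly as follows. This is a minimal, hedged outline of the workflow the video walks through, not the video's exact code; the column names (admit, rank) follow the admissions dataset used there, and the file name "binary.csv" is an assumption — adjust to your own data.

```r
library(xgboost)
library(Matrix)   # for sparse.model.matrix()

# Packages and data
data <- read.csv("binary.csv")
data$rank <- as.factor(data$rank)

# Partition data
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]
test  <- data[ind == 2, ]

# Create matrix and one-hot encoding ("-1" drops the intercept column)
trainm <- sparse.model.matrix(admit ~ . - 1, data = train)
train_label <- train$admit
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(admit ~ . - 1, data = test)
test_label <- test$admit
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)

# Parameters and model
nc <- length(unique(train_label))
params <- list(objective = "multi:softprob",
               eval_metric = "mlogloss",
               num_class = nc)
watchlist <- list(train = train_matrix, test = test_matrix)
bst_model <- xgb.train(params = params, data = train_matrix,
                       nrounds = 100, watchlist = watchlist)

# Prediction and confusion matrix on test data
p <- predict(bst_model, newdata = test_matrix)
pred <- matrix(p, nrow = nc, ncol = length(p) / nc)
pred_label <- max.col(t(pred)) - 1
table(Prediction = pred_label, Actual = test_label)
```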
Best video on the internet on XGBoost, you just saved my paper. Thanks a lot :)
You're welcome!
I agree 100% with you.
Thanks a lot, Prof! You sent me the link to this video and it really helps. But as someone suggested in the comments, the parameters in the model are key, and a more detailed explanation of them and of the algorithm as a whole would really be appreciated too. I am blessed to be a subscriber of your videos!
Thanks for your comments and suggestion!
I got a much higher level of clarity in the concept of xgboost model and parameter usage with this video. Thanks a lot Sir
Thanks for comments!
Hi Bharatendra, I derive a lot of value from your tutorials, which strike the right balance between being simple yet very useful. Love them!
Thanks for your feedback and comments!
Such an elaborate explanation.
Please keep posting such videos. They will be very useful for the community.
I've benefitted a lot from this video.
Thank you, I will
Best clarity so far on XGBoost, it helped a lot in my final project and in learning more about this algorithm compared to GBM.
Thanks for comments!
Respect you sir. The kind of knowledge you are sharing from Massachusetts is very very helpful. Thank you so very much Sir.
Thanks!
You are a skillful tutor. Keep it up, and Happy New Year!
Happy New Year 2018!
Thank you so much, this is the video I have been looking for for a long time. I didn't find anything as interesting elsewhere; you have explained everything in detail, and it's interesting too.
Thanks for comments!
Thank you Sir for making it so easy
This helped so much on a classification project I am doing. Much thanks!
You're very welcome!
Fantastic tutorial, thank you!
Thank you for the tutorial.
Given that you have a binary target, I was wondering why you haven't used objective = 'binary:logistic' and eval_metric = 'logloss'.
Is there a downside to using "multi:softprob" for a binary classification problem, when it is typically used for multiclass classification where n > 2? I'd appreciate it if you could help clarify this.
That was a very good tutorial! I wonder whether and how we could use cross-validation for choosing the eta, gamma, nrounds, etc. parameters. I would be happy to have any suggestions.
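One way to do what this comment asks is xgb.cv() from the xgboost package. The sketch below (an illustration, not from the video) assumes train_matrix is the xgb.DMatrix built earlier and compares a few eta values by cross-validated mlogloss:

```r
library(xgboost)

# Try several learning rates with 5-fold cross-validation
for (eta in c(0.05, 0.1, 0.3)) {
  cv <- xgb.cv(params = list(objective = "multi:softprob",
                             eval_metric = "mlogloss",
                             num_class = 2,
                             eta = eta),
               data = train_matrix, nrounds = 100, nfold = 5,
               early_stopping_rounds = 10, verbose = 0)
  best <- min(cv$evaluation_log$test_mlogloss_mean)
  cat("eta =", eta, "best test mlogloss =", best, "\n")
}
```

The same loop idea extends to gamma, max_depth, and other parameters.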
After weeks of searching for videos on using XGB and predicting continuous variable, I could not find any decent videos... nor were any of them as well explained (and entertaining) as your videos. Please make one for the community? Best wishes from London, UK
Thanks for the suggestion and comments, I'm adding this to my list of future videos.
Thank you for this tutorial. Awesome. Step by step explanations made things much easier to understand
You're very welcome!
You may also find this useful:
czcams.com/video/GmkHvDs0GG8/video.html
@@bkrai Thank you very much
When you find time, kindly have a look at my channel on R. Everything is like a standalone application.
czcams.com/channels/DmEAmoLuyE0h61aGpthGvA.html
You are welcome!
Very easy to follow, no errors in code, just great.🤓🙂
Great to hear!
Thank you sir! awesome explanation skills with depth of algo
Thanks for your comments and finding it useful!
Hi Sir, your videos are great. Let me ask you a question: I have read that it is possible to implement survival analysis (Cox regression) with the xgboost package, by indicating "survival:cox" as the learning task parameter. I haven't found any tutorial on this. Do you know if it is necessary to do any extra work? For example, to specify the time variable somewhere else? Thanks in advance.
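An untested sketch of what the question describes (not covered in the video): in the xgboost package, "survival:cox" takes the time variable as the label, with negative values marking right-censored observations. X, time, and status below are placeholders for your own objects.

```r
library(xgboost)

# Positive label = observed event time; negative label = censored time
surv_label <- ifelse(status == 1, time, -time)
dtrain <- xgb.DMatrix(data = as.matrix(X), label = surv_label)

cox_model <- xgb.train(params = list(objective = "survival:cox",
                                     eval_metric = "cox-nloglik"),
                       data = dtrain, nrounds = 100)
```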
Thanks for the model. A big help for me.
Thanks for comments!
your explanation is awesome !
Thanks!
Cheers Amazing Video Mate!
Thank you, Sir, for explaining the model so well. I am doing something similar with my data. How can I show the probabilities of predictors (similar to the one in a decision tree)?
Very very informative. Thanks!
Thanks for comments!
Hello Sir, can you please share an example where the response variable is continuous?
thank you for your sharing.
We can also increase the range on the y-axis by using the following lines:
plot(e$iter, e$train_mlogloss, col = "blue", type = "l", ylim = c(0, 1))
lines(e$iter, e$test_mlogloss, col = "green")
legend("topright", legend = c("Training Error", "Testing Error"), lty=c(1,1), col = c("blue", "green"))
But I guess for the purposes of this video not using the ylim parameter may be intentional and warranted.
Thank you for the great video as always
thanks!
Thank you sir, This is very helpful
thanks!
Excellent!
thank you Sir...
Simply Awesome and excellent ..
Thanks for comments!
Thanks for the amazing XGBoost tutorial! I can't believe you make every application of machine learning so easy. I really want your help figuring out how to apply XGBoost to time-to-event data. There are very limited resources on XGBoost with the Cox model. Do you have any suggestions? Thanks.
I don't have one at this time, but I have added it to my list.
Thank you it was very helpful!!
Thanks for comments!
Hi sir. Your video is very good and easy to understand. I have one question: what classifier algorithm is used in the xgboost package for classification? I read on another website that the package includes "tree learning algorithms". Is it a decision tree algorithm? Thank you in advance for your clarification.
Thank you so much
Excellent video, thanks!
Thanks for comments!
Sir, can you please make a video on stacking models built from different DL models? Thanks a lot for the informative videos, sir.
Hi Sir, let me ask a question. In a binary classification context, how do you predict when it is not possible to know the values of the target variable, i.e. in a forecasting scenario? I mean, you need to forecast a result and have a new dataset without the response variable; you don't know whether a student will be admitted or not, but still need to make a prediction using xgboost.
I tried to do this by setting, in the "test set" (the new dataset without the response variable), an outcome variable with a fixed value (0, for instance) to be able to run xgboost; however, the prediction is pretty inaccurate.
Thanks very much!
Thank you very much. You helped me a lot.
Thanks for comments!
Hi sir, what line of code should I add if I want the confusion matrix to also display the 95% CI and test p-value? Great lecture, thank you.
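One option (an illustration, assuming pred and test_label come from the prediction step in the video): caret's confusionMatrix() already reports the accuracy 95% CI and the p-value of accuracy versus the no-information rate.

```r
library(caret)

# Both arguments must be factors with the same levels
cm <- confusionMatrix(factor(pred), factor(test_label))
print(cm)   # output includes "95% CI" and "P-Value [Acc > NIR]"
```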
Thank you for the tutorial, it really helped my understanding. I have a question: why can't we do dummy encoding for categorical variables in XGBoost?
You may try. It should work fine.
thanks a lot for the explanation.
Thanks for comments!
I am enjoying watching your videos, starting from the simplest to the more complicated ones! Thank you, Dr. Rai, for your great explanation. I have one question, though: when you divide the data into train and test data, you use data[ind==1, ] and data[ind==2, ]; it is not clear to me how this magically works. What I see is data[x, y], where the only values x can take are blank and integers from 1 to 400, and the only values y can take are blank and integers from 1 to 4. Can you explain what is going on? Or is there anything I am missing?
You can refer to this for explanation:
czcams.com/video/RBojq0DAAS8/video.html
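In short, the partitioning lines work like this (a sketch of the idea): sample() draws one value from c(1, 2) per row, with probabilities 0.8 and 0.2, and indexing with ind == 1 keeps the rows labeled 1 (all columns, since the column index is left blank).

```r
set.seed(1234)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]   # ~80% of rows, every column
test  <- data[ind == 2, ]   # ~20% of rows, every column
```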
Very useful , thank you!
Thanks for comments!
Thanks for the video!
A quick question though: What's the motivation behind the 'prob' vector in 'ind
prob specifies the sampling probabilities for the two partitions. For more details about data partitioning, you can look at this link:
czcams.com/video/aS1O8EiGLdg/video.html
Also date variables are handled differently. Probably I'll do a video about it later.
Thanks very much for this tutorial - definitely made things easier to understand.
I have a question regarding "objective" = "multi:softprob" in the parameter section. The admission problem in the example is a binary (logistic) problem, right? So why should we use multi:softprob instead of binary:logistic? If I try the model with the binary:logistic objective, my model fails.
It would be great if you could help me out on when to use which objective! Thanks.
Multi works for 2 or more levels.
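For comparison, here are the two objectives side by side (a sketch; train_matrix, test_matrix, and watchlist as built in the video). If binary:logistic "fails", one common cause is leaving num_class in the parameter list — it must be dropped for the binary objective:

```r
library(xgboost)

params_binary <- list(objective = "binary:logistic",
                      eval_metric = "logloss")        # note: no num_class here
bst_binary <- xgb.train(params = params_binary, data = train_matrix,
                        nrounds = 100, watchlist = watchlist)

p <- predict(bst_binary, newdata = test_matrix)       # one P(admit = 1) per row
pred <- ifelse(p > 0.5, 1, 0)
```

With multi:softprob and num_class = 2, predict() instead returns two probabilities per row, which is why the video reshapes the output into a matrix.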
Very excellent explanation, lots of thanks.
I have one doubt: is it possible to use image data, especially satellite data?
For image data, deep learning is more effective. You can explore the 'deep learning' playlist on this channel.
Hello Sir,
In a real scenario, where we have separate test data with no dependent variable, how will sparse.model.matrix work?
Dear Rai, I hope you are doing well. I have one question. I am building a machine learning model using the RandomForest and XGBoost algorithms. My data is a survey of samples drawn from a large population. It has a sampling weight, which is the number of individuals in the population each respondent in the sample represents. How can I apply this sampling weight in my ML model? The data also contains strata and clusters. Do I have to keep the sampling weight, strata, and cluster variables among my features?
Thanks a lot
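For the xgboost side of this question, an untested sketch: xgb.DMatrix accepts a per-row weight vector, so the survey sampling weight can be supplied there instead of being kept as a feature. X (feature matrix), y (label), and w (the weight column) are placeholders for your own objects.

```r
library(xgboost)

# Per-row weights tell xgboost how much each observation should count
dtrain <- xgb.DMatrix(data = as.matrix(X), label = y, weight = w)
model <- xgb.train(params = list(objective = "binary:logistic"),
                   data = dtrain, nrounds = 100)
```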
Thank you so much for your instructive and insightful tutorial!
I've one question:
Do I only need one hot encoding for my inputs / features?
What about the outputs, is xgboost able to forecast a categorical variable as a label?
Or should I make one hot encoding for my labels as well?
Kind regards
Jonathan
For XGBoost, the response variable also needs to be numeric. In the example that I used, admit is a factor variable, but since it has the two values 0 and 1 in numeric form, we didn't do anything. For a further explanation of the variables, you can also refer to this link:
cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
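When the response is a factor with non-numeric levels, the conversion looks like this (a small illustration; xgboost expects a 0-based integer label):

```r
y <- factor(c("no", "yes", "no", "yes"))
train_label <- as.numeric(y) - 1   # factor levels 1..k become 0..k-1
train_label                        # 0 1 0 1
```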
Thank you very much for your explanations and the link!
What, in your opinion, is more suitable in multiclass cases? Suppose we have one categorical variable with 10 classes (0 to 9), where every number is a class. What do you think is better?
1. Make one model to forecast this categorical variable, getting 10 different probabilities that sum to 1.
2. Make 10 different models, each forecasting yes or no (0 or 1) for one of the 10 classes.
In the end we take the class whose model gives the highest yes-probability as the forecast.
Thanks in advance
Jonathan
The code has room for improvement. For instance, when splitting the data, instead of using sample() you can use createDataPartition(), in order to preserve the proportions of the categories in the Y variable. That improvement takes accuracy from 0.7066667 to 0.7375.
Another improvement is to use, say, 10-fold cross-validation instead, with the caret R package and train().
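A sketch combining the two suggestions above: a stratified split with createDataPartition(), then 10-fold cross-validation through caret's train(). Column names follow the admissions data from the video; this is an illustration, not the commenter's exact code.

```r
library(caret)

set.seed(1234)
# Stratified 80/20 split that preserves the class proportions of admit
idx <- createDataPartition(data$admit, p = 0.8, list = FALSE)
train_df <- data[idx, ]
test_df  <- data[-idx, ]

# 10-fold cross-validated xgboost via caret
fit <- train(factor(admit) ~ ., data = train_df,
             method = "xgbTree",
             trControl = trainControl(method = "cv", number = 10))
pred <- predict(fit, newdata = test_df)
```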
Thanks for sharing!
I have a basic question. In logistic regression using the glm function, we get a model with the predictors included in it. But here, I don't know which predictors are included in bst_model. Could you please guide me on extracting those predictors from bst_model? Thank you very much.
Thank you! I wish you would use caret more, though.
Thanks for the suggestion!
I have seen your lectures on logistic regression and randomForest as well; they are awesome. Do we need cross-validation in these ML methods? I haven't observed any cross-validation step in your lectures on LR, RF, and xgboost.
I've split the data into train and test. But there is no harm in doing CV.
Can we use xgboost and adaboost for multiclass models?
When using 'adaboost' I'm getting the following error:
"Error: Dependent variables must have two levels"
My dataset has 3 levels. Your inputs will be helpful and appreciated!
How do I interpret cover, etc.? Also, how can we do grid search here for optimization?
Thanks, sir, brilliant.
Thanks for feedback!
Hello,
is it possible to change the cutoff of the XGB model prediction?
In my model evaluation phase I had a case where the AUC in the ROC curve of one model was higher than another's,
despite a clearly worse confusion matrix and accuracy. My guess is that this could be a cutoff issue.
Kind regards
Jonathan
ROC curve already makes use of various cutoffs to draw the curve. With one cutoff value we will just get one point and not a curve. Looking at two curves can give you better idea about the reasons behind AUC difference.
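If you do want a single-cutoff prediction, you can threshold the probabilities yourself (a sketch, assuming a binary:logistic model so that predict() returns one probability per row):

```r
p <- predict(bst_model, newdata = test_matrix)

# Compare accuracy at several cutoffs instead of the default 0.5
for (cut in c(0.3, 0.5, 0.7)) {
  pred <- ifelse(p > cut, 1, 0)
  cat("cutoff:", cut, " accuracy:", mean(pred == test_label), "\n")
}
```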
Thank you for explaining clearly. If I have five character independent variables in the dataframe and I don't want to drop them, how can I proceed with this concept? That is, how would the character variables be converted to numeric data?
You can do one-hot encoding as shown in the video.
Great video! Do you know how to get confidence or prediction interval for xgboost in r? Thanks
You can get more details here:
czcams.com/video/hCLKMiZBTrU/video.html
Thanks for the instructive video, Sir. I am using a test set that does not contain the dependent variable column, because I am supposed to predict that column in a regression problem. How should I edit the script for test_label and watchlist? Thank you.
You can try this:
new_matrix
Thanks, Rai, for your helpful tutorial! It really helped me understand and do XGBoost in R. I have a question: if I want to work on a regression problem, can I use the same code, or which parameters should I modify? Hope to hear from you soon.
You can see an example here:
czcams.com/video/hCLKMiZBTrU/video.html
You can also get some practice by doing this competition:
czcams.com/video/Dn028hqWnUA/video.html
@@bkrai , really helpful! Thanks again for your detail tutorial. Wish you all the best!
You are welcome!
Thankyou Sir,
Please also give guidance on how to install the LightGBM package in R and its uses.
Thanks, I've added it to my list.
absolute legend!
Thanks for comments!
Thanks for the video. In what scenarios should we use eXtreme Gradient Boosting?
You can use it for better accuracy and faster runs compared to many other methods.
thanks a lot :)
Great tutorials :)
Thanks for comments!
you are a legend!!!
Thanks for comments!
Hi Bharatendra, nice and very useful video. I have a question: in my case I have around 4.5 lakh observations and 250 features. I am trying to run XGBoost, and it's taking some time, which is OK. Note: my data is highly class-imbalanced, with 75% 0's and 25% 1's. Do you suggest using XGBoost here? Thanks!
I would suggest take care of class imbalance problem (CIP) before running XGBoost. It will improve accuracy significantly. Here is the link for CIP:
czcams.com/video/Ho2Klvzjegg/video.html
Is there a video for checking the model using chi-square?
Thank you so much for the tutorial.
I have a question:
how do I plot the ROC curve and compute AUC on the same data set? Can you provide the code for the ROC curve and AUC?
Here is the link:
czcams.com/video/ypO1DPEKYFo/video.html
Hello Bharatendra Rai,
did you make a video about setting up feature selection in R?
It would be very useful for the case where you have lots of features/inputs and you want to find out
which of these features are relevant, to determine a feature subset for the classifier.
Kind regards
Jonathan
I'll be doing one in August.
Bharatendra Rai Looking forward to it! 👍 Thank you for your deep and to-the-point data science tutorials - I recommend them in Karlsruhe to every student who wants to run ML models in R.
Thanks for your comments and recommendations!
Is there any example of using XGBoost to predict a continuous outcome? It seems that this video covers the classification case.
Thanks for the video. I have a question: why didn't you use objective = "binary:logistic"?
Yes, that should be more appropriate.
Hello Sir, can we add hyperparameter tuning in XGBoost? If yes, then how?
Thank you so much, sir, for your in-depth tutorials. Sir, could you please post the GitHub link for the code as well?
Link to the code is in the description area below the video.
Link to GitHub: github.com/bkrai/Top-10-Machine-Learning-Methods-With-R
Hello Professor, thank you for this video.
I'm receiving this error after running the same line of code you have on line 22. Any ideas on how to resolve it?
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : The length of labels must equal to the number of rows in the input data
The following provides a clue: "length of labels must equal to the number of rows in the input data".
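A quick check for this error (names as in the video's code): the label vector and the one-hot matrix must have the same number of rows.

```r
nrow(trainm)          # rows in the one-hot encoded matrix
length(train_label)   # must equal nrow(trainm)
```

A mismatch usually means the label was taken from a different data frame than the one passed to sparse.model.matrix(), or rows with NAs were dropped in only one of the two.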
Can you please upload videos on LSTM in Keras in R for numerical, categorical, and multiclass outcomes? It would be really great.
Thanks for the suggestion! It's on my list for future videos.
Also, Prof Rai, I am building an ensemble model of Random Forest and XGBoost with R. My response variable has 2 levels, 'Low' and 'High', and its type in R is factor. Without converting these to 0's and 1's, can I build the model? Also, some of my predictor variables have levels A, B, C, D, and E, and their types as detected by R are factors. Do I have to convert these to zeros and ones even though they are factors before I use them?
When you use random forest, you do not need to convert categorical independent or dependent variables to numeric. But you definitely need numeric variables when using xgboost.
Your explanation helped a lot, thanks. I am building an ensemble of Random Forest and XGBoost on a classification problem. I have imbalanced data, so I used your video to balance ONLY my training data (I hope that's all I need to do in terms of balancing?). After balancing, I applied your one-hot encoding tutorial to both my balanced train data and my unbalanced test data. My XGBoost is running well, though I am yet to test it. BUT the problem is the Random Forest. When I pass the data through the RF, I get the error message below:
Error in t.default(x) : argument is not a matrix
In addition: Warning messages:
1: In randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, :
The response has five or fewer unique values. Are you sure you want to do regression?
2: In is.na(x) :
is.na() applied to non-(list or vector) of type 'externalptr'
What could be the solution? Your help would be greatly appreciated, Prof Rai!
First, I should thank you for providing such a helpful educational channel. Thanks a lot, Sir. I have a question regarding the factor parameter.
Should I turn all integer values into factors? Because I got the error: "xgb.DMatrix(data = as.matrix(train), label = train_label) :
REAL() can only be applied to a 'numeric', not a 'integer'"
Could you please explain how you chose the rank column to turn into a factor and matrix variable?
Best Regards,
I used rank as an example for dealing with factor variables. In your dataset if you have any factor variable, you can handle it in a similar manner.
Would you consider using caret and calling xgboost there directly? Is there a benefit to this direct method versus using caret? Thank you
That should also work fine. As long as we use the same method, model performance is not likely to be significantly different.
Hi Rai, hope everything is going well. I am currently working on an ML algorithm with a continuous outcome variable. I am new to regression models. I want to develop randomForest and XGBoost regressions. Can I ask for any reference videos and code related to a regression algorithm using RandomForest and XGBoost?
Refer to:
czcams.com/video/hCLKMiZBTrU/video.html
Hi Bharatendra, I tried searching for bagging/boosting and SMOTE videos in your playlists. Aren't they out yet? If not, waiting to see them :)
Not yet.
Hi Rai. Great job. I have one question: how can we construct the ROC curve and AUC for the XGBoost model?
See if this helps. It has more detailed coverage:
czcams.com/video/ftjNuPkPQB4/video.html
@@bkrai Thank you so much
You are welcome!
Please give an explanation of the algorithm itself, so that it's easier to understand.
Hi Prof., I have come again with a question, since I am learning a lot from your videos. Could you please explain the 'eta' parameter in xgboost in detail? Also, I want to report the AUC metric for my xgboost model and need your guidance; I have seen examples on Google, but I get errors when I try. I am making a presentation on xgboost soon. Your help will be appreciated.
eta is the learning rate. When it is high, computation is faster, but you may miss the optimum. When it is low, computation is slower, but there is a better chance of hitting the optimum. Depending on the data size and problem, we try various values to explore what is best for a given problem. For AUC you can try this:
czcams.com/video/ypO1DPEKYFo/video.html
Hi Sir,
Why have you used "-1" in the sparse.model.matrix function?
Does it specify that the first column is not to be included, or does it exclude only one column, i.e., the response variable?
The number of classes is 2, so if we put -1 those classes become 0 and 1; in this case 0 is not admitted and 1 is admitted.
Thanks for the update!
Here is an update: "-1" removes the extra intercept column which this command creates as the first column.
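A quick way to see what "-1" removes (a small illustration with the admissions columns): without it, model.matrix() puts an "(Intercept)" column of 1s first; with a factor on the right-hand side, "-1" also keeps all factor levels instead of dropping one, which is what makes the one-hot encoding complete.

```r
head(model.matrix(~ gre + rank, data = data))      # first column: (Intercept)
head(model.matrix(~ gre + rank - 1, data = data))  # intercept column gone
```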
Thank you for your valuable video. I have a question: the bst_model step does not work for me. My data has 122 classes. When I run it, R displays the error "label must be in [0, num_class)". I have tried many nrounds values in the range 0 to 122, but it hasn't worked. Hope to get your response. Many thanks!
I think 122 is too many classes. Make sure the labels are coded 0 to 121 and that you have enough data for each class; otherwise there could be issues.
@@bkrai Do you have any solution to handle it, Dr.?
It is difficult to say much without looking at the data.
Can you please tell me which editor you used?
I use Final Cut Pro.
I have one query. Here in this example we know the response variable in the test set, since we divided the actual data 80/20. But in real life, as in Kaggle competitions, we need to predict on a test set given by Kaggle, where we must predict the response variable. So how does that fit into the above code, i.e., how do we do prediction on an actual test set in xgboost? Thanks in advance.
This code will not change much, but you will definitely have to make some adjustments before you can correctly submit your file on Kaggle. You can refer to this example:
czcams.com/video/4ld-ZfrCc0o/video.html
Q: What's stopping someone from just changing all their variables to numeric types and skipping the one-hot encoding process altogether? Does it hurt the prediction?
I would suggest trying both and comparing results.
Hi Rai, my test data doesn't have response variables; I need to predict them. What should I do with all the test_matrix stuff?
You can artificially create it and fill with zeros.
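A sketch of that suggestion: give xgb.DMatrix a placeholder label of zeros for the new data. It is only there to satisfy the function and is ignored at prediction time. new_data is a placeholder for your unlabeled dataset.

```r
library(xgboost)
library(Matrix)

# One-hot encode the unlabeled data, then attach a dummy all-zero label
newm <- sparse.model.matrix(~ . - 1, data = new_data)
new_matrix <- xgb.DMatrix(data = as.matrix(newm),
                          label = rep(0, nrow(new_data)))

p <- predict(bst_model, newdata = new_matrix)
```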
Bharatendra Rai thanks sir, will try
I have one question: if you have created sparse matrices for the train and test sets, why are you using as.matrix on trainm in xgb.DMatrix? A sparse matrix can also be used directly. I am confused about xgb.DMatrix and the sparse.model.matrix step before it.
Another question: if the response variable is in position 43 rather than 1, do you still use -1 in the sparse matrix?
Thank you so much for the video, it's really nice, but I have these questions about my dataset. Hoping for your reply, thanks.
For the 1st question, I would suggest trying it and seeing if it works. If it works, then you are fine.
I didn't fully understand the 2nd question. Are you referring to code line 43?
@@bkrai I appreciate your reply. For my dataset, using as.matrix on the sparse.model.matrix output gives me an error, so I am better off using the sparse.model.matrix variable directly in xgb.DMatrix. That is all clear now. You are getting mlogloss, but I was getting merror, with the same parameters as yours.
Hi Sir, I have a confusion: at 4:18 you mentioned putting -1 because "admit" is the first column in the dataset, but according to this blog, www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/, "-1" removes an extra column which this command creates as the first column.
Please confirm.
You are right. Once 'admit' appears before the ~ symbol, it is automatically left out.
I am not able to get $evaluation_log in bst_model. Is there anything I am missing?
Hi Sir, nice and very useful video. I want to ask: when I use the XGBoost algorithm, do I no longer need to use linear and logistic regression?
I want to use the XGBoost algorithm for this problem: www.kaggle.com/c/house-prices-advanced-regression-techniques
It's better to try more methods and then see which one performs better.
@@bkrai okay sir thanks.
@@bkrai Sir u r really great man.
Sir, can you please upload a video on adaptive boosting in R? Thanks in advance.
Thanks for the suggestion, I've added it to my list.
Thanks so much!
Thanks for comments!
Hi, thanks for the video. I have a problem, I think. When I do feature importance, I am also getting the target column in it. My target column is 'dismissed' and I put it in the first column. This is how I am loading it:
train
I think lines 3 to 6 are not needed.
Sir, how can we optimize hyperparameters in the case of the xgboost algorithm?
Refer to this:
czcams.com/video/GmkHvDs0GG8/video.html