Why multicollinearity is a problem | Why is multicollinearity bad | What is multicollinearity

  • Added 25. 07. 2024
  • Why multicollinearity is a problem | Why is multicollinearity bad | What is multicollinearity
    #MulticollinearityInRegression #UnfoldDataScience
    Hello,
    My name is Aman and I am a Data Scientist.
    About this video:
    In this video, I explain in detail why multicollinearity is a problem, why multicollinearity is bad, and what multicollinearity is. I explain with an example what multicollinearity is and what the problem is if we have multicollinearity in our data. I also explain ways to handle multicollinearity in data. The topics below are explained in this video (a short illustrative sketch follows the topic list):
    1. Why multicollinearity is a problem
    2. Why is multicollinearity bad
    3. What is multicollinearity
    4. Problems with multicollinearity
    5. Multicollinearity in regression model
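    For a quick hands-on feel for topics 1 and 4, here is a minimal Python sketch (not from the video; it assumes numpy and scikit-learn are installed and uses synthetic data) showing how regression coefficients become unstable when one predictor is nearly a copy of another:

    ```python
    # Minimal illustration: x2 is almost a copy of x1, so the fitted
    # coefficients swing wildly from run to run, even though only x1
    # truly drives y. (Synthetic data, for illustration only.)
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(42)
    for trial in range(3):
        x1 = rng.normal(size=200)
        x2 = x1 + rng.normal(scale=0.01, size=200)   # near-duplicate of x1
        y = 3 * x1 + rng.normal(size=200)            # true model uses only x1
        model = LinearRegression().fit(np.column_stack([x1, x2]), y)
        print(f"trial {trial}: coefficients = {model.coef_}")
    ```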
    About Unfold Data Science: This channel helps people understand the basics of data science through simple examples, in an easy way. Anybody without prior knowledge of computer programming, statistics, machine learning, or artificial intelligence can get a high-level understanding of data science through this channel. The videos uploaded are not very technical in nature and hence can be easily grasped by viewers from different backgrounds as well.
    If you need Data Science training from scratch, please fill out this form (please note: training is chargeable):
    docs.google.com/forms/d/1Acua...
    Book recommendation for Data Science:
    Category 1 - Must Read For Every Data Scientist:
    The Elements of Statistical Learning by Trevor Hastie - amzn.to/37wMo9H
    Python Data Science Handbook - amzn.to/31UCScm
    Business Statistics By Ken Black - amzn.to/2LObAA5
    Hands-On Machine Learning with Scikit Learn, Keras, and TensorFlow by Aurelien Geron - amzn.to/3gV8sO9
    Category 2 - Overall Data Science:
    The Art of Data Science By Roger D. Peng - amzn.to/2KD75aD
    Predictive Analytics By Eric Siegel - amzn.to/3nsQftV
    Data Science for Business By Foster Provost - amzn.to/3ajN8QZ
    Category 3 - Statistics and Mathematics:
    Naked Statistics By Charles Wheelan - amzn.to/3gXLdmp
    Practical Statistics for Data Scientist By Peter Bruce - amzn.to/37wL9Y5
    Category 4 - Machine Learning:
    Introduction to machine learning by Andreas C Muller - amzn.to/3oZ3X7T
    The Hundred Page Machine Learning Book by Andriy Burkov - amzn.to/3pdqCxJ
    Category 5 - Programming:
    The Pragmatic Programmer by David Thomas - amzn.to/2WqWXVj
    Clean Code by Robert C. Martin - amzn.to/3oYOdlt
    My Studio Setup:
    My Camera : amzn.to/3mwXI9I
    My Mic : amzn.to/34phfD0
    My Tripod : amzn.to/3r4HeJA
    My Ring Light : amzn.to/3gZz00F
    Join Facebook group :
    groups/41022...
    Follow on medium : / amanrai77
    Follow on quora: www.quora.com/profile/Aman-Ku...
    Follow on twitter : @unfoldds
    Get connected on LinkedIn : / aman-kumar-b4881440
    Follow on Instagram : unfolddatascience
    Watch Introduction to Data Science full playlist here : • Data Science In 15 Min...
    Watch python for data science playlist here:
    • Python Basics For Data...
    Watch statistics and mathematics playlist here :
    • Measures of Central Te...
    Watch End to End Implementation of a simple machine learning model in Python here:
    • How Does Machine Learn...
    Learn Ensemble Model, Bagging and Boosting here:
    • Introduction to Ensemb...
    Build Career in Data Science Playlist:
    • Channel updates - Unfo...
    Artificial Neural Network and Deep Learning Playlist:
    • Intuition behind neura...
    Natural Language Processing playlist:
    • Natural Language Proce...
    Understanding and building recommendation system:
    • Recommendation System ...
    Access all my codes here:
    drive.google.com/drive/folder...
    Have a different question for me? Ask me here : docs.google.com/forms/d/1ccgl...
    My Music: www.bensound.com/royalty-free...

Comments • 162

  • @dariakrupnova6245 · 2 years ago +2

    Wow, I think I owe you my mark on the Econometrics final, you blew my mind, I had no idea it was so simple. Thank you!

  • @swatikute219 · 3 years ago +9

    If x1 and x2 are strongly correlated, then we should check their individual correlations with the target and select the variable which is more highly correlated with the target; we can also check the p-values for the variables.
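    A minimal pandas sketch of this suggestion (the DataFrame df and the column names x1, x2 and y are hypothetical):

    ```python
    # Keep whichever of two correlated features has the stronger absolute
    # correlation with the target; return the weaker one for removal.
    import pandas as pd

    def weaker_feature(df: pd.DataFrame, f1: str, f2: str, target: str) -> str:
        c1 = abs(df[f1].corr(df[target]))
        c2 = abs(df[f2].corr(df[target]))
        return f1 if c1 < c2 else f2

    # Hypothetical usage: df = df.drop(columns=weaker_feature(df, "x1", "x2", "y"))
    ```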

  • @sangeethasaga · 5 months ago

    Never seen someone with such a clear understandable explanation...thank you so much!

  • @swatikute219 · 3 years ago +6

    Amazing pace, crisp word selection and good examples, thank you Aman for great videos !!

  • @jhonatangilromero2311 · 10 months ago

    It is evident that a lot of work goes into developing these very informative videos. Thank you!

  • @koustavdutta1176 · 3 years ago +16

    Firstly, great explanation!! Now, coming to your question: we have to check the bivariate strength between the dependent variable and each independent variable. The independent variable with the weakest strength should be chosen for removal from the model.

  • @sanjeevkmr5749 · 3 years ago +17

    Thanks a lot for the detailed discussion on this topic. For the question asked in the video (which feature to remove in case of high correlation), I guess among the two, we have to remove the one which contributes least (is less correlated) to the target variable. That way, we will be able to preserve the feature which has the higher contribution.

    • @UnfoldDataScience · 3 years ago +2

      Thanks Sanjeev. True.

    • @babareddy44 · 3 years ago +1

      How do we know which contributes least, help?

    • @arslanshahid3454 · a year ago

      @@babareddy44 From R², F-value, or p-value?

    • @beautyisinmind2163 · a year ago

      @@babareddy44 You can use a random forest model to see the importance of the features that contribute the most.
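      One possible sketch of that idea with scikit-learn (synthetic data stands in for a real dataset; note that impurity-based importances can be diluted across strongly correlated features):

      ```python
      # Rank features by random forest importance; the less important of
      # two correlated features is a candidate for removal.
      from sklearn.datasets import make_regression
      from sklearn.ensemble import RandomForestRegressor

      X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)
      forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
      for i, imp in enumerate(forest.feature_importances_):
          print(f"feature {i}: importance = {imp:.3f}")
      ```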

  • @manavgora · 5 months ago

    great, easily understandable

  • @shahbazkhalilli8593 · 3 months ago +1

    I don't know which one I should take. By the way, the video is great.

  • @datafuturelab_ssb4433 · 3 years ago +2

    Great explanation sir. Thanks for sharing and making my fundamentals strong.

  • @KastijitBabar · a month ago

    The best explanation on the whole of YouTube! Thank you.

  • @datapointpune6216 · 3 years ago +1

    Very informative, Aman.

  • @ChenLiangrui · a month ago

    awesome video! very clear and beginner friendly, no broken train of thought, very problem-focused

  • @roshinidhinesh5490 · 2 years ago +1

    Such a great explanation sir.. Thanks a lot!

  • @zakiaa7464 · 9 months ago

    You are a genius. Thanks

  • @allaboutstat1103 · 3 years ago +1

    thanks for clear explanation and God bless!

  • @ugwukelechi9476 · a year ago

    You are a great teacher! I learnt something new today.

  • @Bididudy_ · a year ago +1

    Thank you for the detailed explanation. I tried this concept from other channels but it was a bit difficult to get it. Your way of explaining terms is very simple, which helps in understanding the subject. Really glad that I visited your channel.👍

  • @muhammadaliabid5793 · 3 years ago

    Thank you for the excellent explanation. I have a few questions please:
    1. I used the polynomial features method in sklearn and it significantly improved the accuracy of my linear regression prediction model, but I found that the newly created features are correlated with the existing features, since I created squares and cubes! I understand as per your explanation that this will lead to a multicollinearity problem, and so the coefficients are not the true picture. However, can I use this type of model for predictions?
    2. What would you suggest as the threshold correlation value for multicollinearity?
    Thanks

  • @umeshrawat8827 · 9 months ago +2

    To omit either X1 or X2, we can use PCA and remove the variable with low variance.

  • @suryadhakal3608 · 3 years ago

    Great.

  • @samruddhideshmukh5928 · 3 years ago +4

    Simple, clear and amazing explanation!!!
    I think we can remove one of the columns by looking at the p-value. If p > 0.05 then we fail to reject the null hypothesis for that variable, and thus that coefficient value will be equal to 0; hence that variable will not contribute significantly.
    Sir, please do make a video on how to use Ridge/Lasso regression to handle multicollinearity.
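    A short statsmodels sketch of the p-value check described above (synthetic data; note that under strong collinearity both slopes can show large p-values even when the predictors jointly matter):

    ```python
    # Fit OLS and inspect per-coefficient p-values; a variable with p > 0.05
    # fails to reject the null hypothesis that its coefficient is zero.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.05, size=100)   # highly correlated with x1
    y = 2 * x1 + rng.normal(size=100)
    X = sm.add_constant(np.column_stack([x1, x2]))
    ols = sm.OLS(y, X).fit()
    print(ols.pvalues)
    ```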

    • @UnfoldDataScience · 3 years ago +1

      Thanks Samruddhi,
      the videos you asked for:
      czcams.com/video/7XvBwQeT9OI/video.html
      czcams.com/video/21TgKhy1GY4/video.html

  • @DataScience111 · 3 years ago +1

    Best explanation... keep up the good work.

  • @abdulhaseebshah9109 · a year ago +1

    Amazing explanation Aman. I have a question: are VIF and auxiliary regression both used to detect multicollinearity?

  • @RamanKumar-ss2ro · 3 years ago +1

    Great content.

  • @YourRandomVariable · 3 years ago +1

    Hi Aman, what should we do when the constant term's p-value is high? Mostly I see that people keep it without worrying about it. Could you please give an explanation for this?

  • @csprusty · 2 years ago

    We can create and compare two models, based on choosing each of the correlated explanatory variables one at a time, and select the model having the better R-squared value.
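    A minimal sketch of this comparison (synthetic data; in practice x1 and x2 would be your correlated explanatory variables):

    ```python
    # Fit one simple model per correlated feature and keep the one whose
    # model scores the higher R-squared.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=150)
    x2 = x1 + rng.normal(scale=0.1, size=150)    # correlated with x1
    y = 2 * x1 + rng.normal(size=150)

    r2 = {name: LinearRegression().fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)
          for name, x in (("x1", x1), ("x2", x2))}
    print(r2, "-> keep", max(r2, key=r2.get))
    ```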

  • @shivamthakur4079 · 3 years ago +1

    Really loved what you said, sir. I can say that you have a great way of explaining concepts. I can blindly follow you, sir.

  • @prateeksachdeva1611 · a year ago

    excellent explanation

  • @smegala3815 · a year ago +1

    Thank you sir... Best explanation

  • @harshadbobade2200 · 2 years ago

    Simple and to-the-point explanation 🤘

  • @nurlanimanov9503 · 3 years ago

    Hello sir, after reading the comments I saw the answer to your question. They said we have to remove the one which has the lower correlation coefficient with the target variable according to the correlation matrix. It confused me at one point. Can we say that the coefficients in front of each feature, which we get after running the regression model, indicate the impact of each feature on the target? So, I mean, can I use these coefficients when deciding which feature to remove between two correlated features, instead of using the correlation matrix value with the target variable? Can we say that the coefficients in front of each feature actually say the same thing as the values in the correlation matrix with the target variable in this context?

  • @bezagetnigatu1173 · 2 years ago

    Thank you!

  • @sudheeshe1384 · 3 years ago +1

    You always rock :)

  • @shanmukhchandrayama8508 · 2 years ago +1

    Aman, your videos are great. But there are many videos which have some connection with each other, so can you please make a video saying in which order to follow the playlists to learn machine learning from the basics? It would be really helpful 😅

  • @hakimandishmand1068 · 2 years ago

    Good and perfect

  • @mariapramiladcosta1972

    Sir, if there are 3 predictors and one dependent variable, and all three independent variables are highly correlated, then which type of regression model can be used? Multiple regression cannot be used, right? Can we use linear regression? Are a tolerance of .1 and a VIF less than 10 not good enough to indicate that there is no multicollinearity?
    For your question, I think the weakly correlated one should be removed.

  • @arshiyasaba2259 · 2 years ago +1

    If the value is less than the threshold value of 0.5/0.7, as per what the reference suggests, then we can remove those variables.

  • @bhavanichatrathi7435 · 3 years ago

    Hi Aman, it's a very good explanation... please do a video on penalised regression like Lasso, Ridge and Elastic Net. There is too much mathematics in those; please explain it in a simple way. Thank you

  • @shafeeqaabdussalam6195 · 3 years ago +1

    Thank you

  • @sandipansarkar9211 · 2 years ago

    finished watching

  • @shadow82000 · 3 years ago +7

    If X1, X2 have high correlation, can I choose to drop the X with the lower correlation to Y, based on the correlation matrix?

  • @ethiodiversity-1184 · 2 years ago

    great explanation

  • @user-hy8me5cz2l · 3 years ago

    Sir, whoever has given you whatever answer, everyone's answer is correct; you are saying yes to everyone.

  • @faozanindresputra3096 · a year ago +1

    Will multicollinearity be a problem in correlation analysis too, where we just focus on finding which variables correlate, not on regression, like in PCA?

  • @ShubhamSharma-zb9uh · 3 years ago

    09:11 The variable with the higher coefficient value is the one we have to consider for analysis.

  • @nivednambiar6845 · 4 months ago

    Hi Aman, hope you are doing well!
    I want to ask one thing: the regression models you are mentioning are linear models, right, not the tree-based regression models, am I correct?
    Does multicollinearity affect the tree-based models?

  • @bijaynayak6473 · a year ago

    Which one should we eliminate? Check the VIF of each feature, setting the threshold at >5.

  • @ameerrace2284 · 3 years ago

    Great video. Please create a video on the Python implementation of Lasso and Ridge regression.

  • @anmolpardeshi3138 · 3 years ago

    Regarding the question of which variable to remove out of a set of highly correlated variables: can this be answered by PCA (principal component analysis)? Or will PCA weight them the same because they are highly correlated?

  • @ashulohar8948 · a year ago

    Please, please make a video on how to select the drivers in linear regression which drive the sales.

  • @RAJANKUMAR-mi1ib · 2 years ago

    Hi... Thanks for the nice explanation. I have a question: is multicollinearity a problem for linear regression only? If not, then how is it a problem for non-linear regression?

  • @sudhirnanaware1944 · 3 years ago +1

    Hi Aman,
    As per my knowledge, we can use the VIF (Variance Inflation Factor), a heatmap, or the corr() function to detect multicollinearity. Please confirm any other techniques.
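    For reference, a minimal VIF sketch with statsmodels (synthetic data; a common rule of thumb flags VIF above roughly 5-10):

    ```python
    # Compute each predictor's VIF; VIF grows as a predictor becomes more
    # linearly explainable by the others (the 'const' row can be ignored).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    X = pd.DataFrame({"x1": x1,
                      "x2": x1 + rng.normal(scale=0.1, size=100),  # collinear
                      "x3": rng.normal(size=100)})
    X_const = sm.add_constant(X)
    for i, col in enumerate(X_const.columns):
        print(col, round(variance_inflation_factor(X_const.values, i), 2))
    ```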

    • @UnfoldDataScience · 3 years ago +1

      Yes Sudhir; apart from these, some other regression techniques can be used.

    • @sudhirnanaware1944 · 3 years ago

      Thanks Aman, may I know the regression techniques to remove multicollinearity? I will definitely learn them and they will be helpful for me.

  • @prateeksachdeva1611 · a year ago

    We will drop the feature from the model whose correlation with the dependent variable is lower compared to the other one.

  • @nurlanimanov9503 · 3 years ago +1

    Hello sir! Firstly, thank you for the video!
    I have 2 questions; if you answer I will be glad:
    1) Can we say that we don't need to be concerned about correlated features in, for example, decision tree-based models? I mean, do we need this concept only in linear-based models?
    2) Do we need to touch correlated features when we use Lasso or Ridge regression? Will the model handle that by itself in that case?

    • @UnfoldDataScience · 3 years ago +1

      1. This is a problem for regression-based models, where coefficients come into the picture.
      2. Still, you need to take care.

    • @hemanthkumar42 · 3 years ago

      @@UnfoldDataScience From your first answer, then why is multicollinearity not a problem in a neural network? Please make a video regarding this, sir...

    • @saurabhagrawal9874 · 2 years ago +3

      @@hemanthkumar42 Note that multicollinearity does not affect the prediction accuracy of linear regression; it only makes the interpretation harder. Mostly it is for interpretation that we go to linear regression, whereas when we go to a neural network we already know it's a type of black box and we don't want to interpret it, we just want good prediction results. That's why we don't bother about multicollinearity in a neural network.

  • @jaheerkalanthar816 · 2 years ago

    I think the variable which correlates highly with the target variable.

  • @sidrahms7458 · 3 years ago

    Awesome explanation. I have a question: if I have nominal, ordinal and continuous variables, how can I find multicollinearity among them?

    • @UnfoldDataScience · 3 years ago

      Hi Sidrah, answered.

    • @sidrahms7458 · 3 years ago

      I can't find your answer. I understand that we should use VIF for continuous variables, but what if I need to see the correlation among all of the ordinal, numeric and nominal ones?

  • @atomicbreath4360 · 3 years ago +1

    Sir, can you give some ideas on how to know which types of ML models are affected by multicollinearity?

  • @akhileshgandhe5934 · 3 years ago

    Hi Aman, I have 9 categorical and 6 numerical columns and it's a regression problem.
    I can find the correlation between the numerical ones using a correlation heatmap, but how do I find the relation between the categorical ones?
    Can I use the chi-square test?
    When I use it, I get that all 9 categorical columns are dependent on each other. So what should my next step be?
    Please guide me.
    Thanks

    • @UnfoldDataScience · 3 years ago +1

      Yes, chi-square can be used; I have a dedicated video on the same topic.
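      A small scipy sketch of the chi-square check mentioned here (synthetic categorical columns; a small p-value suggests the two categoricals are associated):

      ```python
      # Build a contingency table from two categorical columns and run the
      # chi-square test of independence.
      import pandas as pd
      from scipy.stats import chi2_contingency

      df = pd.DataFrame({"cat_a": ["red", "red", "blue", "blue", "red", "blue"] * 10,
                         "cat_b": ["hi", "hi", "lo", "lo", "hi", "lo"] * 10})
      table = pd.crosstab(df["cat_a"], df["cat_b"])
      chi2, p, dof, expected = chi2_contingency(table)
      print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
      ```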

  • @beautyisinmind2163 · a year ago

    Can we also remove highly negatively correlated features, or not? Someone reply, please.

  • @AMVSAGOs · 3 years ago

    Great explanation...
    At 7:50 you said "that's why we should not have multicollinearity in regression". So, is it okay if we have multicollinearity in classification? Could you please make it clear?

    • @UnfoldDataScience · 3 years ago +1

      When I say that, I mean the regression family of algorithms, logistic regression included.

    • @AMVSAGOs · 3 years ago

      @@UnfoldDataScience Thank you Aman Sir

  • @sriadityab4794 · 3 years ago +1

    Do we need to remove multicollinearity while building a time series model?

  • @sharadpkumar · 2 years ago

    Hi Aman, nice work, keep it up... I have a doubt: why is the normal distribution so important? Why do we need our independent variable to show a normal distribution for a good model? I am not finding a satisfying answer. Can you please help?

    • @UnfoldDataScience · 2 years ago +2

      Hi Sharad, in simple language, it's easy for the model to learn a pattern if you give examples from a large range of values (that is your normal distribution).
      Take the example below:
      Predict the salary of an individual (Y, the target) based on his/her expenses (X variable).
      Scenario 1 - In your training set you have Y values like 10 LPA, 15 LPA, 20 LPA. Here the model won't be able to learn the pattern for the 3 LPA people; maybe there is a difference in the income/expense pattern for junior people.
      Scenario 2 - You give many values of Y from all over, like 2 LPA, 4 LPA, 5 LPA, 100 LPA, as if they are normally distributed.
      Here it's easy for the model to learn the pattern, as it sees a range of values, and the resulting model will be more reliable.
      Hope it's clear now.

    • @sharadpkumar · 2 years ago

      @@UnfoldDataScience Thanks for the clarification. Does a huge dataset always show a normal distribution?

    • @UnfoldDataScience · 2 years ago +1

      No, not always...it depends on data

  • @trushnamayeenanda5431

    The independent variable with higher correlation among the similar factors should be removed

  • @squadgang1678 · a year ago

    I will find the correlation between x1 and y, and between x2 and y, individually, and see which one is lower; the one with the lower correlation I will delete.

  • @kar2194 · 2 years ago

    Sorry, so it means that when there is multicollinearity between, for example, x2 and x3, if I increase x2, x3 will automatically increase? Great video by the way!

  • @KumarHemjeet · 3 years ago

    Remove the feature which has less correlation with the target.

  • @karthikganesh4679 · 3 years ago

    Sir, please do a video on post-pruning a decision tree.

  • @khoaanh7375 · 4 months ago +1

    this shit is pure gold

  • @user-hy8me5cz2l · 3 years ago +1

    Thanks sir ji, I think we remove the unnecessary variable.

  • @kunalchakraborty3037 · 3 years ago

    My questions:
    1. Is multicollinearity a concern for predictive modeling? I mean, is the prediction altered by neglecting this phenomenon or not?
    2. In the case of a GAM, do we have to worry about multicollinearity?
    3. How does collinearity inflate the variance?

    • @UnfoldDataScience · 3 years ago

      Thanks Kunal for asking. The answer to the first question is that the prediction will not be impacted much; however, the coefficients will be impacted.
      The 2nd and 3rd I will cover in another video.

    • @kunalchakraborty3037 · 3 years ago

      @@UnfoldDataScience thanks 👍. Really appreciate your videos.

  • @datafuturelab_ssb4433 · 3 years ago +2

    Remove the variable which has a low impact on the target variable...
    Sir, I have a few questions:
    1. If there is multicollinearity in a classification problem, how do we handle that?
    2. What is VIF, and how is standardization done?
    3. Can we use StandardScaler in a regression problem?

    • @UnfoldDataScience · 3 years ago +1

      There are three questions there; I will cover them in a separate video. Thanks for asking.

  • @omkarlokhande3692 · 9 months ago

    Sir, what should we do if multicollinearity is affecting a binary classification problem?

    • @UnfoldDataScience · 9 months ago

      There are many ways to take care of it; I have discussed them in the classification videos.

  • @hemanthkumar42 · 3 years ago +1

    Is multicollinearity a problem for a neural network?

  • @rohitnalage6366 · a year ago

    Sir, please explain Lasso and Ridge; if you have made videos on them, please share the links.

    • @UnfoldDataScience · a year ago

      czcams.com/video/7XvBwQeT9OI/video.html
      czcams.com/video/21TgKhy1GY4/video.html

  • @anirudhchandnani9917 · 3 years ago

    Hi Aman,
    Could you please make a detailed video explaining the difference between Gradient Boosting, AdaBoost and Extreme Gradient Boosting?
    Why is AdaBoost called adaptive? Is it only because it edits the weights of the misclassified instances? XGBoost and gradient boosting are also adaptive in that way, aren't they?
    Also, why are XGBoost and GBoost more robust to outliers than AdaBoost, despite all of them having a log term in their loss functions?
    Would really appreciate your reply.
    Thanks

  • @salajmondal3437 · 3 months ago

    Should I check multicollinearity for a classification problem?

    • @UnfoldDataScience · 3 months ago +1

      For logistic regression - yes.

    • @salajmondal3437 · 3 months ago

      @@UnfoldDataScience Is it necessary to check multicollinearity between categorical features, or between numerical and categorical features?

  • @ahmad3823 · 4 months ago

    at least two variables!

  • @rafibasha4145 · 2 years ago

    Multicollinearity is a problem in classification as well, right? @3:57

  • @sreejadas4417 · a year ago

    I want to be a data analyst, but I want sequential courses from you; please guide.

  • @naziakhatoon3058 · 3 years ago

    Whichever is less correlated is the one to remove.

  • @sujithreddy1599 · 2 years ago

    It depends on feature importance: the feature with less importance will be dropped.
    Correct me if I am wrong :0

  • @squadgang1678 · a year ago

    Is machine learning better than deep learning, or deep learning better than machine learning?

    • @UnfoldDataScience · a year ago

      Depends on the problem statement, data availability, infra availability etc.; can't say one is better than the other.

    • @squadgang1678 · a year ago

      @@UnfoldDataScience oh ok got it ✌️

  • @tesfayesime9434 · a year ago

    Neither x1 nor x2.