Machine Learning in Python: Building a Linear Regression Model

  • Added on 31 March 2020
  • In this video, I will be showing you how to build a linear regression model in Python using the scikit-learn package. We will be using the Diabetes dataset (built-in data from scikit-learn) and the Boston Housing (download from GitHub) dataset.
    🌟 Buy me a coffee: www.buymeacoffee.com/dataprof...
    📎CODE: github.com/dataprofessor/code...
    ⭕ Playlist:
    Check out our other videos in the following playlists.
    ✅ Data Science 101: bit.ly/dataprofessor-ds101
    ✅ Data Science YouTuber Podcast: bit.ly/datascience-youtuber-p...
    ✅ Data Science Virtual Internship: bit.ly/dataprofessor-internship
    ✅ Bioinformatics: bit.ly/dataprofessor-bioinform...
    ✅ Data Science Toolbox: bit.ly/dataprofessor-datascie...
    ✅ Streamlit (Web App in Python): bit.ly/dataprofessor-streamlit
    ✅ Shiny (Web App in R): bit.ly/dataprofessor-shiny
    ✅ Google Colab Tips and Tricks: bit.ly/dataprofessor-google-c...
    ✅ Pandas Tips and Tricks: bit.ly/dataprofessor-pandas
    ✅ Python Data Science Project: bit.ly/dataprofessor-python-ds
    ✅ R Data Science Project: bit.ly/dataprofessor-r-ds
    ⭕ Subscribe:
    If you're new here, it would mean the world to me if you would consider subscribing to this channel.
    ✅ Subscribe: czcams.com/users/dataprofessor...
    ⭕ Recommended Tools:
    Kite is a FREE AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I've been using Kite and I love it!
    ✅ Check out Kite: www.kite.com/get-kite/?...
    ⭕ Recommended Books:
    ✅ Hands-On Machine Learning with Scikit-Learn : amzn.to/3hTKuTt
    ✅ Data Science from Scratch : amzn.to/3fO0JiZ
    ✅ Python Data Science Handbook : amzn.to/37Tvf8n
    ✅ R for Data Science : amzn.to/2YCPcgW
    ✅ Artificial Intelligence: The Insights You Need from Harvard Business Review: amzn.to/33jTdcv
    ✅ AI Superpowers: China, Silicon Valley, and the New World Order: amzn.to/3nghGrd
    ⭕ Stock photos, graphics and videos used on this channel:
    ✅ 1.envato.market/c/2346717/628...
    ⭕ Follow us:
    ✅ Medium: bit.ly/chanin-medium
    ✅ FaceBook: / dataprofessor
    ✅ Website: dataprofessor.org/ (Under construction)
    ✅ Twitter: / thedataprof
    ✅ Instagram: / data.professor
    ✅ LinkedIn: / chanin-nantasenamat
    ✅ GitHub 1: github.com/dataprofessor/
    ✅ GitHub 2: github.com/chaninlab/
    ⭕ Disclaimer:
    Recommended books and tools are affiliate links that give me a portion of sales at no cost to you, which will contribute to the improvement of this channel's content.
    #dataprofessor #regression #linearregression #scikit #scikitlearn #sklearn #prediction #jupyternotebook #jupyter #googlecolab #colaboratory #notebook #machinelearning #datascienceproject #randomforest #decisiontree #svm #neuralnet #neuralnetwork #supportvectormachine #python #learnpython #pythonprogramming #datascience #datamining #bigdata #datascienceworkshop #dataminingworkshop #dataminingtutorial #datasciencetutorial #ai #artificialintelligence #tutorial #dataanalytics #dataanalysis #machinelearningmodel
  • Science & Technology

Comments • 125

  • @adir9290
    @adir9290 3 years ago +35

    Hello Prof, I want to thank you for putting together training videos like this one. I have learned more than I have in the last 2 months of my data science MSc programme. You explained every line of code, every symbol and the reason behind every style of coding; that is what is called knowledge impartation. Thank you very much.

  • @akbaraliotakhanov1221
    @akbaraliotakhanov1221 3 years ago +14

    Thanks so much, this is what Linear Regression actually is and how we apply it to our dataset.
    Please also make videos about applying Logistic Regression, KNN, Random Forest, SVM, Naive Bayes, and Decision Trees to our dataset using Python.
    Very interesting and clear

  • @edpalen5295
    @edpalen5295 4 years ago +7

    Every video you have posted provides value to the audience. Outstanding job. I hope your channel grows exponentially, as it deserves.

    • @DataProfessor
      @DataProfessor 4 years ago +1

      Thank you Edwin for the encouraging words 😃

    • @lucusp
      @lucusp 3 years ago +1

      @DataProfessor Might seem silly but building a model on YT growth would be interesting :)

  • @dca374
    @dca374 3 years ago +9

    This was great thank you so much! Really useful and looking forward to using it in my research.

  • @HealthOnMyMind
    @HealthOnMyMind 3 years ago +2

    This is exactly what I was looking for, thank you so much this was such a big help!!!

  • @zoro8117
    @zoro8117 2 years ago

    Your way of describing things is really helpful to me. Thanks a lot for your videos.

  • @ramblingman4733
    @ramblingman4733 3 years ago +3

    You are the best professor at explaining, thanks for your content!

  • @CapitanJusticia
    @CapitanJusticia 3 years ago +11

    Nice video, but how do we interpret the results? IOW, what would be the deliverable to our stakeholders? What are the actual predictions?

  • @marcofestu
    @marcofestu 4 years ago +2

    Glad to have more of your videos to watch than usual 😍

    • @DataProfessor
      @DataProfessor 4 years ago +1

      Thanks Marco, glad to hear that😃

    • @DataProfessor
      @DataProfessor 4 years ago +1

      @wise guy That's a great question. I might make a future video dedicated to this topic. In the meantime, there are several other linear models that can be computed by scikit-learn package.
      scikit-learn.org/stable/modules/linear_model.html
      The coeff and intercept can be summarized below:
      Y = m1*x1 + m2*x2 + .... + mn*xn + b
      where Y is the dependent variable
      x1, x2, ..., xn are the independent variables
      m1, m2, ..., mn are the regression coefficients
      b is the Y-intercept
      A bit more about the b value: it is the point where the regression line crosses the Y-axis, i.e. the predicted Y when all x values are zero. Also, the coefficients tell us the relative importance of the independent variables, provided the variables are on comparable scales.
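
A minimal sketch of how the equation above maps onto scikit-learn, assuming the built-in Diabetes dataset mentioned in the video description and a plain LinearRegression fitted on all rows (no train/test split, for brevity). The variable names are illustrative and not taken from the video's notebook.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load the 10-feature Diabetes dataset as NumPy arrays.
X, Y = load_diabetes(return_X_y=True)

model = LinearRegression()
model.fit(X, Y)

m = model.coef_        # regression coefficients m1..mn, one per feature
b = model.intercept_   # Y-intercept b

# Apply Y = m1*x1 + ... + mn*xn + b by hand to the first five rows.
manual = X[:5] @ m + b
from_model = model.predict(X[:5])

print("Coefficients:", m)
print("Intercept:", b)
print("Manual and model predictions agree:", np.allclose(manual, from_model))
```

The hand-computed values and model.predict should agree, which is one way to convince yourself that coef_ and intercept_ are exactly the m and b of the equation.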

  • @ensyw5971
    @ensyw5971 3 years ago +1

    Wonderful presentation! Though I'm struggling a bit with the loss function and the training/iteration principle. How does this work exactly?
    For your first example using the diabetes dataset, I would like to train the data/iterate the data 1000 times, and thereby plot the loss function over a 2-dimensional grid at every 100, 300, 700 and 1000 iteration. How exactly would you do this? Thank you!

  • @abdullahsaeed3437
    @abdullahsaeed3437 2 years ago

    11:50 What is the graph showing? Just dots? What do these dots mean?

  • @sanjaypandey6586
    @sanjaypandey6586 2 years ago

    Is it OK in linear regression (single variable) if the dependent and independent variables are not normally distributed? If not, what would be the best solution for negative skew and negative kurtosis?

  • @iuliatomescu134
    @iuliatomescu134 2 years ago +2

    Awesome tutorial!! Thank you!

  • @ayo4757
    @ayo4757 2 years ago

    Hi! Why don't you use StandardScaler for the features? Is it not necessary?

  • @RealThrillMedia
    @RealThrillMedia 1 year ago

    Very well explained, thank you!

  • @dr.navidsoltani4326
    @dr.navidsoltani4326 3 years ago +1

    If one trains on 100% of the data (skipping split/train/test), does sklearn's lin/logreg implementation basically become the same 'classic' implementation as statsmodels or glm (in R)?

  • @akshatkumarjain
    @akshatkumarjain 3 years ago

    Can you tell me whether the coefficients are giving us the weight values?
    What are the weight values here?

  • @iswinternear
    @iswinternear 1 year ago

    Thank you! This video was such a big help.

  • @titiQd
    @titiQd 2 years ago +1

    Hi @Data Professor, may I ask about minutes 9:50 to 10:00, where you explain the modulo operator? I am confused about 0.523810833536016: where does that number come from? I keep replaying your video but still don't get where that float comes from. At the moment I am doing some assignments/projects and using your YT tutorial as guidance to help me grasp linear regression. Thank you.

  • @Data_Man
    @Data_Man 10 months ago

    Just found your channel. Thank you from a fellow 🇹🇭

  • @mahathomorogo5625
    @mahathomorogo5625 2 years ago

    Bless your heart, Data Professor.

  • @michaeloladunjoye5258
    @michaeloladunjoye5258 4 years ago +3

    Beautiful presentation. Thank you sir.

  • @barbaramenesesvega8494
    @barbaramenesesvega8494 3 years ago +2

    Wonderful!!! Thank you very much for sharing this video :D

  • @RM-lb7xw
    @RM-lb7xw 3 years ago +3

    Great video, looking forward to more videos like this. Also, can you tell me what the R2 score tells us about the model?

    • @DataProfessor
      @DataProfessor 3 years ago +3

      Thanks for watching. R^2 (the coefficient of determination) is a goodness-of-fit measure that tells us the relative performance of a regression model. It is computed from the actual and predicted values, and a value approaching 1 suggests good performance.
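
As a rough illustration of how R^2 is computed from actual and predicted values, here is a small pure-Python sketch using the standard definition 1 - SS_res / SS_tot. The function name r_squared and the toy numbers are made up for this example; in practice sklearn.metrics.r2_score computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far predictions are from the actual values.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: how far actual values are from their own mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # predictions identical to the actual values
baseline = [6.0, 6.0, 6.0, 6.0]   # always predicts the mean of the actual values

print(r_squared(actual, perfect))   # 1.0
print(r_squared(actual, baseline))  # 0.0
```

A model that does no better than always predicting the mean scores 0, and a perfect model scores 1, which is why values approaching 1 are read as good performance.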

  • @emersoncarlospedrino6381
    @emersoncarlospedrino6381 2 years ago +2

    Thank you. Excellent explanation! :)

  • @_GayatriShetkar
    @_GayatriShetkar 2 years ago +1

    Can we call it a multiple regression model,
    as we're predicting a value from multiple parameters?

  • @srinivasmalvadkar1825
    @srinivasmalvadkar1825 3 years ago +3

    This was really helpful, the most wonderful of all the videos.
    Thank you so much, sir.

  • @sadhnamall8475
    @sadhnamall8475 6 months ago +1

    Thank you so much sir🙏🙏

  • @FredericBiondi
    @FredericBiondi 4 years ago +2

    So excellent. Thank you so much

    • @DataProfessor
      @DataProfessor 4 years ago

      Thanks Frederic for the comment and kind words!

  • @tommytan8571
    @tommytan8571 4 years ago +2

    I'm usually too lazy to comment, but this video is very delicious. Keep up the good work, just subscribed.

    • @DataProfessor
      @DataProfessor 4 years ago

      Welcome to the channel, it is certainly nice to hear that, thanks for the kind words 😃

  • @ektasingh6284
    @ektasingh6284 3 years ago +2

    Thank you for making this video. It is very helpful. 👍

    • @DataProfessor
      @DataProfessor 3 years ago +1

      It’s a pleasure, glad it is helpful 😆

    • @ektasingh6284
      @ektasingh6284 3 years ago

      Could you please explain what the root mean square error (Root-MSE) tells us about the model? Somebody explained to me that the larger the gap between R^2 and the root mean square error, the better the model is at predicting the effect of the independent variables on the output. But the question is, how much of a gap is good enough? Or is there a better interpretation of Root-MSE?

  • @dreamphoenix
    @dreamphoenix 3 years ago +2

    Fantastic video, thank you.

  • @nationhlohlomi9333
    @nationhlohlomi9333 1 year ago +1

    Thank you 👨‍🏫 prof

  • @matthewjaworski4115
    @matthewjaworski4115 3 years ago +1

    Hello. I am confused about what data is being held in X_train and Y_train. I have only done linear regression with 2 variables before and I am confused about why a 353x10 matrix is being held in X_train and why a 353x1(?) matrix is being held in Y_train. Is Y_train a placeholder for 353 regression line y values that get produced after the 10 variable coefficients are calculated and made into a function? Or is the algorithm solving an overdetermined system of 353 equations with 10 unknowns using linear algebra: (y1=b0 + b1x1...) . . . (yn=b0 + bnxn...)?

    • @fulton123
      @fulton123 2 years ago

      X_train and Y_train hold 80% of the input data (refer to the data split section).
      In the Boston Housing example, X_train is a 404×13 matrix: 80% of the 506 rows gives 404 rows, and it has all 13 features (the Y column 'medv' was dropped). For the Diabetes dataset the same logic gives 353×10 (80% of 442 rows, 10 features).
      Y_train is the matching 404×1 (or 353×1) column of Y values ('medv' in the Boston example). It is used to train the model, which can then produce Y_pred later on.
      @Data Professor
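
To make the shapes discussed in this thread concrete, here is a short sketch, assuming the built-in Diabetes dataset and an 80/20 split with train_test_split as described in the video. The variable names follow the comment above; the exact call in the video's notebook may differ.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# 442 patients (rows) and 10 features (columns); Y holds one target per row.
X, Y = load_diabetes(return_X_y=True)
print(X.shape, Y.shape)              # (442, 10) (442,)

# Hold out 20% of the rows for testing; the remaining 80% are used for training.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
print(X_train.shape, Y_train.shape)  # (353, 10) (353,)
print(X_test.shape, Y_test.shape)    # (89, 10) (89,)
```

So Y_train is not a placeholder: it holds the 353 known target values, one for each row of X_train, and the model is fitted so that its predictions for those rows are as close as possible to them.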

  • @maruf5943
    @maruf5943 4 years ago +2

    Thank you, sir, for making this so easy :)
    #HappyLearning

    • @DataProfessor
      @DataProfessor 4 years ago

      Thanks for watching and glad it was helpful 😃

  • @angelaluchi1008
    @angelaluchi1008 2 years ago

    Thank you for the video. I have a question: what is the difference between print(diabetes.DESCR) and only diabetes.DESCR? Thank you.

  • @gkalyankumar1263
    @gkalyankumar1263 2 years ago +2

    Great tutorials

  • @alcidesrivarola5390
    @alcidesrivarola5390 3 years ago +2

    Thank you for the great video! The explanation is very clear. By the way, what software do you use to make the videos?

  • @CSAura
    @CSAura 1 year ago

    I really learned a lot from this video! Most amazing data professor ever! I was just wondering: in my case I only need to compare 5 machine learning algorithms on a worldwide dataset like CICIDS2017 or KDD. Could you please post a video about it? That would be amazing if possible, thank you so much.

  • @CodeGeeks-dz1ro
    @CodeGeeks-dz1ro 5 months ago

    Thank you.

  • @lisitashamatutu1140
    @lisitashamatutu1140 2 years ago +1

    Thanks, professor.

  • @AnotherproblemOn
    @AnotherproblemOn 3 years ago +1

    Thank you so much

  • @abdullahhatem8057
    @abdullahhatem8057 4 months ago

    thank you

  • @yong2happy
    @yong2happy 2 years ago +1

    Awesome, professor.

  • @jackjohn6532
    @jackjohn6532 3 years ago +1

    Is using r2 a bad evaluation metric for linear regression? If the r2 value is really bad (like above one or like 10%) does that mean the model is not useful or is it still useful?

    • @DataProfessor
      @DataProfessor 3 years ago +1

      Typically, the rule of thumb that I and other researchers use is that an R^2 above 0.6 for the training set and above 0.5 for the test set is considered really good performance. Anything lower may mean that the model has not captured the X-Y relationship; sometimes exploring feature engineering may help. Hope this helps.

  • @jongcheulkim7284
    @jongcheulkim7284 3 years ago +1

    Thank you ^^

  • @Mayglie
    @Mayglie 3 years ago

    Hi prof, may I ask about the last stage, the scatter plot for the Boston housing model: does the x-axis represent the y_test values and the y-axis the y_pred values? How do I evaluate the model from the scatter plot? Could you explain more about the plt representation? Thank you, sir!

    • @DataProfessor
      @DataProfessor 3 years ago +1

      Hi Mayglie, for sure. I have written a Medium article in Towards Data Science where one of the sections explains the Python code for making the scatter plot line by line. I also drew an infographic (towards the end of the article) explaining this at a high level. You can check out the article at towardsdatascience.com/how-to-build-a-regression-model-in-python-9a10685c7f09
      Hope this helps 😃

    • @Mayglie
      @Mayglie 3 years ago

      Thank you professor... I will take a look!

  • @ShoaibKhan-ok4iu
    @ShoaibKhan-ok4iu 2 years ago +3

    Perfectly well put together videos. Just a little request about the linear regression model performance part: can you elaborate a little on what those numbers really mean? Is this model good or bad?

    • @DataProfessor
      @DataProfessor 2 years ago +1

      Hi, thanks for the feedback. Pearson's correlation coefficient (R) ranges from -1 to 1; for the correlation between predicted and actual values it normally falls between 0 and 1, and the higher the number, the better the results. In a nutshell: high is good and low is bad.

    • @ShoaibKhan-ok4iu
      @ShoaibKhan-ok4iu 2 years ago

      @DataProfessor Thanks, huge fan of your work.

  • @Borzacchinni
    @Borzacchinni 2 years ago +2

    Very good video!

  • @caiofernandeschavesmaximia6816

    Greetings from Brazil, professor.
    I'm a beginner in data analysis and I'd like to know if there's any difference when turning the dataset into a dataframe, and if yes, why?
    Thanks

    • @DataProfessor
      @DataProfessor 3 years ago +2

      Hi, by datasets perhaps you are referring to files on your computer, such as CSV files, which need to be read into Python using pandas and converted into a data frame. Such data frames can then be used by machine learning packages such as scikit-learn for model building.

  • @nachomacho7027
    @nachomacho7027 2 months ago

    Perfect

  • @abdullahsaeed3437
    @abdullahsaeed3437 2 years ago

    There are 10 variables in the X data and 1 in Y (which is obvious).
    How can these 10 variables be represented as a linear function of Y?
    Is it really linear regression?

  • @xueqiu946
    @xueqiu946 3 years ago

    Hi Professor, I followed the exact same steps as you, but my coef and intercept are different. Do you know why? By the way, great presentation.

    • @DataProfessor
      @DataProfessor 3 years ago

      Thanks for watching. The difference in value is due to the random seed. If a seed number is set to be the same then the same values should be obtained.
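
A small sketch of the point about the seed, assuming the Diabetes dataset and an 80/20 split: train_test_split shuffles the rows randomly, so fixing its random_state parameter makes the split, and therefore the fitted coefficients, repeatable. The helper fit_with_seed and the seed values 1 and 2 are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)

def fit_with_seed(seed):
    """Split with a fixed seed, fit a linear model, return (coef, intercept)."""
    X_train, _, Y_train, _ = train_test_split(
        X, Y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, Y_train)
    return model.coef_, model.intercept_

coef_a, intercept_a = fit_with_seed(1)
coef_b, intercept_b = fit_with_seed(1)   # same seed, so the same training rows
coef_c, intercept_c = fit_with_seed(2)   # different seed, so different training rows

print("Same seed gives same coefficients:", np.allclose(coef_a, coef_b))
print("Different seed gives same coefficients:", np.allclose(coef_a, coef_c))
```

Without a fixed random_state, each run trains on a different 80% of the rows, which is enough to shift the coefficients and intercept slightly.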

  • @LazedMusic
    @LazedMusic 4 years ago +1

    What is the purpose of the train test split function?

    • @DataProfessor
      @DataProfessor 4 years ago +2

      It allows us to split the data into a train subset and a test subset. The train subset is used to build the model, which is then applied to the test subset to make predictions. The purpose of data splitting is to let us assess whether the constructed model will perform well on new, unseen data. I've also written a Medium article with illustrations at towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1
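
The workflow described in this reply can be sketched as follows, assuming the Diabetes dataset, an 80/20 split, and scikit-learn's LinearRegression. The random_state value is an arbitrary choice to make the run repeatable, and the printed scores are not claimed to match the video.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)

# Keep 20% of the rows aside; the model never sees them during training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Build the model on the train subset only.
model = LinearRegression()
model.fit(X_train, Y_train)

# Apply the model to the unseen test subset and compare with the true values.
Y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print("Test MSE:", mse)
print("Test R^2:", r2)
```

Because the metrics are computed on rows the model was not fitted on, they give an estimate of how it would behave on new data rather than how well it memorized the training rows.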

    • @LazedMusic
      @LazedMusic 4 years ago +1

      @DataProfessor wow bro! Thank you so much!

    • @DataProfessor
      @DataProfessor 4 years ago

      A pleasure, glad it was helpful 😁

  • @bhankit1410
    @bhankit1410 4 years ago +1

    Hello DataProfessor. I am a beginner in ML and have learned some basic concepts of linear/logistic regression, SVM, ANN, Recommendation systems, Anomaly Detection from ML course by Andrew NG on Coursera. I am looking for some good walkthrough videos like these for picking up libraries like sklearn, tensorflow, etc. Do you have a set of videos that could help me?
    P.S - The walkthrough was amazing. thanks for the content.

    • @DataProfessor
      @DataProfessor 4 years ago +3

      Thanks Ankit for the comment and kind words. I have created a playlist called "Python Data Science Projects", available at czcams.com/video/XmSlFPDjKdc/video.html, where I give walkthrough tutorials on using the scikit-learn package to solve various problems in machine learning. I've also started to create beginner-friendly videos in the "Python Programming 101" playlist, available at czcams.com/video/6UcWs33Xti0/video.html. Thanks for the suggestion, I'm also looking to expand into additional ML packages in Python.

    • @bhankit1410
      @bhankit1410 4 years ago +2

      @DataProfessor Thank you. I really appreciate your help. I will go through the playlists. 😊😊

    • @DataProfessor
      @DataProfessor 4 years ago +1

      @bhankit1410 It's a pleasure 😃

  • @linyerin
    @linyerin 2 years ago +1

    Great video

  • @Luuckx2
    @Luuckx2 4 years ago +1

    Nice!

    • @DataProfessor
      @DataProfessor 4 years ago

      Thanks Lucas! If you find value in the video, could you give it a Like? Thanks!

  • @adiflorense1477
    @adiflorense1477 3 years ago

    8:29 Is the coefficient the same as the weight?

    • @DataProfessor
      @DataProfessor 3 years ago +1

      Hi, yes, the regression coefficients can be said to tell us the relative weight or magnitude by which each variable contributes to the calculation of Y.

    • @adiflorense1477
      @adiflorense1477 3 years ago

      @DataProfessor Sir, are these weights the same as those used to improve the accuracy of the Naive Bayes algorithm?

  • @sriramvaidyanathan5094

    How can you get biological activity data from
    Which databases are good for IC50 values?
    Is there any rule of thumb for searching?
    Could anyone kindly help, please?

  • @salikmalik7631
    @salikmalik7631 3 years ago +1

    Hi Data Professor. Can you suggest a machine learning book that I should buy as a beginner?

    • @DataProfessor
      @DataProfessor 3 years ago

      Hi Salik, I have a couple of recommended books that I normally include in the video description; here they are (includes affiliate links). The Hands-On book is definitely a must-read, it is really all you need to get started and beyond, though the Python Data Science Handbook (a free version is available online from the author, let me find the link and post it below) is also a great read.
      Recommended Books:
      🌟kit.co/dataprofessor
      ✅ Hands-On Machine Learning with Scikit-Learn : amzn.to/3hTKuTt
      ✅ Data Science from Scratch : amzn.to/3fO0JiZ
      ✅ Python Data Science Handbook : amzn.to/37Tvf8n
      ✅ R for Data Science : amzn.to/2YCPcgW
      ✅ Artificial Intelligence: The Insights You Need from Harvard Business Review: amzn.to/33jTdcv
      ✅ AI Superpowers: China, Silicon Valley, and the New World Order: amzn.to/3nghGrd

    • @DataProfessor
      @DataProfessor 3 years ago

      The free online version for Python Data Science Handbook is available at jakevdp.github.io/PythonDataScienceHandbook/

    • @salikmalik7631
      @salikmalik7631 3 years ago

      @DataProfessor Thanks for your reply.
      I have the first edition of Hands-On Machine Learning.
      Is there any difference between the first and second editions?

  • @3_anisaanggraeny404
    @3_anisaanggraeny404 2 years ago

    Hi prof, I want you to know that this video was very helpful for me. Thank you.

  • @josephozurumba
    @josephozurumba 1 year ago

    I really do feel that most people just post videos for posting's sake. Most datasets in real life have character (text) values that need to be converted using an encoder, because ML models do not use objects for prediction, but floating-point numbers. Please can someone help with a video on how I can build a model from a dataset with character values?
    Thank you Professor, well explained.
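
One common way to do the conversion this comment asks about is one-hot encoding. The sketch below uses scikit-learn's OneHotEncoder on a made-up single text column; the colour values are hypothetical and only illustrate how text categories become numeric columns that a model such as LinearRegression can accept.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical text column: one feature, four samples (each row is one sample).
colors = [["red"], ["green"], ["blue"], ["red"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the columns are blue, green, red.
print(encoder.categories_[0])
print(encoded)
```

Each text value becomes a row of 0s with a single 1 in the column of its category, and the resulting numeric array can be combined with the other numeric features before fitting a model.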

  • @animelover5093
    @animelover5093 1 year ago

    T.T I still couldn't figure out why we need linear regression.
    I think I need to read more!
    I'm bad at maths ~~~

  • @DrRandyDavila
    @DrRandyDavila 3 years ago +1

    Got a quick question, are you a professor? I ask because I'm a prof of data science and would like to chat.

    • @DataProfessor
      @DataProfessor 3 years ago

      Technically, I'm an Associate Professor of Bioinformatics, I can be reached at hellodataprofessor@gmail.com

  • @jordanhensiek3882
    @jordanhensiek3882 4 years ago

    7:25 MAE and others

  • @indahmustikarahayu1584
    @indahmustikarahayu1584 3 years ago +1

    #scikitlearn

  • @nikhilsannat5429
    @nikhilsannat5429 2 years ago +1

    He looks like a mature version of Zoma.

  • @DataProfessor
    @DataProfessor 4 years ago +1

    If you find value in this video, please give it a Like 👍 and Subscribe ❤️ if you would like to see more Data Science videos.

  • @thennarasuthen9179
    @thennarasuthen9179 3 years ago +1

    Please zoom in a bit, professor. Thank you for the video.

    • @DataProfessor
      @DataProfessor 3 years ago +1

      Noted, for future videos I have zoomed in on the screen. Thank you for the suggestion. 😊

    • @thennarasuthen9179
      @thennarasuthen9179 3 years ago +1

      @DataProfessor Thank you...

  • @atharvparlikar8765
    @atharvparlikar8765 2 years ago +2

    Damn, this guy looks a lot like Joma.

  • @HFCosta83
    @HFCosta83 2 years ago

    FRAJOLA

  • @DilpreetSingh-sw3ei
    @DilpreetSingh-sw3ei 3 years ago

    Good only for imitation purposes; nothing useful for applying to our own projects.

    • @DataProfessor
      @DataProfessor 3 years ago

      Thanks for the feedback, this video is meant for beginners. There's a playlist showing its application to a bioinformatics project here czcams.com/play/PLtqF5YXg7GLlQJUv9XJ3RWdd5VYGwBHrP.html

  • @alexmichelii6797
    @alexmichelii6797 2 years ago

    Great video. "from statistics import LinearRegression" did not work for me;
    I had to use "from sklearn.linear_model import LinearRegression" to make it work.
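
For anyone hitting the same import error, here is a tiny check that the import from sklearn.linear_model works, fitted on made-up data that follows y = 2x + 1 exactly. The toy numbers are an assumption for illustration only and are not from the video.

```python
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

model = LinearRegression().fit(X, y)

# The fitted slope should be close to 2 and the intercept close to 1.
print(model.coef_, model.intercept_)
```

Python's built-in statistics module has no LinearRegression class, which is why the first import fails; the estimator lives in scikit-learn's linear_model module.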