GLM in R

  • Published 13 Oct 2020
  • In this video we walk through a tutorial for Generalized Linear Models in R. The main goal is to show how to use this type of model, focusing on logistic regression, and talk a bit about why it's a good tool to know.
    The tutorial discusses both GLM and multilevel models, but the video has been split into two parts.
    github.com/ccs-amsterdam/r-co...
    We also have a more dedicated tutorial for GLM in R. It's best viewed via the GitHub HTML preview, but YouTube mangles these links, so you'll have to find the link on this page under 'statistical analysis - generalized linear models':
    github.com/ccs-amsterdam/r-co...
    A great book on GLM (I'm not sure whether the digital version is meant to be freely available, but I just stumbled upon this PDF):
    www.utstat.toronto.edu/~brunne...

Comments • 30

  • @randomdude4411
    @randomdude4411 25 days ago

    This is a brilliant tutorial on GLM in R, with a very good step-by-step breakdown of all the information that is understandable for a beginner

  • @djyi2174
    @djyi2174 2 years ago

    Thank you so much for the tutorial.

  • @philip_che
    @philip_che 3 years ago

    Thank you for these videos!

  • @kariiamba7324
    @kariiamba7324 2 years ago

    Thank you for this helpful video

  • @ammarparmr
    @ammarparmr 2 years ago +2

    Very well explained!!! However, in my opinion, using the coefficients in the summary is by far much easier to understand than the way with tab_model

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      Hi Ammar, sorry I missed this comment, but I would like to make a case for odds ratios ;). The benefit of log odds ratios is, I think, only that the sign corresponds to the effect direction. But the values are very hard to interpret. With odds ratios you can say things like "for a unit increase in x, the odds of y increase by a factor of 2 (aka twice the odds)". Is there a benefit of using log odds ratios that I'm overlooking?
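The odds-ratio reading described above can be checked with a minimal base-R sketch (simulated data; the variable names and effect size are hypothetical, chosen so the true odds ratio is 2):

```r
set.seed(1)
n <- 5000
x <- rnorm(n)
# Simulate a binary y whose log odds increase by log(2) per unit of x,
# i.e. the odds of y = 1 double for every unit increase in x
y <- rbinom(n, 1, plogis(-1 + log(2) * x))
m <- glm(y ~ x, family = binomial)

coef(m)["x"]       # log odds ratio, roughly 0.69: only the sign is easy to read
exp(coef(m)["x"])  # odds ratio, roughly 2: "twice the odds per unit of x"
```

The same `exp()` transformation is what tab_model applies when it shows odds ratios instead of raw coefficients.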

  • @hm.91
    @hm.91 2 years ago

    Thank you!

  • @michellelaurendina
    @michellelaurendina 3 months ago

    THANK. YOU.

  • @user-gd2yz3dj3b
    @user-gd2yz3dj3b 2 years ago

    Hi Kasper, thank you for the wonderful video. I have a question about the R2 and adjusted R2 of GLM models in R. How can we get R2 and adjusted R2 in the R console? I cannot find these values when I run summary(). Is there a specific code to get them?

    • @kasperwelbers
      @kasperwelbers  2 years ago +1

      Hi, great question! The thing is, there actually isn't an R2 or adjusted R2 for GLM. Instead, to evaluate model fit, it is more common to compare models (in the second link in the description, see logistic regression -> interpreting model fit and pseudo R2). There ARE, however, also some 'pseudo R2' measures, such as the Tjur R2 seen in the video. These measures try to imitate the property of R2 as a measure of explained variance. You'll never get these scores in the basic glm output, though, because there are many possible pseudo R2 measures. But there are packages that implement them. For instance, the 'performance' package has an r2() function that calculates a (pseudo) R2 for different types of models.
      I'd also recommend reading about the model comparison approach (if you don't know about it already), because journals often like to see this rather than, or in addition to, some pseudo R2.
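Two of the pseudo R2 measures mentioned above can be computed by hand from the glm output alone, without extra packages. This is a sketch on simulated data (names and effect sizes are hypothetical); the 'performance' package's r2() would give you these in one call:

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))
m <- glm(y ~ x, family = binomial)

# McFadden's pseudo R2: 1 - residual deviance / null deviance.
# Imitates "explained variance", but is only one of many pseudo R2s.
mcfadden <- 1 - m$deviance / m$null.deviance

# Tjur's R2 (the one shown in the video): difference in mean predicted
# probability between the observed 1s and the observed 0s.
p <- fitted(m)
tjur <- mean(p[y == 1]) - mean(p[y == 0])

c(mcfadden = mcfadden, tjur = tjur)
```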

    • @user-gd2yz3dj3b
      @user-gd2yz3dj3b 2 years ago

      @@kasperwelbers Thank you so much for the quick reply! It was really helpful and easy to understand :)
      One more question! I will be conducting GLM in my master's thesis. Which would you recommend?
      1. Report the AIC value (and write something like "this model had the smallest AIC value")
      2. Try calculating pseudo R2 measures and report them

    • @kasperwelbers
      @kasperwelbers  2 years ago

      ​@@user-gd2yz3dj3b I'd actually recommend reporting deviance AND some pseudo R2. The pseudo R2 is nice to help along interpretation, but deviance is more appropriate, and it also provides a nice test of whether adding variables to a model gives a significant increase in fit. Say you have models of increasing complexity (i.e. adding variables): m0, m1 and m2. For GLMs, you can then use: anova(m0, m1, m2, test = "Chisq"). In the output, the deviance column for the m1 row tells you how much deviance decreased compared to m0, and the Pr(>Chi) column tells you whether this improvement was significant (and the same for m2 compared to m1). Alternatively, you could use sjPlot's tab_model and just add the AIC and/or deviance directly to the table: tab_model(m0, m1, m2, show.aic = T, show.dev = T).
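The model-comparison workflow described above, sketched end to end on simulated data (variable names and effects are hypothetical; here only x1 truly affects y):

```r
set.seed(7)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.2 + 1 * x1))  # x2 has no real effect

# Models of increasing complexity
m0 <- glm(y ~ 1,       family = binomial)
m1 <- glm(y ~ x1,      family = binomial)
m2 <- glm(y ~ x1 + x2, family = binomial)

# Each row shows the drop in deviance relative to the previous model,
# with a chi-squared test of whether that drop is significant
anova(m0, m1, m2, test = "Chisq")
```

In this setup the m1 row should show a large, significant deviance drop (x1 matters), while the m2 row typically will not (x2 is noise).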

    • @user-gd2yz3dj3b
      @user-gd2yz3dj3b 2 years ago

      @@kasperwelbers Thank you so much, Kasper! I will try calculating deviance and pseudo R2 using the code you suggested :) Can I ask another question via email or something? I’m sorry to be a pain, but I think you can answer another big question I have🙇‍♂️

    • @kasperwelbers
      @kasperwelbers  2 years ago

      @@user-gd2yz3dj3b No problem! I do however prefer to keep questions based on these videos confined to YouTube (and not too big). Especially at the moment, with the whole corona teaching situation, I'm swamped with emails, and I do need to prioritize my direct students. For bigger questions, I also think it's best to find someone at your uni (ideally your supervisor or someone in the same department). Not only because they presumably can invest more time, but also because for more specific problems there tend to be differences across disciplines/traditions in how to do statistics.

  • @MyPimpstyle
    @MyPimpstyle a year ago

    Hi Kasper, what/how much does the intercept tell us in this case?

    • @kasperwelbers
      @kasperwelbers  a year ago +1

      Good question! It's similar to ordinary regression, in that it just means: the expected value of y if x (or all x-es in a multiple regression) is zero. This is mainly interpretable if there is a clear interpretation of what x=0 means. For instance, say your model is: having_fun = intercept + b*beers_drank. In that case, the intercept is the expected fun you have if you haven't had any beers.
      Now say we have a binomial model. Our dependent variable is binary, namely whether or not a person had a hangover the day after a party. This time, the effect is more like (but not exactly, I'm ignoring the link function): hangover = intercept * b^beers_drank. Notice that ^ in b^beers_drank. That's the multiplicative part: we expect that the odds of having a hangover increase by a 'factor of b' for every unit increase in beers. But what's most relevant for us now is that anything raised to the power of zero is 1! So b^0 (zero beers) is 1. So here as well, it means that when x is zero, the intercept is just our expected value.
      If we've transformed our coefficients to odds ratios, then if we haven't had any beers, the intercept would represent the odds that someone had a hangover. So if the intercept is 2, it would mean that the odds that someone who didn't have any beers has a hangover are 2-to-1, so a probability of 0.66 (odds of 2-to-1 means 2 people out of 3). That sounds weird, but they probably had whisky instead.
      I don't know how much that helped. The key takeaway is that, like with ordinary regression, the intercept is mainly interpretable if you have a clear idea of what x=0 means.
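The beers-and-hangover example above can be turned into a small simulation (all names and coefficients are hypothetical, picked so the true intercept on the log-odds scale is -1):

```r
set.seed(3)
n <- 2000
beers <- rpois(n, 2)
# True model: log odds of hangover = -1 + 0.8 * beers
hangover <- rbinom(n, 1, plogis(-1 + 0.8 * beers))
m <- glm(hangover ~ beers, family = binomial)

b0 <- coef(m)[[1]]  # intercept on the log-odds scale
exp(b0)             # as an odds ratio: the odds of a hangover at zero beers
plogis(b0)          # the same value as a probability: odds / (1 + odds)
```

So with zero beers (x = 0), the slope term vanishes and the exponentiated intercept is directly the expected odds for that baseline group.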

  • @954giggles
    @954giggles 2 years ago

    Do you need to install any packages to run the glm code?

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      The glm function is in the stats package, which comes shipped with the basic R installation, so you don't necessarily need other packages. But in the tutorial I do use some packages for convenience, such as the sjPlot package for making a regression table. If you run this without sjPlot the results are the same, but you'll need to do some calculations yourself. For instance, logistic regression gives log odds ratio coefficients, so you'd need to take the exponent (the exp function) to get the odds ratios. TL;DR: you don't need to install packages, but they do make life easier.
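What a tab_model-style table boils down to in base R, as a sketch on simulated data (variable names are hypothetical; confint.default gives Wald intervals, which may differ slightly from profile-likelihood ones):

```r
set.seed(10)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 0.7 * x))
m <- glm(y ~ x, family = binomial)

summary(m)$coefficients  # raw output: log odds ratios, SEs, z, p
exp(coef(m))             # odds ratios, as a regression table would show them
exp(confint.default(m))  # Wald confidence intervals on the odds-ratio scale
```

Everything here ships with base R; packages like sjPlot just wrap these steps in a nicely formatted table.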

  • @audreyq.nkamngangk.7062
    @audreyq.nkamngangk.7062 9 months ago

    Thank you for the tutorial. Is it possible to create a glm model with a dependent variable that has 3 modalities?

    • @kasperwelbers
      @kasperwelbers  9 months ago

      If I understand you correctly, I think it's indeed possible to model a dependent variable with a tri-modal distribution with glm. Actually, you might not even need glm for that. Whether a distribution is multimodal is a separate matter from the distribution family: a tri-modal distribution might be a mixture of three normal distributions, three binomial distributions, etc. Take the following simulation as an example. Here we create a y variable that is affected by a continuous variable x and a factor with three groups. Since there is a strong effect of the group on y, this results in y being tri-modal.
      ## simulate 3-modal data
      n = 1000
      x = rnorm(n)
      group = sample(1:3, n, replace = TRUE)
      group_means = c(5,10,15)
      y = group_means[group] + x*0.4 + rnorm(n)
      hist(y, breaks=50)
      m1 = lm(y ~ x)
      m2 = lm(y ~ as.factor(group) + x)
      summary(m1) ## bad estimate of x (should be around 0.4)
      plot(m1, 2) ## error is non-normal
      summary(m2) ## good estimate after controlling for group
      plot(m2, 2) ## error is normal after including group

  • @draprincesa01
    @draprincesa01 a year ago

    How can I visualize it if some variables are factors, like yes or no?

    • @kasperwelbers
      @kasperwelbers  a year ago

      I think sjPlot handles those pretty nicely! There's some great explanations on the website, under the regression plots tab: strengejacke.github.io/sjPlot/

    • @JT-ph3hk
      @JT-ph3hk a year ago

      Use the function str(yourbasename). If the variable is not yet a factor, you can transform it with as.factor(), e.g.: yourbasename$nameofthefactor <- as.factor(yourbasename$nameofthefactor)

  • @rubyanneolbinado95
    @rubyanneolbinado95 2 months ago

    Hi, why is RStudio producing different results even though I am using the same call and data?

    • @kasperwelbers
      @kasperwelbers  2 months ago

      Hi! Do you mean vastly different results, or very small differences? I do think some of the multilevel stuff could potentially differ due to random processes in model convergence, but if so the differences should be really minor.

  • @DavidKoleckar
    @DavidKoleckar 5 months ago

    Nice audio, bro. Did you record in a bathroom?

    • @kasperwelbers
      @kasperwelbers  5 months ago

      Ahaha, not sure whether that's a question or a burn 😅. This is just a Blue Yeti mic in the home office I set up during the COVID lockdowns. The room itself has pretty nice acoustic treatment, but I was still figuring out in a rush how to make recordings for lectures/workshops, and it was hard to get clear audio without keystrokes bleeding through.