StatQuest: DESeq2, part 1, Library Normalization

  • Added 8 June 2024
  • DESeq2 is a complicated program used to identify differentially expressed genes. Here I clearly explain the first thing it does: normalize the libraries.
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    Correction:
    9:28 I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from the log of each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
    #statquest #rnaseq #deseq2

Comments • 144

  • @statquest
    @statquest  4 years ago +15

    Correction:
    9:28 I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from the log of each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @deni9264
    @deni9264 6 years ago +27

    Public Disclaimer: Watching Josh's introductory vids on RNAseq analysis, including this video, (sometimes more than once ;-) ) is a useful primer if you're just starting off in RNA-seq analysis. Watching these videos helped me make sense of the RNA-seq DE pipeline, such as the nature of the inputs and the rationale of the methods and metrics.

    • @statquest
      @statquest  6 years ago

      Awesome!!! Thanks for the endorsement! :)

  • @sureshkumar-kx2xz
    @sureshkumar-kx2xz 4 years ago +11

    I am a neuroscientist from MIT with no previous background in RNA-seq or molecular biology. This video summarized DESeq2 in 12 minutes, which is super cool!!! I quickly understood DESeq2 in 12 minutes.

  • @vanya.antonov
    @vanya.antonov 7 years ago +1

    I had a hard time understanding from the original paper how the normalization coefficients are computed. This video helps a lot! Thank you!

  • @jonathanjavid7206
    @jonathanjavid7206 3 years ago +1

    Well explained!!! These steps give a clearer picture of DEG analysis with DESeq2. Thank you, Josh, for this valuable video.

  • @ElNick09
    @ElNick09 5 years ago +7

    These videos are an absolutely fantastic resource. Really, thank you so much!

    • @statquest
      @statquest  5 years ago +1

      I'm so happy to hear that you like them! :)

  • @Mortezakhabiri1
    @Mortezakhabiri1 7 years ago

    The way you explain things is amazing! I was looking for such an explanation for a long time. It is very comprehensive. Thanks!

    • @Mortezakhabiri1
      @Mortezakhabiri1 7 years ago

      Great! Great! Great!
      It would also be great if you could recommend some good books!
      Thanks!

    • @Mortezakhabiri1
      @Mortezakhabiri1 7 years ago

      I could not find any, too!
      Thanks!

  • @footboro
    @footboro 7 years ago +1

    Very nice tutorial, effortless stat learning.
    Thank you, Joshua

  • @Seeeevi
    @Seeeevi 3 years ago +3

    I have to give a talk on DESeq2 for a data analysis course and this is what I'm starting it off with. Thank you a lot, seriously.

  • @deni9264
    @deni9264 6 years ago +3

    Thank you so much Josh!! Your explanation has helped me overcome my anxiety about learning DESeq2...it's complex but not so bad. Thaaank you. I'm sharing this vid with my comp. bio journal club, too

    • @deni9264
      @deni9264 6 years ago

      Do you plan on doing an overview of limma-voom as well?

    • @statquest
      @statquest  6 years ago

      A lot of people have started to ask about that. I'll put it on the to-do list and look into it.

  • @MrZanvine
    @MrZanvine 6 years ago +3

    Hey Joshua, thanks so much for making these videos - they are immensely helpful.
    I think I noticed a small mistake when you transform the median values into normal numbers. For sample 2 you have e^0.3, but the median is -0.1.

  • @mrlolzot
    @mrlolzot 4 years ago +1

    Great stuff dude. Thanks for making this.

  • @dhkwnr97
    @dhkwnr97 6 years ago

    It's really helpful for my research. THANK YOU A LOT!

  • @williammo4450
    @williammo4450 4 years ago +2

    I like this guy! Thanks for your careful explanation! Keep it up!

  • @jeremyjacobsen6400
    @jeremyjacobsen6400 1 year ago +1

    I wrote a Python script to do this procedure and, because of your tutorial, found what might be an error. In particular, I was filtering inf(s) before calculating the median of the logs. Thanks Josh!
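A minimal sketch of the procedure as it is described in this thread (log the counts, average the logs per gene, drop genes whose average is infinite, subtract, take the per-sample median, exponentiate, divide). It is an illustration only, not the commenter's script and not DESeq2's own code; the function names `size_factors` and `normalize` and the example counts are made up here.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: dict mapping gene -> list of raw read counts, one per sample.
    Returns one scaling factor per sample (median-of-ratios sketch)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene, reads in counts.items():
        # A zero in any sample gives log(0) = -inf, so the gene's average
        # log is -inf as well; such genes are skipped for this step only.
        if any(r == 0 for r in reads):
            continue
        logs = [math.log(r) for r in reads]
        mean_log = sum(logs) / len(logs)  # log of the geometric mean
        for j, value in enumerate(logs):
            log_ratios[j].append(value - mean_log)
    # Median of the log ratios in each sample, converted back with exp().
    return [math.exp(median(column)) for column in log_ratios]

def normalize(counts, factors):
    # Every gene, including those with zeros, is divided by the factors.
    return {gene: [r / f for r, f in zip(reads, factors)]
            for gene, reads in counts.items()}

# Made-up counts for three genes in three samples.
counts = {
    "gene1": [0, 10, 4],
    "gene2": [2, 6, 12],
    "gene3": [33, 55, 200],
}
factors = size_factors(counts)
normalized = normalize(counts, factors)
print(factors)
```

Note where the filtering happens in this sketch: whole genes are dropped before the subtraction, so every sample's median is taken over the same set of genes.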

  • @haitrieuphan3832
    @haitrieuphan3832 6 years ago

    Thank you so much for very useful videos

  • @AnnaJeanine
    @AnnaJeanine 7 years ago

    Another amazing video!

  • @stevebarratt888
    @stevebarratt888 1 year ago +1

    such a great explanation!

  • @aealarco
    @aealarco 7 years ago

    Thank you very much, it was very useful

  • @muffinman1
    @muffinman1 4 years ago +1

    Fantastic explanation.

  • @bzaruk
    @bzaruk 3 years ago

    First of all - I LOVE your stuff! So helpful and clear!
    Quick question though - I have RNA-Seq data from experiments on 4 different cell lines. Each cell line has 3 biological replicates with 3 technical replicates each, and I want to normalize those RNA-Seq results to compare between the cell lines.
    You mentioned in the video that DESeq wasn't meant to normalize between different read counts but between different cells, which is exactly what I am doing. BUT I do have some differences in read counts between the technical replicates, especially between the 1st biological replicate and both the 2nd and the 3rd, due to different numbers of PCR cycles.
    My question is - do I need to perform any kind of read-based normalization before I do the DESeq normalization?

    • @statquest
      @statquest  3 years ago +1

      Nope! At 4:18 we see that DESeq2 (and EdgeR) can take care of both situations - when there are differences in library sizes and when there are differences in library composition.

  • @maharshichakraborty3530
    @maharshichakraborty3530 5 years ago +2

    Great video! It would have been nice if you could have talked about the negative binomial distribution fitting

    • @statquest
      @statquest  5 years ago +1

      One day I'll get to that part. Hopefully soon.

  • @congchen170
    @congchen170 7 years ago +23

    Found a small mistake from the video: when you explain the library sizes (around 2 minutes), the Sample #2 Gene A2M read counts should be 1126, not 2126.

  • @zuhaibahmed6817
    @zuhaibahmed6817 4 years ago +2

    Thanks for your videos! They really are a huge help. I just have a question about your explanation for differences in library composition at 3:41. I'm not sure I follow. The way I see it, if those 563 reads don't map to A2M, they aren't going to just move onto other genes to inflate their counts. So the only reason that the other genes in library 2 have higher counts is because they had more reads that matched their sequence, indicating that their transcripts were more abundant. Which would mean those other genes are differentially expressed as well, right? If only A2M was differentially expressed, then those other genes would retain their small counts because they aren't transcribed any more than in library 1. Am I misunderstanding something? Thanks
    Edit: I have two other questions as well, if you don't mind:
    1) Does this method of normalization take into account the lengths of the different transcripts like TPM/RPKM/FPKM?
    2) Is this method more robust than TPM/RPKM/FPKM? If so, then should it be used instead of them?
    Sorry for the onslaught of questions. Thanks for the help!

    • @statquest
      @statquest  4 years ago

      In the example at 3:41, there are 635 reads sequenced per sample (yes, these numbers are small compared to a true RNA-seq experiment, but this is just an example). Now, when we do RNA-seq, we extract the mRNA from cells (or a single cell) and then we amplify it with PCR before making the final library that is sequenced. The PCR ensures that we have a lot of stuff to sequence, so much stuff that there is more than we can actually sequence. Thus the example plays out in reality the way it does in this example in the video. When one gene soaks up a lot of reads in one sample, but not in another, then that just means there are more reads going to other genes in the other sample.
      This method does not account for read-lengths, nor should it. DESeq2's model depends only on the number of reads per gene, not the lengths.
      Lastly, TPM/FPKM/etc. are useful when just looking at the data and comparing genes of different lengths.
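The library-composition argument in this reply can be illustrated with a small calculation. This is a sketch under assumptions: the gene names other than A2M, the abundance values, and the helper `expected_reads` are invented for illustration; only the 635 reads per sample and the 563 A2M reads come from the thread.

```python
def expected_reads(abundance, depth):
    """Split a fixed number of sequenced reads across genes in proportion
    to each gene's share of the transcript pool."""
    total = sum(abundance.values())
    return {gene: depth * amount / total for gene, amount in abundance.items()}

# Hypothetical transcript abundances: only A2M differs between the samples.
sample1 = {"A2M": 563, "geneB": 30, "geneC": 24, "geneD": 18}
sample2 = {"A2M": 0, "geneB": 30, "geneC": 24, "geneD": 18}

depth = 635  # same number of sequenced reads in both samples
reads1 = expected_reads(sample1, depth)
reads2 = expected_reads(sample2, depth)

for gene in sample1:
    print(f"{gene}: sample 1 = {reads1[gene]:.1f}, sample 2 = {reads2[gene]:.1f}")
```

Because the sequencer returns a fixed number of reads, the genes whose abundance did not change still receive more reads in sample 2, which is the effect the scaling factor is meant to compensate for.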

    • @zuhaibahmed6817
      @zuhaibahmed6817 4 years ago

      @@statquest Thanks for the clarification

  • @tinacole1450
    @tinacole1450 3 years ago +1

    Your explanations are very good. Thanks !!! The song is funny

  • @katherinemedinaortiz1935

    These videos are awesome

  • @lactobacillusacidophilus

    One question. DESeq2 uses negative binomial regression, so after applying scaling factors, does it also round the normalized numbers to make a real count table of normalized values? Otherwise, can we still use the negative binomial?

  • @alfred532008
    @alfred532008 7 years ago

    Do you have a video explaining more technical aspects of DESeq2, please? E.g. the GLM fitting (eq. 2 in the DESeq2 paper), estimation of dispersion, and estimation of logarithmic fold changes.

  • @Reonsi
    @Reonsi 2 years ago

    As a summary, I would say that the geometric average downplays the effect of outliers at the gene level (rows), while the median downplays outliers at the sample level (column). The subtraction allows us to rank samples by sequencing depth, and the division applies the scaling factor to our original data.

  • @gauss238
    @gauss238 7 years ago

    Please post part 2 soon.

  • @andydavidson3097
    @andydavidson3097 4 months ago +1

    Request: great video on DESeq2 normalization. We already know what the counts are. What I do not understand is how the linear model for each gene is used to calculate the LFC. I really appreciate your explanations

    • @statquest
      @statquest  4 months ago +1

      Thank you! Unfortunately I haven't done this sort of analysis in a long time so I can't promise I'll follow up on it. :(

  • @adrichuuu
    @adrichuuu 6 years ago +1

    Thank you very much!

  • @apulunuj
    @apulunuj 4 years ago +1

    Also, what would you do if you wanted to look at the genes with log infinity values, that is, the cell-type-specific genes @7:57?

    • @statquest
      @statquest  4 years ago +1

      You can always add a "pseudo-count" to the data, like one read for all genes, so that you can avoid the log infinity problem.
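A minimal sketch of the pseudo-count idea from this reply. The function name `log_with_pseudocount` and the default of one read are assumptions for illustration; this shows the workaround suggested here, not a step that DESeq2 is stated to perform.

```python
import math

def log_with_pseudocount(reads, pseudocount=1):
    # Adding a small constant to every count keeps log() finite for zeros.
    return [math.log(r + pseudocount) for r in reads]

# A gene with zero reads in the first sample no longer produces -inf.
print(log_with_pseudocount([0, 10, 4]))
```

The trade-off is that the added read shifts every value slightly, and the shift matters most for genes with very low counts.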

  • @mayling1014
    @mayling1014 2 months ago

    Thank you so much for the great explanation!
    2:08 If all the samples were sequenced at the same time (same sequencing reaction), can the sequencing depth still differ between them?

    • @statquest
      @statquest  2 months ago

      I believe so, because you'll still end up with different numbers of reads per sample.

    • @mayling1014
      @mayling1014 2 months ago

      @@statquest Does this imply that even if the sequencing depth is standardized to 20x coverage across all samples, the number of reads corresponding to transcripts of gene A may still vary between samples, even if the expression level of gene A is the same in both sample 1 and sample 2?

    • @statquest
      @statquest  2 months ago

      @@mayling1014 I believe there is a stochastic (random) nature to the hybridization between the reads and the chip used for sequencing. So there is a chance that not every sample gets exactly the same number of reads because not every sample binds to exactly the same number of spots on the chip. And not every read is the same quality, and that could also result in different numbers of reads per sample after you filter out low quality reads.

  • @CaveCrack
    @CaveCrack 3 years ago

    Josh, thanks for your wonderful series of videos. I have a question about using the DESeq2 normalization method on TPM data. I have TPM from RSEM output, so each sample of course sums to 1 million. It seems that using DESeq2-style normalization on this TPM data would be valuable as it will adjust for library composition. I am not using R, so I'm not using the DESeq2 Bioconductor package, just computing the normalization as you describe. Documentation on the DESeq2 package says the counts should be raw counts; however, it seems that TPM would be just as valid if normalization is the only step of interest. Is this correct? Thanks

    • @statquest
      @statquest  3 years ago

      DESeq2's normalization assumes the data are raw because it does part of what TPM attempts to do, compensate for sequencing depth differences. When you start with TPM values, DESeq2 can no longer make that adjustment the way it wants to.

    • @godsperson5571
      @godsperson5571 3 years ago

      @@statquest so is it good or bad?

  • @nishantshade668
    @nishantshade668 2 years ago

    The scaling factor which you mentioned at 4:46, is it the same as the work done by the 'Estimate Size Factor' function in R programming??

    • @statquest
      @statquest  2 years ago

      Unfortunately it's been so long since I used DESeq2 that I can't remember.

  • @alfred532008
    @alfred532008 7 years ago

    Is there any obvious reason for using geometric mean instead of arithmetic mean when calculating a scaling factor?

  • @Aviad3587
    @Aviad3587 2 years ago +1

    i

  • @Moominverdatre
    @Moominverdatre 3 years ago

    Thanks for the great video. What you call "scaling factor" is the output of the function estimateSizeFactors, right? The name is a little bit misleading for someone who's already very confused with all the different normalisation methods!

  • @hommejuhyun
    @hommejuhyun 5 years ago

    Thank you for your good explanation! Umm.. So DESeq2 is only used to find moderately expressed genes in different tissues, right?

    • @statquest
      @statquest  5 years ago +1

      DESeq2 can find differentially expressed genes among different tissues, or within the same tissue if, for example, one is diseased and the other is healthy.

    • @hommejuhyun
      @hommejuhyun 5 years ago +1

      @@statquest Oh, I see!! Thank you for your good example.
      I have one more question :)
      Is there a name for the DESeq2-normalized values, like TPM or RPKM?

    • @statquest
      @statquest  5 years ago

      @@hommejuhyun Not that I know of.

  • @apulunuj
    @apulunuj 4 years ago

    Regarding the samples in each DESeq analysis: could they be different biological replicates, or does each sample correspond to a different cell type?

    • @statquest
      @statquest  4 years ago

      It could be anything - it could be technical replicates, biological replicates or different cell types. Whatever it is you want to study.

  • @lycz9869
    @lycz9869 2 years ago

    Around 12:20 you say that the idea of logs and median is to look at house keeping genes and to eliminate all genes which are only transcribed in one sample. But why should we do this? If we knock out a transcription factor to find its function this is exactly what we are interested in. Or does this method serve a different purpose?
    Thank you!

    • @statquest
      @statquest  2 years ago +1

      At this stage, all we are interested in is normalizing the read counts to compensate for differences in sequencing depth and library composition. Later, once the read counts are normalized, then we will use statistics to identify differentially expressed genes.

  • @fantasy6611
    @fantasy6611 5 years ago +1

    Another small mistake I found is that, around 10 mins, sample #2 should be e^-0.1 = 0.9. Anyway, thanks a lot!
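The arithmetic behind this comment is quick to check: the median log value for a sample is converted to a scaling factor with the exponential function. A short sketch, with the variable names chosen here for illustration:

```python
import math

median_log = -0.1  # median log ratio quoted in the comment for sample #2
scaling_factor = math.exp(median_log)
print(round(scaling_factor, 2))  # 0.9
```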

  • @tomy34188
    @tomy34188 3 years ago

    So if you want to investigate cell differentiation using RNA-seq data, would it be wise to apply DESeq2? Because non-house keeping genes would also be of interest here I assume and those would be filtered out with DESeq2 or am I mistaken?

    • @statquest
      @statquest  3 years ago

      Yes, I think DESeq2 would be a good tool for that.

  • @shixiangwang
    @shixiangwang 5 years ago +1

    Thanks.

  • @leixiao169
    @leixiao169 2 years ago

    Thanks for the really helpful video! If DESeq2 removes genes that have 0 reads, does this affect the interpretation of the results? For example, different tissues express different genes (in some tissues the expression of certain genes is 0). For some genes with "0" expression in certain tissues, the difference between these tissues and the tissues in which these genes are highly expressed is physiologically relevant. I hope the program still keeps these "0" read genes.

    • @statquest
      @statquest  2 years ago +1

      Yes, it keeps those genes (with 0 reads), however, those genes are not used to calculate the scaling factor.

    • @leixiao169
      @leixiao169 2 years ago

      @@statquest Thanks Josh. If DESeq2 keeps those genes with 0 reads, then it is possible that those genes will be listed as significantly differentially expressed genes in the volcano plot. Do I understand that right?

    • @statquest
      @statquest  2 years ago +1

      @@leixiao169 Presumably.

  • @vigneshparasuraman
    @vigneshparasuraman 5 years ago

    Can anyone help me normalize Excel data in DESeq2? Where can I find a clear script?

  • @bzaruk
    @bzaruk 2 years ago

    How would you do a differential expression analysis between multiple cell lines? Do them in pairs and then find the shared highly differentially expressed genes? Or is there a way of doing it in one analysis?

    • @statquest
      @statquest  2 years ago +1

      This is a good question. Unfortunately it's been a while since I used DESeq2, however, I remember that you can pretty much do any sort of "linear model" type test, so you should be able to do anova or something like that.

    • @bzaruk
      @bzaruk 2 years ago +1

      @@statquest Thanks! appreciate it!

  • @manuelsokolov
    @manuelsokolov 1 year ago +1

    If I have data already in TPM (transcripts per million), can I still apply DESeq2?

  • @nastiaskuba8773
    @nastiaskuba8773 4 months ago

    There is a problem at 2:13: the A2M gene in sample 2 should have 1126 reads, not 2126. Anyway, thank you for the video, very useful for beginners, and in general a nice and unique style!

    • @statquest
      @statquest  4 months ago

      Sorry for the typo, but I'm glad it didn't get in the way of you understanding the ideas. BAM! :)

  • @xuxiaochenwu9376
    @xuxiaochenwu9376 1 year ago

    Hi, I noticed a mistake @10:37: for sample #2, e should be raised to -0.1 instead of -0.3. Correct me if I am wrong.

  • @jyoti9426
    @jyoti9426 3 years ago +1

    It's the intro song for me! \m/

  • @aoihana1042
    @aoihana1042 6 years ago

    A 1000 Likes! Thank you Josh! 😭😭🙏🙏🙏

  • @bzaruk
    @bzaruk 2 years ago

    DESeq2 with only one replicate for each group - is it possible? If not, is there any good alternative to detect differential gene expression with one replicate per cell line?

    • @statquest
      @statquest  2 years ago +1

      I'm almost certain it can. I know I've done it with EdgeR before. The manual for EdgeR gives an example and tells you how to set certain parameters that are usually estimated when you have more data. Presumably you can do something similar with DESeq2.

  • @lealemler2967
    @lealemler2967 3 years ago

    Thank you very much. On which paper is this based?

    • @statquest
      @statquest  3 years ago

      The original DESeq2 manuscript.

    • @lealemler2967
      @lealemler2967 3 years ago

      @@statquest Thank you but there are many DESeq2 papers, do you mean this one: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 Michael I Love, Wolfgang Huber and Simon Anders 2014? Thank you

    • @statquest
      @statquest  3 years ago +1

      Yes, that's the one. I also went through the code to see exactly what it was doing.

    • @lealemler2967
      @lealemler2967 3 years ago +1

      @@statquest thank you very much!! :) :)

  • @kiarashbike
    @kiarashbike 3 years ago

    Hey, sorry, is this what the DESeq and vst functions do in the DESeq2 package?

    • @statquest
      @statquest  3 years ago

      I'm not sure I understand your question. Can you rephrase it?

    • @kiarashbike
      @kiarashbike 3 years ago

      @@statquest Oh, sorry, I should have explained that a bit more specifically. When we run code in R using the DESeq2 package to analyse RNA-seq data, we have to transform the data using the vst (variance stabilizing transformation) function. In this video, you explained nicely what DESeq2 does to normalize RNA-seq data. I'm asking whether this normalization does the same thing as the vst function in R?

    • @statquest
      @statquest  3 years ago

      @@kiarashbike I believe VST is different.

  • @johirislam8174
    @johirislam8174 3 years ago

    I have some other queries regarding DEG analysis. I want to compare the differentially expressed genes from two datasets. For example, one dataset contains 108 DEGs and the other contains 70, and I want to see the genes common to the two datasets. How can I do that, and how can I make a Venn diagram of them? Moreover, I saw that some GEO datasets come in tsv and txt file formats. How can I analyse that kind of file? Please help me solve these two problems.

    • @statquest
      @statquest  3 years ago

      I'll keep those topics in mind.
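For the first part of the question in this thread, finding the genes shared by two DEG lists, a minimal sketch with Python sets is shown here. The gene symbols and the helper `compare_gene_lists` are hypothetical; the three set sizes are the numbers a two-circle Venn diagram would display.

```python
def compare_gene_lists(genes_a, genes_b):
    """Return the genes unique to each list and the genes shared by both."""
    a, b = set(genes_a), set(genes_b)
    return {"only_a": a - b, "shared": a & b, "only_b": b - a}

# Hypothetical gene symbols standing in for two DEG lists.
degs_study1 = ["A2M", "TP53", "MYC", "GAPDH"]
degs_study2 = ["MYC", "TP53", "EGFR"]

overlap = compare_gene_lists(degs_study1, degs_study2)
print(sorted(overlap["shared"]))  # ['MYC', 'TP53']
print({region: len(genes) for region, genes in overlap.items()})
```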

  • @fmetaller
    @fmetaller 5 years ago +1

    Hi, Do you know a DESeq2 alternative in Python?

    • @statquest
      @statquest  5 years ago +1

      Unfortunately I don't know of anything like DESeq2 for python.

    • @fmetaller
      @fmetaller 5 years ago +1

      I'm an undergraduate medical student who wants to get into bioinformatics. I spent the last few months learning Python and reading books like Python Data Science Handbook, Elegant SciPy, and Think Stats. From what I see, it seems to me that I can do everything you showed in Python, but I'd appreciate your opinion.
      In order to build a career as a bioinformatician, would you suggest that I keep investing in Python or switch to R?

    • @statquest
      @statquest  5 years ago +1

      This is a great question! If you really want to do bioinformatics, and specifically genomic bioinformatics, then you'll want to have access to the Bioconductor tools - those are all in R. If you want to do more machine learning stuff, Python is probably a better fit. The good news, however, is that once you learn one programming language, learning another isn't that bad. I use both languages pretty frequently.

  • @sunnetinternationalbusines9910

    So if log takes away our differential counts, how do we know which genes are differential between two different samples? We developmental scientists always like to see which gene is uniquely responsible for one character and how to confirm it by tracing it in the laboratory with knockouts and knock-ins. It's like DESeq2 defeats that. And I have been using it for my data analysis for some time.

    • @statquest
      @statquest  1 year ago

      What time point, minutes and seconds, are you asking about?

    • @sunnetinternationalbusines9910
      @sunnetinternationalbusines9910 1 year ago

      @@statquest From 4:19. What I mean is that sometimes these differences in library composition are what we are actually looking for. For instance, if we wish to identify unique transcription factors in a tissue type, we look for the differences in library composition between the two tissue types. If DESeq2 adjusts for these differences by silencing them, how do we know which receptors, TFs or chemokines are uniquely expressed at a particular time or in a particular tissue type? Thanks man, you are the best.

    • @statquest
      @statquest  1 year ago

      @@sunnetinternationalbusines9910 DESeq2 doesn't "silence" those regions - it simply does not use them when adjusting for differences in library composition and depth. Those genes remain in the dataset, and are normalized just like all the others, but are not part of the pool of genes used to calculate the normalization factor.

  • @garyhokawai
    @garyhokawai 7 years ago

    Averages calculated with logs are called "geometric averages"? I suppose the geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn. Then in step2, I guess you were just calculating the arithmetic mean of the read counts with logs of each gene across all the samples.

    • @garyhokawai
      @garyhokawai 7 years ago

      I see. Seems that definitions in programming are not always the same as in mathematics. I see the formula in DESeq2 paper, it's mathematics. However, in practice, it's not. Still need to learn~

  • @Zonno5
    @Zonno5 2 years ago

    I thought the geometric average was defined as the nth root of the product of all the samples, not the average of the log of all samples. It could be a roundabout way to do the same thing; I haven't checked.

    • @statquest
      @statquest  2 years ago

      There is an error at 9:28: I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from the log of each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
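The question raised in this thread, whether averaging logs is a roundabout way of computing the nth-root geometric mean, can be checked numerically. A small sketch; the function names and the example read counts are chosen here for illustration.

```python
import math

def geometric_mean_root(values):
    # Textbook definition: n-th root of the product of n numbers.
    return math.prod(values) ** (1 / len(values))

def geometric_mean_logs(values):
    # Same quantity via logs: exp of the arithmetic mean of the logs.
    return math.exp(sum(math.log(v) for v in values) / len(values))

reads = [33, 55, 200]  # made-up read counts for one gene in three samples
print(geometric_mean_root(reads), geometric_mean_logs(reads))
assert math.isclose(geometric_mean_root(reads), geometric_mean_logs(reads))
```

The two agree up to floating-point rounding, so the average of the logs is the log of the geometric mean.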

  • @michelepierotti2833
    @michelepierotti2833 4 years ago +2

    Average of logs is not the same as log of averages! Around 9:19 you are saying log(reads for geneX) - log(average for geneX) = log of the ratio, correctly. But what you calculated in step 2 is not the log(average for gene X) but the average of the log(reads). If a, b, c were the read counts for the 3 samples for say GENE3, the average you calculated in the example step 2 is (loga +logb + logc)/3. This, in your example is Average of log reads. But when you go on to discuss the logratio you are treating it as the log(average), the log of [ (a+b+c)/3], i.e. the log of the average. These 2 quantities are not the same thing obviously, So either you are wrong in the example at step 2 or you are wrong later when you treat it as a log(average) while you had calculated the average of logs. Could you help clarify and ideally correct the example in the video?

    • @statquest
      @statquest  4 years ago

      You are correct. This error had been noted before in the video's description, and now I have made a pinned comment so that it is easier to see. Sorry for the confusion.

    • @michelepierotti2833
      @michelepierotti2833 4 years ago +1

      @@statquest Thanks for clarifying and doing it so fast.

    • @michelepierotti2833
      @michelepierotti2833 4 years ago

      @@statquest "log(reads for gene X) - average(log values for gene X)." Then the interpretation in the box is false and we should ignore that, too, right? You have no difference of logs, so no log of ratio, so it is not true that "we are really checking out the ratios of the reads in each sample to the average across samples".

    • @michelepierotti2833
      @michelepierotti2833 4 years ago

      So how do we move from the corrected expression, "log(reads for gene X) - average(log values for gene X)", to the next step where we are working with "log(ratio reads_for_gene_X / average_reads_for_gene_X)"? What am I missing?

    • @statquest
      @statquest  4 years ago

      @@michelepierotti2833 You don't do that next step. We don't have a ratio, we just have a difference, or a "residual", from the geometric mean.
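One way to connect the corrected expression with the pinned comment's remark about the geometric average is to write the subtraction out. The notation is an assumption introduced here: $x_{ij}$ is the read count for gene $i$ in sample $j$, $n$ is the number of samples, and $g_i$ is the geometric mean of gene $i$ across samples.

```latex
\log x_{ij} - \frac{1}{n}\sum_{k=1}^{n} \log x_{ik}
  = \log x_{ij} - \log\Bigl(\prod_{k=1}^{n} x_{ik}\Bigr)^{1/n}
  = \log \frac{x_{ij}}{g_i},
\qquad
g_i = \Bigl(\prod_{k=1}^{n} x_{ik}\Bigr)^{1/n}
```

Read this way, the difference is a residual from the average log, as the reply says, and that residual also equals the log of a ratio whose denominator is the geometric mean rather than the arithmetic mean.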

  • @taotaotan5671
    @taotaotan5671 4 years ago +1

    Wait... Michael Love is your colleague right...

    • @statquest
      @statquest  4 years ago +1

      Yes, we are pals. However, I left UNC a few months ago to do StatQuest full time.

    • @taotaotan5671
      @taotaotan5671 4 years ago +1

      StatQuest with Josh Starmer You guys are wonderful! Michael is very active and helpful in Bioconductor forums. Thank you guys for great video and software.

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 5 years ago +1

    I come here only for the intro song

  • @ALEJANDRARODRIGUEZ-nj5uu

    Can this DESeq2 normalization be used if the data is from groups of experiments run in different labs or sequencing platforms?

    • @statquest
      @statquest  3 years ago

      Yes, however, you may need to also compensate for batch effects.