StatQuest: edgeR and DESeq2, part 2 - Independent Filtering

  • Added 4 Jul 2024
  • This explains edgeR and DESeq2's different approaches to filtering out genes with low read counts. The code mentioned is at statquest.org/statquest-filte...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    CZcams Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    #statquest #deseq2 #edgeR

Comments • 52

  • @statquest
    @statquest  3 years ago +7

    NOTE: This video describes edgeR's method for filtering counts prior to version 3.28.1. The newer versions do things differently. However, since there are tons of publications and datasets based on the earlier versions, I still think this video is helpful to understand what is going on.
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @thomasmatthew7759
    @thomasmatthew7759 5 years ago +16

    Do you have any more videos about DESeq2? I'm trying to understand the role of the negative binomial distribution in their tests (independent Bernoulli trials), but I'm having trouble understanding what each test is supposed to represent. Is each test just the difference in gene expression between conditions?

  • @gcbicca
    @gcbicca 1 year ago +1

    The website is down! Thanks so much for this lecture!

  • @reytns1
    @reytns1 6 years ago

    Hi Joshua, great explanation of the differences between edgeR and DESeq2. I am a biologist with a moderate statistical background and I am new to RNA-seq analysis. Currently I am working on plants (table grape, diploid). Could you tell me what your RNA-seq pipeline is? Did you do all of the analysis in R, or did you do some trimming in other Linux-based software? In your view, what is the best way to start an RNA-seq analysis?
    Thanks, I really appreciate that you share your knowledge.

  • @charade76
    @charade76 6 years ago +1

    2:47 - Why do you choose two different yet overlapping distributions in this simulation? What would be the effect if you just pick two distributions that do not overlap?

  • @taotaotan5671
    @taotaotan5671 4 years ago +6

    Hi, Josh. I was wondering if this filtering method is a little p-hacking... Say, in my initial analysis, I randomly choose a cutoff (quantile/ CPM) but didn't find the gene of my interest. Then I change the cutoff, and this results in more hits, which include the gene of my interest. Then I report that gene, which wasn't significant at the very beginning.

    • @statquest
      @statquest  4 years ago +3

      Excellent question. You're not the first person to suggest that this is a little "p-hackish" and it would be worth talking with Dr. Love about it. Maybe he did a ton of simulations and showed that it doesn't increase the false positive rate (and if not, then no worries!)

    • @taotaotan5671
      @taotaotan5671 4 years ago +1

      @@statquest Thanks for the clarification. I might post this question on the Bioconductor forum :)

  • @rheasingh6825
    @rheasingh6825 7 months ago +1

    Hi Josh! Still wondering if you may ever follow up with the next parts: how you calculate the average log2(FoldChange), and how the negative binomial model is used. There seems to be no great info available online on how these next steps are done, and any follow-ups to these tutorials would be so helpful! Or, if you don't have the time for that, could you share any links to resources that may help in understanding the next steps?

    • @statquest
      @statquest  7 months ago +1

      It's unlikely that I'll ever follow up on this because I haven't done this sort of work in a long, long time and relatively few are interested in it. Oh well.

  • @iot3136
    @iot3136 3 years ago +1

    Hi Josh, could you please create a YouTube lesson/guide on R for Bioconductor SummarizedExperiment objects, especially explaining S4 SE data analysis? It would be really appreciated.

  • @lihuajiang1771
    @lihuajiang1771 4 years ago +1

    For the R code, could you let us know where we can download the dataset? Thank you.

    • @statquest
      @statquest  4 years ago

      Unfortunately I've moved on from that job and no longer have access to that data. However, you can find plenty of RNA-seq datasets on GEO: www.ncbi.nlm.nih.gov/geo/

  • @emojiman745
    @emojiman745 2 years ago +4

    Does part 3 exist? or part 4?

    • @statquest
      @statquest  2 years ago +5

      Unfortunately no. I originally wanted to go through all of DESeq2, but not many people are interested in this topic. :(

    • @emojiman745
      @emojiman745 2 years ago +11

      @@statquest You are wrong. There are so many of us out there :(

    • @chusty93
      @chusty93 1 year ago +1

      @@statquest I am!

    • @user-rf9ow8ck2l
      @user-rf9ow8ck2l 3 months ago

      I'm interested!@@statquest

  • @kakusniper
    @kakusniper 6 years ago

    Does the order matter, i.e. normalizing first and then filtering, as shown in your code? In some examples I have seen filtering on at least one count per million (CPM) in at least n samples done before normalization.
    Also, why was filtering done differently in the code (using the second-largest CPM value for each gene) rather than with rowSums(cpm(dataset)>1) >= n, which is usually done before normalization?
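    Editor's sketch: the two CPM filters mentioned in this comment can be compared side by side. The Python below is only an illustration under stated assumptions, not the code from the video: the helper names (cpm, keep_by_count, keep_by_second_largest) and the toy counts are hypothetical, and CPM is computed from raw library sizes with no normalization factors. It shows that requiring the second-largest CPM of a gene to exceed 1 selects the same genes as requiring CPM > 1 in at least n = 2 samples.

```python
def cpm(counts):
    """Convert raw counts to counts per million.

    counts: dict mapping gene name -> list of raw read counts (one per sample).
    Library size is the plain column sum (assumption: no normalization factors).
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [1e6 * c / size for c, size in zip(row, lib_sizes)]
            for gene, row in counts.items()}


def keep_by_count(cpm_table, min_cpm=1.0, n=2):
    """Keep genes whose CPM exceeds min_cpm in at least n samples."""
    return {gene for gene, row in cpm_table.items()
            if sum(value > min_cpm for value in row) >= n}


def keep_by_second_largest(cpm_table, min_cpm=1.0):
    """Keep genes whose second-largest CPM exceeds min_cpm."""
    return {gene for gene, row in cpm_table.items()
            if sorted(row, reverse=True)[1] > min_cpm}


# Toy data: every sample has a library size of exactly 1,000,000 reads.
toy_counts = {
    "geneA": [500000, 400000, 450000],
    "geneB": [1, 0, 0],
    "geneC": [499996, 599995, 550000],
    "geneD": [3, 5, 0],
}
toy_cpm = cpm(toy_counts)
print(sorted(keep_by_count(toy_cpm, n=2)))
print(sorted(keep_by_second_largest(toy_cpm)))
```

    With n = 2 the two rules agree by construction; they differ only when n is set to something other than 2.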

    • @kakusniper
      @kakusniper 6 years ago

      Comments explaining filtering steps would have been great :). Thanks.

  • @user-rf9ow8ck2l
    @user-rf9ow8ck2l 3 months ago

    I love this video, incredibly helpful. My only criticism is, what is meant by "bogus tests"? Are these tests with samples drawn from the same distribution?

  • @motasimmasood3248
    @motasimmasood3248 1 year ago +1

    Amazing!!! One question: what happens next? You have normalized read counts and you got rid of the outliers, so now how do you calculate the average log2(FoldChange) and lfcSE, i.e., the error?

    • @statquest
      @statquest  1 year ago

      I wish I had more time to follow up on these videos. That was the original plan, but there wasn't enough interest in them.

    • @motasimmasood3248
      @motasimmasood3248 1 year ago +6

      @@statquest Not enough interest??? My friend, I'm afraid you are deeply mistaken. There is simply no explanation of this available anywhere else for ordinary folks... I've had this question asked at least 1000 times and I'm not even a bioinformatician. If you ever get some time, you have no idea how huge a favour you would be doing for humanity, because people end up spending thousands and thousands of £ on very bad experiments which end up in the bin, with time and talent wasted, only because they can neither plan the experiments properly nor interpret the results for lack of this crucial knowledge.

  • @Ankhelz
    @Ankhelz 2 years ago

    I have a brief and quite obvious question. Is this p-value adjustment (FDR) done after a simple t-test is performed between the two treatments for each gene (across the whole data frame)?
    Thanks!
    Great video, I love your content. (Wish I could buy a shirt, but there's no shipping to Argentina.) :)

    • @statquest
      @statquest  2 years ago +1

      The FDR adjustment is done after we calculate every single p-value from the tests. For details, see: czcams.com/video/K8LQSvtjcEo/video.html
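      Editor's sketch: to make the ordering concrete, here is a minimal Python version of the Benjamini-Hochberg adjustment, applied only after every per-gene p-value is in hand. The function name and the example p-values are hypothetical, and this is an illustration rather than the code that edgeR or DESeq2 run internally.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Step 1 (not shown): run one test per gene and collect all the p-values.
raw_p = [0.01, 0.04, 0.03, 0.005]
# Step 2: adjust the whole set at once.
print(benjamini_hochberg(raw_p))
```

      The adjustment needs the full list because each adjusted value depends on the rank of its p-value among all the tests.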

    • @Ankhelz
      @Ankhelz 2 years ago

      @@statquest What test is performed? A t-test?

    • @statquest
      @statquest  2 years ago +1

      @@Ankhelz I can't remember exactly, but I believe it is a likelihood ratio test based on the negative binomial distribution. However, you can just think of it as a fancy t-test.

  • @manuelsokolov
    @manuelsokolov 1 year ago

    Can DESeq be used when the data is in the TPM format?

  • @tinacole1450
    @tinacole1450 2 years ago

    Do you know anything about awk commands? I am trying to get the frequency of every line of DNA input in a txt file. For example, if the pattern GCGCTTAATA appears in a list 10 times, I need the output to report 10 as its frequency. I am using AWK.

    • @tinacole1450
      @tinacole1450 2 years ago

      Actually, I misstated my question: I am matching FASTA files against a txt doc of specific motifs. Pretty sure I will be using a for loop. Let me do some digging. It's simple and I have done it before, but I forgot the code. Also, I executed it in R.

    • @statquest
      @statquest  2 years ago

      Unfortunately I don't know AWK off the top of my head. It's something I have to re-learn every time I use it.

    • @tinacole1450
      @tinacole1450 2 years ago +2

      @@statquest it was a grep command
      grep -Ff file2 file1 >output.txt
      So, gave up on awk but I am sure it can be useful in a for loop
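      Editor's sketch: for readers who prefer a script, the two tasks discussed in this sub-thread (keeping lines that contain any fixed motif, as `grep -Ff` does, and counting how often each line occurs) can be approximated in Python. The function names and the toy sequences are hypothetical. One known difference is labeled in the code: empty pattern lines are skipped here, whereas grep treats an empty pattern as matching every line.

```python
from collections import Counter


def lines_matching_any(lines, patterns):
    """Keep lines containing at least one fixed-string pattern.

    Roughly what `grep -Ff patterns_file input_file` does, except that
    empty patterns are ignored (assumption: that is what is wanted).
    """
    patterns = [p for p in patterns if p]
    return [line for line in lines if any(p in line for p in patterns)]


def line_frequencies(lines):
    """Count how many times each non-empty line occurs."""
    return Counter(line.strip() for line in lines if line.strip())


sequences = ["GCGCTTAATA", "ATATATAT", "GCGCTTAATA", "TTGCGCTTAATACC"]
motifs = ["GCGCTTAATA"]

print(lines_matching_any(sequences, motifs))
print(line_frequencies(sequences)["GCGCTTAATA"])
```

      To use this on real files, read each file into a list of lines first, for example with `open(path).read().splitlines()`.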

    • @tinacole1450
      @tinacole1450 2 years ago +1

      So...figured it out

    • @statquest
      @statquest  2 years ago

      @@tinacole1450 bam!

  • @sebastiangerety8726
    @sebastiangerety8726 4 years ago

    Could you explain this statement? 19:55 "If none of the raw values goes above the threshold, then no filtering is done." I thought it would be the opposite, if none of the CPMs for any of the genes is BELOW the threshold, then no genes are filtered out. Thanks!

    • @statquest
      @statquest  Před 4 lety +4

      I think the problem is that "threshold" and "cutoff" refer to two different things and you might be confusing one for the other. The "threshold" refers to the tangent line at the top of the curve (minus the standard deviation), and the "cutoff" refers to the first datapoint that touches the threshold. In other words...
      1) We fit a curve to the data.
      2) We calculate the standard deviation between each data point and the curve.
      3) We find the y-axis coordinate for the peak of the curve.
      4) We subtract the standard deviation from this y-axis coordinate and that is the "threshold".
      5) Going from left to right (from low quantiles to high quantiles), the first datapoint that touches the threshold becomes the cutoff. So, if the first datapoint to touch the threshold corresponds to 0.1, then the 10% of the genes to the left are filtered out.
      Does that make sense?
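      Editor's sketch: the five steps above can be written out in Python. This is a hedged illustration, not DESeq2's actual code. A simple moving average stands in for the fitted curve, the "standard deviation" is taken as the root-mean-square distance between the points and that curve, and the function names and toy numbers are hypothetical. The y-axis values are assumed to be the number of significant genes obtained at each candidate quantile.

```python
import math


def moving_average(values, window=3):
    """Crude smoother used here in place of the real curve fit (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def choose_cutoff(quantiles, rejections, window=3):
    """Return (cutoff quantile, threshold) following steps 1 to 5."""
    # 1) Fit a curve to the data.
    fit = moving_average(rejections, window)
    # 2) Spread of the data points around the curve (RMS of the residuals).
    spread = math.sqrt(
        sum((r - f) ** 2 for r, f in zip(rejections, fit)) / len(fit)
    )
    # 3) and 4) Peak of the curve minus the spread gives the threshold.
    threshold = max(fit) - spread
    # 5) Scan from low to high quantiles; the first point above the
    #    threshold sets the cutoff.
    for q, r in zip(quantiles, rejections):
        if r > threshold:
            return q, threshold
    # No raw value rises above the threshold: no filtering is done.
    return quantiles[0], threshold


quantiles = [0.0, 0.1, 0.2, 0.3, 0.4]
rejections = [10, 14, 20, 19, 12]  # significant genes at each quantile
cutoff, threshold = choose_cutoff(quantiles, rejections)
print(cutoff, round(threshold, 2))
```

      The fall-through at the end is the case quoted from 19:55: if no raw value goes above the threshold, the lowest quantile is returned and nothing is filtered out.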

  • @fahdqadir6212
    @fahdqadir6212 1 year ago

    staaaaaaatqueeeeaaaaaaaast...... Gotta be honest, at this point I just come back to these videos for the first minute. Also, I came here to say I always pronounced it HOCHH-berg, whereas you pronounce it HOOOCH-berg. Is that correct or an inside joke? Just curious. Great job otherwise; your videos got me through grad school.

    • @statquest
      @statquest  1 year ago

      Glad you like the tunes! I have no idea how it's pronounced. Someone once told me, but I have since forgotten. :(

  • @blackV199
    @blackV199 2 years ago

    Can anyone answer this: what happens if I don't use independent filtering? How does this affect my results?

    • @statquest
      @statquest  2 years ago

      My guess is that you'll get fewer significant results.

    • @blackV199
      @blackV199 2 years ago +2

      @@statquest you're a legend, seriously.
