StatQuest: edgeR, part 1, Library Normalization
- Added 2 Apr 2017
- edgeR, like DESeq2, is a complicated program used to identify differentially expressed genes. Here I clearly explain how it normalizes libraries.
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - statquest.gumroad.com/l/wvtmc
Paperback - www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - www.amazon.com/dp/B09ZG79HXC
Patreon: / statquest
...or...
CZcams Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshirt.com/statques...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
#statquest #rnaseq #edger
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
By the way, you did a great job of explaining statistical analysis for dummies in a very nice way!!
Thank you! :)
Wow, this is an amazing method, and so is your explanation.
This is so helpful! Thank you! Keep up the good work!
Thanks for the in-depth explanation
bam!
Dear Joshua, I have already seen your video. It is really interesting and helpful for people who are new to the RNA-seq world. I have a question related to normalization: is there any relation between edgeR and the hypergeometric distribution?
This is really great! I'm a little bit confused, don't people use some conserved genes that have a relative steady expression level as references to normalize their data?
This is amazing thanks
Hooray! :)
Learning R and differential analysis for ChIP-seq differential analysis (DiffBind), THANKS!!!
bam! :)
Thank you!
:)
The reference sample could be one of the treatments or one of the controls in an RNA-seq experiment, is that correct? Thank you for your great explanation.
Yes.
This is an explanation of the process executed in TMM normalization, as made clear at 10:37. I'm just saying this in case anyone has come to this video, as I have, looking for an explanation of TMM normalization.
Yep.
This is great, thank you. I don't understand how you calculated the weighted average at 12:28. Is it just the average of the log ratios?
I'll be honest, I made this video a while ago and haven't thought about it much since, so I can't give you any more details about how edgeR works.
The weights are the inverse of the approximate asymptotic variances (calculated using the delta method).
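For anyone who wants to see what inverse-variance weights look like in practice, here is a minimal Python sketch. It assumes the variance approximation from the published TMM method (Robinson and Oshlack, 2010), where a gene with count Y in a library of size N contributes (N - Y) / (N * Y) to the variance of the log ratio. The function name and the example counts are made up for illustration, and edgeR's own code may differ in details.

```python
def tmm_weight(count_sample, lib_sample, count_ref, lib_ref):
    """Inverse of the approximate (delta-method) variance of a gene's log ratio.

    Assumed formula: variance = (N_s - Y_s) / (N_s * Y_s) + (N_r - Y_r) / (N_r * Y_r)
    """
    variance = ((lib_sample - count_sample) / (lib_sample * count_sample)
                + (lib_ref - count_ref) / (lib_ref * count_ref))
    return 1.0 / variance


# Hypothetical genes in two libraries of one million reads each.
high = tmm_weight(1000, 1_000_000, 1200, 1_000_000)  # well-covered gene
low = tmm_weight(10, 1_000_000, 12, 1_000_000)       # sparsely covered gene

print(high > low)  # genes with more reads get larger weights
```

The practical effect is that log ratios from genes with many reads, which are less noisy, count for more in the weighted average.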
Love you!!
Thanks!
Just wondering, comparing edgeR to DESeq2, which one makes more sense for single-cell RNA-seq normalization?
So if my data has a large number of zero-value genes, DESeq2 is preferable? BTW, I would usually use ERCC spike-ins for the size factor calculation and apply it to the endogenous genes.
Is there any reason for edgeR to use the 75th percentile (the upper quartile) instead of the median to pick the reference sample?
Very nice video to understand edgeR.
I think the point is to just exclude outliers with excessive read counts.
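For context on the question above, here is a rough Python sketch of picking a reference sample by upper quartile. It assumes the procedure is: scale each sample by its total read count, take the 75th percentile of the scaled counts, and choose the sample closest to the average of those percentiles. The helper names and toy counts are hypothetical, and edgeR's actual quantile handling may differ.

```python
def upper_quartile(values):
    """75th percentile, using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def pick_reference(counts_by_sample):
    """Return the sample whose scaled upper quartile is closest to the mean."""
    quartiles = {}
    for name, counts in counts_by_sample.items():
        total = sum(counts)
        scaled = [c / total for c in counts]  # scale by library size
        quartiles[name] = upper_quartile(scaled)
    mean_q = sum(quartiles.values()) / len(quartiles)
    return min(quartiles, key=lambda name: abs(quartiles[name] - mean_q))


# Hypothetical toy counts: 5 genes in 3 samples.
samples = {
    "s1": [10, 20, 30, 40, 100],
    "s2": [5, 10, 15, 20, 150],
    "s3": [8, 16, 24, 32, 120],
}
reference = pick_reference(samples)
print(reference)
```

Swapping 0.75 for 0.5 in `upper_quartile` would give the median-based alternative the question asks about, which makes it easy to compare the two choices on your own data.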
Ty
Do you have suggestions on whether someone should use edgeR or DESeq2 for 16S analysis of soil communities?
To be honest, they are about the same. However, I know Mike Love is still adding tons of new visualizations to DESeq2, so that might be my favorite.
I'm trying to think of a reason why I shouldn't just compare the case-control distributions using a KS test p-value (y-axis cutoff 0.05) against the difference in normalized means (x-axis cutoff +/- 50 TPM). We want to know if they come from the same distribution and don't want to flag tiny TPM changes.
Unfortunately it's been way too long since I made this video or did any kind of bioinformatics work to give you a reasonable answer. However, my rough memory is that these methods (edgeR and DESeq2) gain power by pooling genes to estimate variation, and then gain more power by using a parametric test based on the negative binomial distribution. I think if you just went with a straight KS test, you wouldn't have any power.
edgeR just seems far more complicated than DESeq2. Is there any advantage edgeR has over DESeq2, apart from the artistic signature you mentioned towards the end? :P
Not that I know of. I used to use edgeR, but switched to DESeq2 with no regrets.
Hello! Is it possible to make an association between an environmental variable and bacterial abundance? Sorry for my English!
I have no idea. Maybe someone else can help.
So you mean that edgeR needs weighted trimmed mean normalization, but DESeq2 does not?
DESeq2 has its own normalization that is similar, but a little different. Here's the link to my StatQuest that describes the method: czcams.com/video/UFB993xufUU/video.html
How are the weights calculated for the weighted log2 ratios in this library?
What time point in the video, minutes and seconds, are you asking about?
12:20. How are the assigned weights calculated?
@@suryakantastat0275 I believe edgeR uses the number of reads per gene in each sample to calculate the weighted average of the log values. For example, if we had two genes, Gene A with 100 reads and log2() = 0.05, and Gene B with 50 reads and log2() = 0.1, then the weighted average would be ((100*0.05) + (50*0.1))/(100 + 50) = 0.067. For more details on how to calculate a weighted average, see en.wikipedia.org/wiki/Weighted_arithmetic_mean
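The arithmetic in the reply above can be checked with a few lines of Python. This is only a sketch of a generic weighted average using read counts as the weights, as in the reply; another comment in this thread describes the weights as inverse variances instead. The function and variable names are made up for illustration.

```python
def weighted_average(values, weights):
    """Weighted arithmetic mean: sum(value * weight) / sum(weight)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


log_ratios = [0.05, 0.1]  # log2 ratios for Gene A and Gene B
read_counts = [100, 50]   # reads for Gene A and Gene B, used as weights

result = weighted_average(log_ratios, read_counts)
print(round(result, 3))  # 0.067
```

Compare this with the unweighted mean of the two log ratios, 0.075: the weighted version is pulled toward Gene A because it has more reads.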
12:31 I like it
:)
It's bananas that the top and bottom 30% of fold changes are discarded. Is the reason that they are prone to being +/- inf? It's tricky that values less than 1 lead to exploding ratios.
Can you tell me what time point you're asking about (minutes and seconds)?
@@statquest 9:47 but it appears they aren't actually dropped from the analysis, just the calculation of the scaling factor, which makes sense
@@LayneSadler Yep, that's correct. We just want the housekeeping genes for the scaling factor.
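To make the point in this exchange concrete, here is a simplified Python sketch of a trimmed scaling factor. It assumes a 30% trim on each end of the sorted log ratios and, to stay short, uses a plain mean instead of the weighted one and skips the separate trim on overall expression level. Genes with a zero count in either sample are left out of the calculation because their log ratios would be infinite. All names and counts are hypothetical.

```python
import math


def trimmed_log_ratio_factor(sample, reference, trim=0.30):
    """Scaling factor from the trimmed mean of per-gene log2 ratios (unweighted)."""
    lib_s, lib_r = sum(sample), sum(reference)
    log_ratios = [
        math.log2((s / lib_s) / (r / lib_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # zero counts would give infinite log ratios
    ]
    log_ratios.sort()
    cut = int(len(log_ratios) * trim)  # number of genes trimmed from each end
    kept = log_ratios[cut:len(log_ratios) - cut]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: the last gene is a strong outlier in `sample`.
reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

factor = trimmed_log_ratio_factor(sample, reference)
print(factor)
print(len(sample))  # all 10 genes are still in the data set
```

The trimming only decides which genes feed into `factor`. The count lists themselves are untouched, so every gene is still available for the downstream analysis.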
I really laughed my ass off at 12:30, thanks for the video.
To my understanding, isn't it weird that it's possible to have a reference sample for a gene where there are 0 reads on that gene? Wouldn't it be possible to take a reference sample for each gene to avoid this issue? I don't see how this makes sense logically, but I might have missed something. Thank you!
What time point, minutes and seconds, are you asking about?
Well, fine, but how do I use edgeR?
To be honest, I found the manual for edgeR relatively easy to follow. It has a lot of examples.
@@statquest Actually, I couldn't find any good workflow tutorial for edgeR on YouTube with coding explanations, etc. If you have time to publish a good video about that, it would be extremely helpful.
@@someone_there I wish I could, but it's been years since I used edgeR. :(
@@statquest oh I see... well, thanks a lot for your answers anyway :)
edgeR seems to make more sense than DESeq2 to me.
Noted
I've always felt that edgeR's approach seems more arbitrary.