Your video was the most intelligible on the topic. Thanks for making this public!
I appreciate that!
Fantastic video. Very informative for a diffexp newbie! Gives me enough info to understand the other stuff online!
I’m starting RNA-seq analysis and these tutorials are very helpful thank you so much.
A really good explanation about statistical testing for differential expression
I present my dissertation proposal TOMORROW, and this helped me so much!
this video is just awesome. you explained it just perfectly
Very helpful!
I have started with RNA seq analysis in reference to this tutorial. Thanks
Thank you very much. This video helped a lot.
A really useful explanation! This video is like the Rosetta of the DESeq2 vignette.
Glad it was helpful!
I'm sure it was no pun intended but when she said "Blue Gene" at 6:28, I loled!
Really nice course with a clear explanation. Great for beginners. Thank you.
Glad you enjoyed it!
This was super helpful, thanks a lot!!!
You're welcome!
The video sound is pretty good, beyond my imagination
Thank you for your tutorial.
I would like to suggest a correction to one of your slides. At timepoint 8:00 you show the paper "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis" by Marie-Agnès Dillies et al. (2012) and give the following as a key point:
"FPKM and TC are ineffective and should be definitely abandoned in the context of differential analysis".
I have read the paper and it mentions the following as a key point (I quote):
"The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis."
Therefore, I think that FPKM should be replaced with RPKM on the given slide.
Thank you again for your tutorial.
Thank you for raising a good point! FPKM stands for "fragments per kilobase of transcript per million mapped reads" and is used for paired-end reads, whereas RPKM ("reads per kilobase of transcript per million mapped reads") is for single-end reads. Nowadays paired-end reads are typically used in RNA-seq, so we used that term, but of course the concept is exactly the same. RNA-seq blog has a nice post about these terms: www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
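To make the "same concept" point concrete, here is a minimal sketch of the calculation, assuming the usual definition (count divided by gene length in kilobases and by library size in millions). The function name and the numbers are made up for illustration; the only difference between RPKM and FPKM is whether the count is of reads or of fragments (read pairs).

```python
def per_kilobase_million(count, gene_length_bp, total_mapped):
    """RPKM if `count` and `total_mapped` are reads, FPKM if they are fragments.

    count          -- reads (or fragments) mapped to the gene
    gene_length_bp -- gene/transcript length in base pairs
    total_mapped   -- total mapped reads (or fragments) in the sample
    """
    # Dividing by (length / 1e3) and (total / 1e6) is the same as multiplying by 1e9.
    return count * 1e9 / (gene_length_bp * total_mapped)


# Hypothetical example: 500 reads on a 2,000 bp gene, 10 million mapped reads.
rpkm = per_kilobase_million(500, 2_000, 10_000_000)
print(rpkm)  # 25.0
```

Note that this only scales by gene length and library size, which is the kind of normalization the Dillies et al. quote above says should not be used for differential analysis.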
I think the section on log fold change could be clarified by describing in more depth what the log fold change is.
We have added some clarifications on (log) fold change and other statistical testing related terms here: chipster.rahtiapp.fi/manual/statistical-terms-explained.html
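As a quick illustration of the basic definition, here is a small sketch assuming the plain log2 ratio of mean normalized counts between two groups. The function name and the values are hypothetical, and DESeq2 itself estimates the log2 fold change from its statistical model (with shrinkage), so its numbers will not match this simple ratio exactly.

```python
import math


def log2_fold_change(mean_treated, mean_control):
    """Plain log2 ratio of two mean expression values (illustrative only)."""
    return math.log2(mean_treated / mean_control)


# Hypothetical mean normalized counts for one gene.
print(log2_fold_change(400, 100))  # 2.0  (4-fold up in treated)
print(log2_fold_change(100, 400))  # -2.0 (4-fold down in treated)

# A 5-fold difference corresponds to a log2 fold change of about 2.32.
print(round(math.log2(5), 2))  # 2.32
```

The log scale makes up- and down-regulation symmetric around 0: a 4-fold increase is +2 and a 4-fold decrease is -2.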
Can you please discuss how limma+voom works in detail?
Thank you very much. This tutorial is really helpful. I have a question:
In the dispersion plot, we see that the black dots show the expression level of genes. What does it mean if a significant number of black dots are scattered instead of lying close to the blue area?
Good question! The dispersion plot by DESeq2 (explained here: czcams.com/video/5tGCBW3_0IA/video.html) shows the genes as dots, where the x-axis represents the expression level (= mean of counts) and the y-axis the variability of the gene's expression (= dispersion) across the samples. Black dots are the genes before the shrinkage, and blue dots are the same genes after the shrinkage (plus the outliers above the cloud). So if the black dots are more scattered, it means that the level of variability differs more from gene to gene: some genes' counts "agree" more across samples, whereas others vary more. Things like a smaller number of samples or heterogeneity of the samples might be reflected as more variability.
Thanks for the explanation.
My Questions are
1. What padj value should be used to determine if the identifier should be considered as a result for DESeq2 differential testing?
Will it still be ?
2. What is considered to be a good cutoff for log2FoldChange to deduce that the identifier is differential to a group?
Good questions Ramani!
1. As the adjusted p-value is an FDR (false discovery rate), you need to decide what proportion of the differentially expressed genes in your list you tolerate to be false positives. A typical threshold is 0.1, which means that about 10% of the reported DE genes are expected not to be truly differentially expressed. In other words, if you get 500 DE genes, about 50 of them might be false.
2. In my opinion it is difficult to decide on a biologically meaningful cutoff for the log2FoldChange, as even small expression changes in some genes can be important. Remember too that DESeq2 "shrinks" the fold changes of low-count genes towards 0 in order to avoid false positives.
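To illustrate point 1, here is a minimal sketch of how adjusted p-values of this kind can be computed and thresholded, assuming the Benjamini-Hochberg procedure (the default adjustment method in DESeq2). The function name and the p-values are made up for illustration, and DESeq2 additionally applies independent filtering, which this sketch leaves out.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(pvals)

fdr_threshold = 0.1
n_significant = sum(p <= fdr_threshold for p in padj)
print(n_significant)  # 4

# With 500 genes called DE at FDR 0.1, roughly this many are expected to be false.
expected_false = fdr_threshold * 500
```

The threshold controls the expected share of false positives among the genes you report, so a stricter value such as 0.05 trades a shorter list for fewer expected false calls.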
How to calculate 5 fold change difference??
Hi, could you maybe clarify your question? What do you mean by "5 fold change difference"?
We have added some clarifications on (log) fold change and other statistical testing related terms here: chipster.rahtiapp.fi/manual/statistical-terms-explained.html