DESeq2 Basics Explained | Differential Gene Expression Analysis | Bioinformatics 101

  • Added 31. 05. 2024
  • A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. DESeq2 is one of the most commonly used packages to perform differential gene expression analysis in R. In this video, I have tried to explain the DESeq2 model and provide some intuition on what goes on behind this package and the steps performed to call differentially expressed genes.
    I have tried my best to keep it simple and explain it to the best of my knowledge. Please feel free to leave your comments below. I am happy to hear your thoughts, as well as any links to articles/blogs/papers that you think explain these concepts better! Let's use this space to share resources and learn more!
    Here are some resources that helped me to understand some of these concepts:
    1. bioconductor.org/packages/rele...
    2. genomebiology.biomedcentral.c...
    3. www.biostars.org/p/278684/
    4. www.biostars.org/p/316488/
    5. uclouvain-cbio.github.io/WSBI....
    Chapters
    0:00 Intro
    0:29 A typical study design
    1:32 Features of RNA-Seq counts data
    3:04 Poisson distribution for counts data
    5:14 Why is Poisson not the best model?
    6:58 Negative Binomial is the way to go!
    8:46 DESeq2 steps
    9:32 Biases in counts data
    12:29 Estimate Size Factor (median of ratios method)
    16:37 Estimate Dispersions
    20:00 Generalized Linear Models
    24:21 Hypothesis testing
    Show your support and encouragement by buying me a coffee:
    www.buymeacoffee.com/bioinfor...
    To get in touch:
    Website: bioinformagician.org/
    Github: github.com/kpatel427
    Email: khushbu_p@hotmail.com
    #bioinformagician #bioinformatics #deseq2 #differentialgeneexpressionanalysis #rnaseq #fpkm #rpkm #tpm #normalization #rna #ncbi #genomics #beginners #tutorial #howto #omics #research #biology #ngs

Comments • 99

  • @wesleyeliasbheringbarrios8108

    Based on the videos I've seen from your channel, everything is really great! Everything we bioinformatics beginners need most: theory explained without complication, in a way that is easy to understand, and guiding us step by step through the process with a list of videos in logical order. I just have to thank you for all your commitment and 100% accurate work on these videos, PLEASE continue! I will credit you in several of my presentations, thank you very much!

    • @Bioinformagician
      @Bioinformagician  2 years ago +6

      I really appreciate your kind words, really encourages me to keep doing this :) Thank you very much!

  • @kitdordkhar4964
    @kitdordkhar4964 2 years ago +4

    You are a great teacher! I enjoy watching the detailed theory and analysis. Your tutorial is very helpful. Cheers!

  • @coolalexpcs
    @coolalexpcs 2 months ago +1

    You explain this in a very clear and logical way! Appreciate it

  • @alicekao6305
    @alicekao6305 1 year ago

    Thank you so much! This is very clear. I like your series of videos on the logic behind each bioinformatics package. Coming from a biology background, I think it's extremely important to know the basics of each package and to identify the best tool to use when I get my data.

  • @QAKS1264
    @QAKS1264 2 years ago +1

    Very helpful, clear and accurate explanation. Thank you.

  • @chrisspeed8432
    @chrisspeed8432 1 year ago

    This was incredibly helpful. I plan to watch it again and take detailed notes along the way. Thank You!

  • @kmrsongh
    @kmrsongh 11 months ago

    Really very helpful video tutorial. I appreciate the effort you made in explaining the DESeq2 background statistics. You explain them perfectly and in a very simple manner. very helpful for us. Thanks a lot and keep sharing such informative videos.

  • @IndigoIndustrial
    @IndigoIndustrial 1 year ago +2

    Very impressive. More scientists should be engaging the way you do.

  • @priyankabiotech87
    @priyankabiotech87 2 years ago

    U made it so simplified..loved ur explanation..thank you

  • @user-up7ms2cs7m
    @user-up7ms2cs7m 28 days ago

    Thank you! This was helpful! My study design was complex, as I was looking at 4 different conditions, with one reference level.

  • @abhisheksawalkar1018
    @abhisheksawalkar1018 9 months ago

    Simply excellent. Everything was explained using lucid examples. Very good for beginners.

  • @andydavidson3097
    @andydavidson3097 4 months ago

    Very well done! I have watched lots of videos on DESeq2, and nobody else explains the underlying math!

  • @BarcodeIIIlIIllIlll
    @BarcodeIIIlIIllIlll 1 year ago +1

    I am writing a thesis that is partially reliant on bioinformatics and have no experience with deseq2. This video was immensely helpful in getting me up to speed in general understanding. Thank you very much!

  • @amus21455
    @amus21455 1 year ago +1

    It is superrrrr helpful!!!!!!! This is the best video about DESeq for someone with zero background like me!

  • @preeti97rox
    @preeti97rox 1 year ago

    Thank you for being so helpful to everyone!

  • @chibrina
    @chibrina 8 months ago

    this is amazingly helpful as a beginner, thank you

  • @anvieb1293
    @anvieb1293 1 year ago

    This is such a valuable and informative video, thanks so much!

  • @riaztabassum8395
    @riaztabassum8395 1 year ago

    very detailed and simplest explanation. 👌

  • @grace-426
    @grace-426 19 days ago

    I am so thankful for your tutorials.. can you please make one video on like, how to manage so many genes and how to come to some conclusion after getting so many genes

  • @muhammadhafizsulaiman7163

    Very smart person. Great explanation

  • @humphreygardner6982
    @humphreygardner6982 11 days ago

    Really superb! Thank you!

  • @bobyang8491
    @bobyang8491 2 years ago

    very helpful!! Thanks for teaching!

  • @KTROWS
    @KTROWS 3 months ago

    Amazing job explaining.

  • @aldaszarnauskas27
    @aldaszarnauskas27 1 year ago

    Great video and explanation!!!

  • @devinjones7271
    @devinjones7271 1 year ago

    This is SO helpful! Thank you!!!

  • @joseoviedo4529
    @joseoviedo4529 1 year ago

    hello, I truly appreciate your videos and explanations. They are very clear and concise. I do have a request though for a future video. Could you do a how-to on gene set analysis using a GO class annotation and how to filter the desired genes from the completed DE analysis data frame. Thank you for all you do, Keep it up!

  • @islamalmsarrhad2152
    @islamalmsarrhad2152 6 months ago

    That was epic.. Many thanks

  • @reakal7740
    @reakal7740 2 years ago

    Great video! Congrats!

  • @MahdiAbdul-Jabbar
    @MahdiAbdul-Jabbar 29 days ago

    Awesome video!

  • @aditimehta4886
    @aditimehta4886 2 years ago

    Hey Khushbu, really nice explanation.😊

  • @pooriasalehi5402
    @pooriasalehi5402 9 months ago

    really really thanks ma'am, it's amazing, I owe you.

  • @benjaminbergey5512
    @benjaminbergey5512 2 years ago +7

    One of the clearest explanations of the DESeq pipeline - thank you. Question about using the GLM: Do you know where I might find an example of a calculation for a given gene? I had a bit of difficulty following through the calculations, and I think a concrete example (just with arbitrary data) might help me grasp it better.

    • @Bioinformagician
      @Bioinformagician  2 years ago +1

      I am glad you found this video helpful! Check out this paper: www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/
      It does a fantastic job explaining single and multi factor linear models with calculations.

  • @AA-gl1dr
    @AA-gl1dr 1 year ago

    excellent video.

  • @jamesrauschendorfer9396

    This is super helpful!

  • @tushardhyani3931
    @tushardhyani3931 1 year ago

    Thank you for this video !!

  • @sarahnawaz6925
    @sarahnawaz6925 1 month ago

    Amazing💯

  • @PriyaDas-zw5hn
    @PriyaDas-zw5hn 10 months ago

    Hi Dr. Khushbu,
    Thank you for the very informative videos. I am learning a lot from these. I had a query: if we have a time series of treated and untreated samples, should the pairs of treated and untreated samples at each time point be considered separately for estimating size factors?

  • @amirhosseinshafieian3951

    Really love that, after watching lots of videos on CZcams, finally I understood what's going on by ur video, I only could not understand the MLE part, if it is feasible for u please make a video to elaborate it in more detail.
    Thanks a lot

    • @Bioinformagician
      @Bioinformagician  1 year ago

      I will think about making a separate video explaining MLE. Thanks :)

  • @farihachaudhary577
    @farihachaudhary577 1 year ago +1

    Hi there, I just wanted to ask whether we can use DESeq analysis for unpaired data. I have 11 normal (control) samples and about 160 tumor samples. Or should we go with paired data?

  • @amrsalaheldinabdallahhammo663

    Thanks for that video, You are genius :)

  • @angelamoreira5023
    @angelamoreira5023 2 years ago

    Excellent!!!

  • @harshasatuluri4540
    @harshasatuluri4540 2 years ago

    Very clear!

  • @VenuraHerathPhotography
    @VenuraHerathPhotography 2 years ago +1

    Keep up the good work! Would love to see a tutorial on edgeR time-series differential analysis.

    • @Bioinformagician
      @Bioinformagician  2 years ago

      Will plan a video covering this. Thanks for the suggestion :)

  • @khalildabrat4593
    @khalildabrat4593 2 years ago

    Very helpful!

  • @georgyjogen2859
    @georgyjogen2859 11 months ago +1

    Hi,
    I really like your video. Thank you for the channel once again. It's a blessing.
    I have a small doubt.
    At 11:27 you said that since gene D is not expressed in the treated condition, the total of 42 from untreated needs to be divided among the 3 expressed genes, causing their counts to be inflated. How is that? Could you please explain?
    Thanks in advance

  • @kobrarahimi9164
    @kobrarahimi9164 2 years ago +1

    it was great 100 out of 100.

  • @ghadeeralkurdi174
    @ghadeeralkurdi174 1 year ago +1

    Could I ask what ranges you used for the x and y axes in the mean vs. variance plot at 6:27?

  • @CarlMedriano
    @CarlMedriano 11 months ago

    Thanks for this info. I am just a bit lost, especially when I try the calculation with gene D, which results in a GM of 0 and reference values of 0. Wouldn't the following steps result in 0 (assuming that values divided by 0 are just set to 0)?

  • @leia2636
    @leia2636 2 years ago

    wow that was magical

  • @abdourahamandjibotassiou4367
    @abdourahamandjibotassiou4367 8 months ago +1

    very nice

  • @NguyenThiPhuongLan-in5cd
    @NguyenThiPhuongLan-in5cd 10 months ago

    Hi may I ask if we have n=3 biological replicates/2 groups how can we put in 2 groups? Just calculate mean of read counts for each genes in each group?

  • @emojiman745
    @emojiman745 2 years ago +1

    I may have missed it, but what do we do with the replicates? You mentioned the replicates in the study design segment (00:38), but the calculations you display are about one group. Should we take the mean of the samples and make them into one column? One column for the treated (mean of the b1, b2 and b3 for t1 and b1, b2 and b3 for t2) and one column for untreated (mean of the B1, B2 and B3 for T1 and B1, B2 and B3 for T2)?

    • @Bioinformagician
      @Bioinformagician  2 years ago +1

      Apologies if I wasn't clear in my video, there are ways to handle technical replicates. Check this section out from DESeq2 vignette: bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#collapsing-technical-replicates
      With regards to biological replicates, you should NOT collapse biological replicates.
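
      A minimal sketch of what collapsing technical replicates amounts to, assuming (as the linked vignette section describes for `collapseReplicates`) that the counts of runs belonging to the same sample are summed. The run names, sample names and counts below are hypothetical, and this is not DESeq2's own code. Biological replicates stay as separate columns.

```python
def collapse_technical_replicates(run_counts, run_to_sample):
    """Sum per-gene counts over all runs that belong to the same sample."""
    collapsed = {}
    for run, counts in run_counts.items():
        sample = run_to_sample[run]
        if sample not in collapsed:
            collapsed[sample] = list(counts)
        else:
            collapsed[sample] = [a + b for a, b in zip(collapsed[sample], counts)]
    return collapsed


# Hypothetical data: two sequencing runs of sample "b1", one run of sample "b2".
run_counts = {
    "run1": [5, 0, 12],
    "run2": [7, 1, 10],
    "run3": [20, 3, 8],
}
run_to_sample = {"run1": "b1", "run2": "b1", "run3": "b2"}

collapsed = collapse_technical_replicates(run_counts, run_to_sample)
print(collapsed)
```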

  • @1993dana15
    @1993dana15 2 years ago +1

    crispy clear

  • @patticat
    @patticat 1 month ago

    Is this the video where design factor was explained? I'm coming from another of your videos where you say "if you don't know design factor, look at my previous video" but you never said which one. I think this one was a good candidate, however, I am still very confused as to how to use the design factor.. that was x= 0 or x =1? or what was that when you added two conditions? I'm super lost with the last 2 seconds of explanation there.. if you have another video explaining this, which one is it? Thanks! Everything else is on point!

  • @rays_of_hopes
    @rays_of_hopes 1 year ago

    Thank you so much mam

  • @LongboardTrickfreak
    @LongboardTrickfreak 2 years ago +2

    I might be mistaken, but are you sure the values for calculating the median in step 3 (est. size factors) are correct? When I calculate them with R I get, for instance, 0.45 for the untreated normalization factor. Shouldn't the median be one of the values? Apart from that: great video, helped me a lot!

    • @Bioinformagician
      @Bioinformagician  2 years ago +2

      Thanks for reaching out! I am sure, the median of values 0, 0.45, 0.55, 0.58 is 0.5. I calculated it using R as well.

    • @user-sf1ys2wl4k
      @user-sf1ys2wl4k 1 year ago

      Hey, the calculations in the video are correct. But maybe you were confused because those are medians, not means. In the case of 4 values, you have to take two values in the middle and then the average of them;) So we take 0.45 and 0.55 and get 0.50.
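
      The even-length median rule described above can be checked with a short sketch. It uses only the four ratios quoted in this thread; the variable names are illustrative.

```python
import statistics

ratios = [0, 0.45, 0.55, 0.58]  # ratios quoted in the thread above

# With an even number of values, the median is the mean of the two middle ones.
middle_pair = sorted(ratios)[1:3]
median_by_hand = sum(middle_pair) / 2

size_factor = statistics.median(ratios)
print(middle_pair, median_by_hand, size_factor)
```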

  • @kevinradja
    @kevinradja 2 years ago +1

    Really love your video and is inspiring me to also try making my own videos and test my knowledge. At 14:30 there's an error when you are estimating the size factor. The geometric mean is calculated by the mean of the natural log of the counts (ln because that is what DESeq2 uses). Taking the log turns the Pi symbol in the paper into a sigma of logs. Might also be good to mention that it isn't square root if you have more than two conditions. If I'm wrong though, someone please let me know!

    • @Bioinformagician
      @Bioinformagician  2 years ago

      Thank you for pointing out that error. You are right, DESeq2 uses natural logs, and in general the geometric mean is the nth root of the product of n terms, not always the square root. I should have mentioned it. The two methods are mathematically equivalent and give the same value; any small difference comes only from rounding intermediate steps. For the explanation, I chose the multiplying method because it has fewer steps, which makes it easier to understand and gets the point across :)
      Geometric mean with log method:
      (log(2) + log(10)) / 2
      = 2.995732 / 2
      = exp(1.497866) (taking the antilog)
      = 4.472136
      Geometric mean with multiply method:
      sqrt(2*10)
      = 4.472136
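
      The two routes to the geometric mean can be compared directly. This is a small sketch using the counts 2 and 10 from the calculation above; the function names are illustrative, and the log route assumes all counts are greater than zero.

```python
import math


def geometric_mean_product(values):
    """nth root of the product of the values."""
    return math.prod(values) ** (1 / len(values))


def geometric_mean_logs(values):
    """exp of the mean of natural logs; assumes every value is > 0."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


counts = [2, 10]
gm_product = geometric_mean_product(counts)
gm_logs = geometric_mean_logs(counts)
print(gm_product, gm_logs)
```

      Up to floating-point error the two results agree; the log route is generally preferred in practice because a product over many samples can become very large.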

    • @kevinradja
      @kevinradja 2 years ago +1

      That's a great point and shows why we take the log! With large outliers the averages of logs are less affected than regular averages but doesn't change when the values are close. Also do you plan on making a video on the dispersion in DESeq2 in more detail? There's so much more in the paper I didn't understand at all.

    • @Bioinformagician
      @Bioinformagician  2 years ago

      @@kevinradja I will surely think about making a video on dispersion in more detail :)

  • @you-mingliu3261
    @you-mingliu3261 1 year ago +1

    Great video, but I'm still confused about the dispersion α. For one gene, is α estimated separately in the control group and the treatment group (so there are 2 α values for one gene)?
    Or is there only one α for each gene, which means the mean and the variance are calculated across the control and treatment groups?

    • @Bioinformagician
      @Bioinformagician  1 year ago

      As far as my understanding goes, it is the latter. The mean and variance are calculated across all groups, so there is only one α for each gene.
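
      For context, a small sketch of how a single per-gene dispersion α enters the negative binomial variance, using the relation Var = μ + α·μ² from the DESeq2 model. The mean and dispersion values below are made up for illustration.

```python
def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


mean_count = 100.0

# With dispersion 0 the variance equals the mean, as in a Poisson model.
poisson_like = nb_variance(mean_count, 0.0)

# A positive dispersion adds variance that grows with the square of the mean.
overdispersed = nb_variance(mean_count, 0.1)

print(poisson_like, overdispersed)
```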

  • @alexyang274
    @alexyang274 2 years ago

    question regarding the coefficients for the fitting the linear model - from my understanding, based on this explanation, the linear model can accommodate theoretically infinite number of coefficients. in the vignette for deseq2, michael love mentions that while deseq2 can do this, it is perhaps easier to concatenate multiple factors into a single variable and have deseq2 perform its linear modeling this way. can you explain why this is the case? and how this can extend from a 2-factor design to a n-number design and so forth?

    • @Bioinformagician
      @Bioinformagician  2 years ago

      Can you point me to the section in the vignette where Michael Love talks about concatenating multiple factors into a single variable?

    • @alexyang274
      @alexyang274 2 years ago +1

      @@Bioinformagician in the vignette, the subheading is under "interactions"; copied and pasted from the vignette, love writes:
      Initial note: Many users begin to add interaction terms to the design formula, when in fact a much simpler approach would give all the results tables that are desired. We will explain this approach first, because it is much simpler to perform. If the comparisons of interest are, for example, the effect of a condition for different sets of samples, a simpler approach than adding interaction terms explicitly to the design formula is to perform the following steps:
      combine the factors of interest into a single factor with all combinations of the original factors
      change the design to include just this factor, e.g. ~ group
      Using this design is similar to adding an interaction term, in that it models multiple condition effects which can be easily extracted with results.
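
      A minimal sketch of the first quoted step, combining two factors into a single factor with all combinations. The factor names and levels (genotype, condition) are hypothetical, and the code only builds the labels that a `~ group` design would then use.

```python
# Hypothetical sample annotations, one entry per sample.
genotype = ["WT", "WT", "KO", "KO"]
condition = ["untreated", "treated", "untreated", "treated"]

# One combined label per sample, e.g. "KO_treated".
group = [f"{g}_{c}" for g, c in zip(genotype, condition)]

# The distinct levels of the combined factor.
levels = sorted(set(group))
print(group)
print(levels)
```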

    • @Bioinformagician
      @Bioinformagician  2 years ago

      Thank you for pointing me to this.
      I want to bring in a little context here; without it, this can be misleading.
      I have tried to explain it here: khushbupatel.notion.site/Interaction-terms-DESeq2-5a4a75b83adc4fe89576e6ee9b00daf0
      Hope this clears your confusion and answers your question. Thanks! :)

  • @wansabaiinjapan1586
    @wansabaiinjapan1586 5 months ago

    Excellent explanation. Thank you! I am very new to the field. I have questions about which values we should use to make a heatmap, Venn diagram, etc. At 15:49, we get the median of ratios and normalize our samples with this value to obtain norm_values for each gene in each sample. Before I use these values to plot a heatmap, do I need to transform them again to log2? Or do I need to convert them to z-scores? If yes, how do I get the z-score for each gene in each sample? Sorry for asking so many questions. Thanks in advance!

    • @adaobiokafor9546
      @adaobiokafor9546 4 months ago

      for visualizations, you need to scale (ie. calculate z scores). Just use the scale() function in R.
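
      A sketch of gene-wise z-scores, assuming a matrix with genes as rows and samples as columns. Note that R's `scale()` centers and scales columns, so with genes in rows the matrix is usually transposed first, e.g. `t(scale(t(mat)))`. The values below are hypothetical log-scale expression values.

```python
import statistics


def row_zscores(matrix):
    """Z-score each row (gene) across its columns (samples)."""
    scaled = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample sd (n - 1), same convention as R's sd()
        # A gene with no variation has no defined z-score; use 0.0 for it.
        scaled.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return scaled


# Hypothetical log-scale expression: 2 genes x 4 samples.
log_expr = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 12.0, 12.0],
]
z = row_zscores(log_expr)
print(z)
```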

  • @adrianozaghi9209
    @adrianozaghi9209 1 year ago

    Thank you so much, the paper about this algorithm is complex asf

  • @georgeanthonywalters-marra9628

    Hello, this was an awesome and very informative video! I've been trying to learn more about CRISPR screen analysis (specifically MAGeCK). Are you familiar at all with analysis of CRISPR screens and would you say that the concepts in this video would be transferable? Thank you so much!

  • @shetalkzz8842
    @shetalkzz8842 1 month ago

    can I perform deseq2 in galaxy for finding differentially expressed mirnas

  • @clutch3171
    @clutch3171 3 months ago

    this is secretly genius

  • @saranyasweet
    @saranyasweet 1 year ago

    Mam please do put videos for how to do DGE for raw 16srDNA paired end data in fastq format ?

  • @user-zc9jl2to3h
    @user-zc9jl2to3h 11 months ago

    In 22:53, why do you say that "y - B0 = log(y) - log (B0)" ???? isn't that incorrect?

  • @poojasavla6240
    @poojasavla6240 10 months ago

    bro i love you

  • @justsoil15
    @justsoil15 1 year ago

    I use docker and command line to run deseq2. How to save plots to png files?

  • @jatinderchera1613
    @jatinderchera1613 1 year ago

    Hello mam. Your video is very helpful especially for beginners like me. I have some queries and I would be very grateful if you can help me out. We got RNAseq done from a company and they have provided us with analyzed data. My queries are :
    1. They have provided PCA plot and they have mentioned the following, "DESeq2 generates PCA plot based on a matrix of normalized read counts,the result typically depends only on the few most strongly expressed transcripts because of showing largest absolute differences between control and treated samples." The plot they provided showed very high variance among the biological replicates of one treatment group (due to lower read count in some samples). Is there any way to get around this by considering some other features (apart from read counts) to compute variances ?
    2. They have also provided RPKM values of various genes that are unique to specific treatment groups. I observed some of the genes had 'zero' reads in some of the replicates of the same treatment group. Can we consider these genes for our analyses ?
    3. I also observed completely identical RPKM values for many genes in the list (identical even up to 9 decimal places). What could be the reason for this, and can we proceed with the analyses of such genes?
    Any help from your side would be highly appreciated. 😊

    • @Bioinformagician
      @Bioinformagician  1 year ago

      1. Do you happen to know how low the read counts are among the biological replicates of that one treatment group? You could take a look at pre-alignment and post-alignment QC, especially the total number of reads and the total number of uniquely mapped reads for each sample. Another way to identify noisy/problematic samples is to use a distance matrix to get similarities or dissimilarities across samples.
      2. You could get total counts for genes across all samples and see if these genes with 0 reads have consistently low read counts across other samples as well. We would ideally want to remove genes with fewer than 10 total read counts across all samples. You could be more stringent and set a higher number.
      3. This seems suspicious. I would recommend generating RPKM/TPM values yourself.
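
      The low-count filter from point 2 can be sketched as follows, keeping genes whose total across all samples is at least 10. The gene names and counts are hypothetical, and the threshold is just the one mentioned above.

```python
def filter_low_count_genes(counts, min_total=10):
    """Keep genes whose counts summed over all samples reach min_total."""
    return {gene: c for gene, c in counts.items() if sum(c) >= min_total}


# Hypothetical raw counts: gene -> counts in each of 4 samples.
raw_counts = {
    "geneA": [0, 1, 2, 0],      # total 3, dropped
    "geneB": [3, 4, 2, 1],      # total 10, kept
    "geneC": [120, 98, 0, 143], # a zero in one sample, but a high total, kept
}

kept = filter_low_count_genes(raw_counts)
print(sorted(kept))
```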

    • @jatinderchera1613
      @jatinderchera1613 1 year ago

      Thank you very much for your response mam. I am very new to such data types. I am learning everything from scratch so I will try my best to carry out whatever you suggested.

  • @user-uq7gw5ll5r
    @user-uq7gw5ll5r 2 months ago

    Mam can u help me analyse rna sequence database using deseq2 tool pls

  • @donklike09
    @donklike09 1 year ago

    Awesome! but how is 2/0.5 = 4.016...? isn't it just 4? (16:14) and same with the other numbers from the untreated.

    • @Bioinformagician
      @Bioinformagician  1 year ago +1

      You’re right. The discrepancy is due to rounding off. If you don’t round the numbers, you would get 4.016 instead of 4

  • @snekhai
    @snekhai 2 years ago

    When you normalize counts, and have 0/0 (your sample D), why do you assign 0?

    • @Bioinformagician
      @Bioinformagician  2 years ago +1

      In step 1, to calculate the geometric mean, we take the square root of the product of the counts in all samples. For gene D, the product is 30 x 0 = 0, and the square root of 0 is 0. Hence 0.

    • @pgresner
      @pgresner 1 year ago

      yes, but then, in Step 2, you divide 30/0 (which is infinity) and even 0/0 (which is undefined) - so why you get 0's for untreated/ref and treated/ref? is this some kind of a convention or just a mistake?

    • @Bioinformagician
      @Bioinformagician  1 year ago

      @@pgresner It's a mistake. They should not be shown as 0s: 30/0 is Inf and 0/0 is NaN. I didn't mention a very important point: non-finite values (i.e. Inf, -Inf and NaN) are filtered out and not used to calculate the median. Thank you for pointing it out, I shall put a note about this in the description.
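
      A minimal sketch of the median-of-ratios calculation with that filtering step, working on the log scale. The counts are hypothetical (only gene D's 30 and 0, and the 2 and 10 for one gene, come from this thread), and this illustrates the idea rather than reproducing DESeq2's own code. Genes with a zero count in any sample have an undefined log geometric mean and are left out of the median.

```python
import math
import statistics


def size_factors(counts_by_sample):
    """Median-of-ratios size factors.

    counts_by_sample: list of samples, each a list of counts per gene
    (same gene order in every sample).
    """
    n_samples = len(counts_by_sample)
    n_genes = len(counts_by_sample[0])

    # Log geometric mean per gene; None marks genes with a zero in any sample.
    log_geo_means = []
    for g in range(n_genes):
        gene_counts = [sample[g] for sample in counts_by_sample]
        if min(gene_counts) > 0:
            log_geo_means.append(sum(math.log(c) for c in gene_counts) / n_samples)
        else:
            log_geo_means.append(None)

    factors = []
    for sample in counts_by_sample:
        # Log ratio of each usable gene's count to its geometric mean.
        log_ratios = [
            math.log(sample[g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


# Hypothetical counts for genes A, B, C, D; gene D is 0 in the treated sample.
untreated = [2, 6, 12, 30]
treated = [10, 20, 50, 0]

sf_untreated, sf_treated = size_factors([untreated, treated])
print(sf_untreated, sf_treated)
```

      Here gene D contributes nothing to either size factor, which is the behaviour described in the reply above.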

  • @relaxstation600
    @relaxstation600 11 months ago

    13:40 step1

  • @user-ku2po2by4k
    @user-ku2po2by4k 11 months ago

    Can you show the source pipeline code? My brain is overheated

  • @user-uh6ms8yj5g
    @user-uh6ms8yj5g 8 months ago

    Generalized linear model equation explanation was not very basic, otherwise a great presentation

  • @mrbane2000
    @mrbane2000 1 year ago

    Biocutieisian