Differential expression in Python with pyDESeq2

Sdílet
Vložit
  • čas přidán 2. 07. 2024
  • Analyze RNAseq counts data with a Python implementation of DESeq2. I cover basic differential expression analysis, PCA plots, GSEA, heatmaps, and volcano plots.
    Github:
    github.com/mousepixels/sanbom...
    The samples include normal human cell control and replicative senescence cells from NCBI accession GSE171663
    0:00 Intro
    0:30 Differential expression
    7:30 PCA
    8:30 GSEA
    11:04 Heatmap
    15:10 Volcano
  • Věda a technologie

Komentáře • 93

  • @sanbomics
    @sanbomics  Před 11 měsíci +2

    Fix: You should used the normalized counts layer for the PCA step.

    • @sarap5548
      @sarap5548 Před 11 měsíci

      Thanks! Could you put the code for it?

    • @cagankaplan5302
      @cagankaplan5302 Před 8 měsíci

      Should we specify sc.tl.pca(dds.layers["normed_counts"] or just adding the log1p layer to dds before the pca is enough?

    • @user-rb2pt1zk1s
      @user-rb2pt1zk1s Před 5 měsíci

      @@sarap5548 did you know please how to fix it in the code ?

    • @vincenzomazza2880
      @vincenzomazza2880 Před 5 měsíci

      sc.pp.log1p(dds)
      sc.tl.pca(dds)
      sc.pl.pca(dds, color = 'Condition', size = 30)

  • @vjsanchezarevalo
    @vjsanchezarevalo Před rokem

    Great video! I have been using pyDeseq2 and works great! Very cool the mapper function! Your videos are always very helpful, thanks a lot!

    • @sanbomics
      @sanbomics  Před rokem

      Thank you for the kind words :)

  • @louis-philippechaumont8400

    I watched a bunch of your videos. They are really helpful and well made. I like that you present both python and R. They are helping me get started in bioinformatics!

    • @sanbomics
      @sanbomics  Před rokem

      I am glad you have found them helpful!

  • @nikelElegance
    @nikelElegance Před 4 měsíci

    AMAZING. Thank you

  • @heatherpeng7437
    @heatherpeng7437 Před 7 měsíci

    Hello, thank you so much for this great, informative tutorial! I'd like to ask what does the PCA plot mean in your example? Thank you.

  • @mikezhao5294
    @mikezhao5294 Před 11 měsíci

    Your video is super helpful! I am wondering is any pipeline for differential expression analysis for ST data, or we just use SC pipeline? Thanks!!

  • @AbelDavid-qc5xy
    @AbelDavid-qc5xy Před 23 dny

    Thank you for this very helpful tutorial. is there a function for plotting the euclidean distance map for each sample?

  • @lly6115
    @lly6115 Před rokem +1

    I will click your video everyday to pump that viewed number.(lol)
    Thank you!

    • @sanbomics
      @sanbomics  Před rokem +1

      Doing the lord's work

    • @lly6115
      @lly6115 Před rokem

      @@sanbomics thank you for your respectul endeavor and time!

  • @mocabeentrill
    @mocabeentrill Před rokem

    Been a while man! Hope you've been well.

    • @sanbomics
      @sanbomics  Před rokem +2

      Been busy with werk and haven't been able to do one. But doing well, thanks!

  • @lamjennygrace623
    @lamjennygrace623 Před 8 měsíci

    Thank you for such a comprehensive tutorial. I'm totally new to Bioinformatics and I found your videos are very helpful. Could you please do a tutorial on how to get the count table and sample file from scratch in Python?? I would very appreaciated

    • @sanbomics
      @sanbomics  Před 7 měsíci

      Hi! You should be able to find guides on how to open csv with pandas. If you can load the counts into pandas you should be able to follow along with the rest

  • @marwanmohamed3844
    @marwanmohamed3844 Před 3 měsíci

    Hey! Many thanks for this tutorial!
    i got a questions regarding the design factors, am interested in DE to check effect of AGE, while i want to control for SEX and also want to test for there interaction , do you know how to implement the interaction term in the design factors?
    design_factors. = ["Age","Sex","Age x Sex"],?
    continuous_factors=["AGE"]

  • @gerolduntergasser4000
    @gerolduntergasser4000 Před 11 měsíci

    cool

  • @paulavazquez5135
    @paulavazquez5135 Před 8 měsíci

    Thanks for the great tutorial. How can I get a data frame with the normalized counts for each gene in each sample?

    • @sanbomics
      @sanbomics  Před 7 měsíci +1

      Without checking.. I think it is in one of the layers of the dds object. Something like dds.layers['norm_counts'}

  • @juanjara2221
    @juanjara2221 Před 3 měsíci

    where can I download count table? in the GSE accesion we only have normalized counts or raw data

  • @TakaTatsumi
    @TakaTatsumi Před rokem

    thanks !

  • @JAIRAJMATHUR
    @JAIRAJMATHUR Před 11 měsíci

    I love your videos, and I am deeply grateful for all the content, especially python related that you put out! Hope to see more in the future!!
    What exactly does baseMean tell us? How would you set the threshold for it, as it could vary from dataset to dataset?
    Also when you use scanpy for PCA, you didnt normalize your data, or log transform it. Are those not necessary when using sc.tl.pca?
    Another approach that I have tried and cross verified with R, is to use vst transformation using the dds.vst function, and then use this vst transformed data to make PCA, do you think this would make your PCA plot look better?
    Sorry, one last question: For GSEA, What would be the difference using stat for ranking vs sign(LFC)*-log10(padj) vs LFC*-log10(padj)
    Also, pre_res.res2d gives a nice dataframe with everything :)

    • @sanbomics
      @sanbomics  Před 11 měsíci

      Thank you! Basemean is just the mean of all the samples for that given gene. Setting a threshold is somewhat arbitrary. You could do it based on the distribution but most people just pick some number between 10-100, unfortunately I haven't seen standardized guidelines. Thanks for pointing that out.. it is actually a mistake on my part. Should have specified the normed_counts layer. Stat vs other metrics for GSEA is another one of those things that hasn't really been standardized. I like stat more now because it doesn't require an extra line of code and it doesn't seem to be inflated as badly as p-value can be sometimes. I would expect the actual results to not differ much though. Wow, thanks for pointing out pre_res.res2d xD

    • @sanbomics
      @sanbomics  Před 11 měsíci

      I'm going to pin this message because it has several important points. Thanks again for commenting!

    • @burcakotlu7858
      @burcakotlu7858 Před 11 měsíci

      @@sanbomics I also liked the content. First of all, thanks for it. Regarding PCA, when I apply vst and then PCA in R, I got more meaningful PCA plot than the scanpy PCA without any transformation beforehand for the same bulk RNA-seq data at hand.

    • @burcakotlu7858
      @burcakotlu7858 Před 11 měsíci

      @@sanbomics By the way, I got error for GSEA. gp.prerank(rnk=ranking,
      gene_sets=['GO_Biological_Process_2023', my_gene_set],
      seed = 6, permutation_num=100) givesTypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' on my side.

  • @user-zm6ip8iy9q
    @user-zm6ip8iy9q Před rokem +2

    As a beginner in bioinformatics, I am watching the uploaded videos well. There are cases where TCGA data is used for cancer analysis at the single cell level, so if you have time, could you make a related video? Thank you always.

    • @sanbomics
      @sanbomics  Před rokem

      What analysis are you trying to do?

  • @ustigergirl
    @ustigergirl Před 8 měsíci

    at the 'mapper' stage if your species is mouse do you change it to 'mouse'? What would be the correct format?

    • @sanbomics
      @sanbomics  Před 8 měsíci

      just make sure to specify species = mouse when you initialize the mapper the first time. Then you can use it over and over again without specifying

  • @adekunleajiboye1244
    @adekunleajiboye1244 Před rokem

    Thanks for this video. I was not getting the one in R. How do I get your contact?

    • @sanbomics
      @sanbomics  Před rokem

      You can send me a message on twitter!

  • @vidalmrc
    @vidalmrc Před 4 měsíci

    Hey! Many thanks for this tutorial!
    I don't know if something changed in the PyDeseq2 API but the line "stat_res.summary()" triggers this error:
    "IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed"
    I have spent about 1h now trying to troubleshoot it without any success.
    I use the "count_table_for_deseq_example.csv" from your github, and I tried going throught the official "Getting Started" to end up with the same error...
    Any idea what's going on ?

    • @sanbomics
      @sanbomics  Před 4 měsíci +1

      They have made a lot of changes since I made this video, to the point where I am thinking about doing a new one. I'm not sure I have seen that error at that specific point before, but I have had that same error somewhere else in the workflow. Off the top of my head I can't remember what I did to fix it. It may be that something is off in the dds object. Hard to say without taking a closer look. Let me know if you figure it out

    • @vidalmrc
      @vidalmrc Před 4 měsíci

      @@sanbomics Thanks for your reply, and your videos in general!

  • @alexrutherford3105
    @alexrutherford3105 Před 9 měsíci +2

    I'm making a dds and I got a weird error of:
    TypeError: DeseqDataSet.__init__() got an unexpected keyword argument 'clinical'. has this been updated?

    • @Rodrigo-yx3sn
      @Rodrigo-yx3sn Před 8 měsíci +2

      I think it changed.I used it like this :
      dds = DeseqDataSet(counts=counts,
      metadata=metadata,
      design_factors="Condition")

    • @sanbomics
      @sanbomics  Před 8 měsíci

      Yeah, sorry for the late reply. They changed clinical to metadata

  • @serychristianrenaud
    @serychristianrenaud Před rokem

    thanks

  • @analeighgui4693
    @analeighgui4693 Před 11 měsíci

    Thx a million for the video! I have a general question about Python vs. R. For people new to bioinformatics, R is usually the lang they start with. Many cool packages are written in Python (and more and more). Do you see benefits for bioinformatics newbies to adopt python? You have great videos on both langs, so I want to see your opinion.

    • @sanbomics
      @sanbomics  Před 11 měsíci +1

      You'll need to learn at least a little of both. But Python is the future in my opinion. I did a video on this a while back: czcams.com/video/vNH3HejAwSE/video.html

    • @analeighgui4693
      @analeighgui4693 Před 11 měsíci

      @@sanbomics Thanks for sharing! I rely heavily on R but have come across more and more often Python packages, especially in VDJ analysis. Got to seriously pick up the Python now!

  • @Shin-ct1ts
    @Shin-ct1ts Před 2 měsíci

    Thank you for your great introduction. I have the questions in GSEA part. Though I could get ranking and wanna do all gene analysis in 9:40 min and always get the error like below.
    ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
    How can I solve this problem...?

    • @sanbomics
      @sanbomics  Před měsícem

      Sometimes the server is down. Were you able to get it working?

  • @user-fn6fl4jl4x
    @user-fn6fl4jl4x Před 6 měsíci

    Hi, can i compare 4 groups(you use 2 groups)? Or DESeq2 only for two? Thanks for answer.

    • @sanbomics
      @sanbomics  Před 6 měsíci

      You can only do pairwise comparisons. But if you want to group them you can. for example, 3 vs 1. Or 2 vs 2. Very dependent on what you are trying to accomplish.

  • @lly6115
    @lly6115 Před rokem

    A quick simple question, do you know if pydeseq2 is capable of processing multiple groups? since I haven't being able to find that in its official docs.

    • @sanbomics
      @sanbomics  Před rokem

      What exactly are you trying to do?

    • @lly6115
      @lly6115 Před rokem

      @@sanbomics Thank you for your reply. Let's say I have RNA counts data for 9 samples, including control(3 sample), treatment1(3 samples), treatment2(3 samples). Is it possible to transform 9 columns of counts data to expression matix data using pydeseq2. Could be a blatant rookie question( so ignore this would be totally understandable). Thank you very much for your time.

    • @JAIRAJMATHUR
      @JAIRAJMATHUR Před 11 měsíci

      @@lly6115 @sanbomics I have been able to convert it into a single condition column, by joining the strings. Then I can specify which contrast I want to run and get my results.

  • @NadaHashimOsman
    @NadaHashimOsman Před 6 měsíci

    what would be the difference in the code it i have "normalized" paired data?
    whick steps should i neglect , whick steps should be edited to fit my case

    • @sanbomics
      @sanbomics  Před 5 měsíci +1

      If the data are already normalized you may not be able to use Deseq. I think limma works for that

  • @begomorras
    @begomorras Před 9 měsíci

    My dataset is just samples and doesn't have any control. I have made the metadata dataframe with all S in the conditions, as this is my design but then the prgram spits an error and says :
    ValueError: Factors should take at least two values, but Condition takes the single value '['S']'.
    I know that in R the program can be bypassed by writting " design =~1 " and I would like to get help on how to do so in python
    Thanks in advance

    • @sanbomics
      @sanbomics  Před 8 měsíci

      You don't need a control, but you do need more than one group to test. You need to specify the groups in the condition column

  • @DrRam_Undi
    @DrRam_Undi Před 2 měsíci

    Thank you for the video tutorial. I am not able to install pydeseq2 modules using jupyter notebook. Could you provide me any details on how to solve this? I am getting an error like this : ModuleNotFoundError: No module named 'scipy.stats._boost.nct_ufunc'.
    Thank you

    • @sanbomics
      @sanbomics  Před měsícem

      I would try making a new environment with python=3.9 and reinstall your packages. Try to install them all in one command so that pip can do a better job avoiding conflicts: pip install a b c d e f ...

  • @sasimuo
    @sasimuo Před 4 měsíci

    Thanks for your video . Sometimes I get a value error when I run dds.deseq2() , ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported. how do I fix it?

    • @sanbomics
      @sanbomics  Před 4 měsíci

      Hard to say without looking more closely. See if you filter more lowly expressed genes out if that gets rid of the error. For example, get rid of genes that aren't expressed in at least two samples

    • @sasimuo
      @sasimuo Před 4 měsíci

      @@sanbomics Thank you. when I filter (counts = counts[counts.sum(axis = 1) > 5] ), it helps. Sometimes I play with different counts sum value. However, if I get too high of counts sum value, I get another error which says perfect partition error. is there any parameters I could use in dds.deseq2() function to force to run that overcomes any error? Thanks

  • @chunliang8290
    @chunliang8290 Před 5 měsíci

    Can you install pyDESeq2 into your google drive. Then, later, you just need to re-mount your google drive so that the pyDESeq2 will be available next time you use it without re-installation?

    • @sanbomics
      @sanbomics  Před 5 měsíci

      No idea, I haven't tried this. Sorry!

  • @chunliang8290
    @chunliang8290 Před 5 měsíci

    I am using your sanbomics. Unfortunately, you only has human and mouse as the species. I work with chicken data. What is your suggestion for alternative solution here?

    • @sanbomics
      @sanbomics  Před 5 měsíci

      I would stick to ensembl/biomart for now

  • @Dr.UgurComlekcioglu
    @Dr.UgurComlekcioglu Před 8 měsíci

    Thank you very much for this great video. How can I get gene symbols of "Cow"?

    • @sanbomics
      @sanbomics  Před 8 měsíci

      You are doing DE analysis from cow samples? Interesting. Check ensembl for the species and get the gtf. The gtf will have ensembl id and gene symbol, you just have to parse the file

    • @Dr.UgurComlekcioglu
      @Dr.UgurComlekcioglu Před 7 měsíci

      @@sanbomics Thank you very much. I did it, and it worked.

  • @mehwishshoaib1591
    @mehwishshoaib1591 Před 9 měsíci

    When define condition colum you jave 8 samples for which yoù put C and RS manually what to do when we have more no of samples for exapmle first 50 are diseased and last 40 are normal for that how to define condition column

    • @sanbomics
      @sanbomics  Před 7 měsíci

      You can do ['disease'] * 50 + ['normal']*40 or you can import the metadata from csv

  • @ustigergirl
    @ustigergirl Před 8 měsíci

    can your sanbomic script/package be used for cuffdiff results (cufflink tool)?

    • @sanbomics
      @sanbomics  Před 7 měsíci

      Which part? The volcano plot?

    • @ustigergirl
      @ustigergirl Před 6 měsíci

      @@sanbomics sorry for the late reply, for heatmap and volcano plot. Thanks!

  • @imranluqman4622
    @imranluqman4622 Před 6 měsíci

    Hey! I got an error saying 'DeseqStats.__init__() got an unexpected keyword argument 'n_cpus''. Could it be that it's an old argument?
    Thank you!

    • @sanbomics
      @sanbomics  Před 6 měsíci

      Yeah possibly, it was a brand new package and they have been making a lot of changes. For example, they changed "clinical" to "metadata" since i've made the video too

    • @irvingfrancisco6822
      @irvingfrancisco6822 Před 5 měsíci

      Hey, do you know what is the new argument for 'n_cups'? I have the same issue

  • @AbelDavid-qc5xy
    @AbelDavid-qc5xy Před 26 dny

    Can you cover batch correction in python?

  • @LeticiaRojasC
    @LeticiaRojasC Před 11 měsíci

    Can pyDESeq2 be use for single-cell RNAseq instead of bulk?

    • @sanbomics
      @sanbomics  Před 11 měsíci

      Yup! I actually just did a pseudobulk video czcams.com/video/Ee0PQUwVH8Q/video.html

  • @alexrutherford3105
    @alexrutherford3105 Před 9 měsíci

    do you have a link to your sample count_table.csv ?
    thanks!

    • @sanbomics
      @sanbomics  Před 8 měsíci

      There is a counts table on my github, but I don't remember if it is the same one I used in this video

  • @jousty6696
    @jousty6696 Před 5 měsíci

    At 4:13 Name error: name ‘DeseqDataSet_init_’ is not definied

  • @NewPlant_
    @NewPlant_ Před 2 měsíci

    Hey man, I'm trying to use this video to help me with my own work but during the GSEA part where you are using the gp.prerank function, I keep getting a value error. ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements. I've done everything you have and my dataframe has two columns just like yours, I don't know why it's expecting 3.

    • @sanbomics
      @sanbomics  Před 2 měsíci

      It might be related to the service being down. It's an API call to their server and when it's down it returns not what is expected and you get a weird error. At least I know I have had to deal with this in the past

  • @user-rb2pt1zk1s
    @user-rb2pt1zk1s Před 4 měsíci

    Hello , Can you please share the source of the dataset ? which article did you use pleaaaase

    • @sanbomics
      @sanbomics  Před 4 měsíci +1

      Ooops I forgot to put that in the description. Thanks! The samples include normal human cell control and replicative senescence cells from NCBI accession GSE171663

  • @MegaLiaoliao
    @MegaLiaoliao Před 8 měsíci

    ValueError: The count matrix should only contain integers at 4:00 min

    • @sanbomics
      @sanbomics  Před 7 měsíci

      Try converting the whole matrix to int and see if you get an error