Differential expression in Python with pyDESeq2
Vložit
- čas přidán 2. 07. 2024
- Analyze RNAseq counts data with a Python implementation of DESeq2. I cover basic differential expression analysis, PCA plots, GSEA, heatmaps, and volcano plots.
Github:
github.com/mousepixels/sanbom...
The samples include normal human cell control and replicative senescence cells from NCBI accession GSE171663
0:00 Intro
0:30 Differential expression
7:30 PCA
8:30 GSEA
11:04 Heatmap
15:10 Volcano - Věda a technologie
Fix: You should used the normalized counts layer for the PCA step.
Thanks! Could you put the code for it?
Should we specify sc.tl.pca(dds.layers["normed_counts"] or just adding the log1p layer to dds before the pca is enough?
@@sarap5548 did you know please how to fix it in the code ?
sc.pp.log1p(dds)
sc.tl.pca(dds)
sc.pl.pca(dds, color = 'Condition', size = 30)
Great video! I have been using pyDeseq2 and works great! Very cool the mapper function! Your videos are always very helpful, thanks a lot!
Thank you for the kind words :)
I watched a bunch of your videos. They are really helpful and well made. I like that you present both python and R. They are helping me get started in bioinformatics!
I am glad you have found them helpful!
AMAZING. Thank you
Hello, thank you so much for this great, informative tutorial! I'd like to ask what does the PCA plot mean in your example? Thank you.
Your video is super helpful! I am wondering is any pipeline for differential expression analysis for ST data, or we just use SC pipeline? Thanks!!
Thank you for this very helpful tutorial. is there a function for plotting the euclidean distance map for each sample?
I will click your video everyday to pump that viewed number.(lol)
Thank you!
Doing the lord's work
@@sanbomics thank you for your respectul endeavor and time!
Been a while man! Hope you've been well.
Been busy with werk and haven't been able to do one. But doing well, thanks!
Thank you for such a comprehensive tutorial. I'm totally new to Bioinformatics and I found your videos are very helpful. Could you please do a tutorial on how to get the count table and sample file from scratch in Python?? I would very appreaciated
Hi! You should be able to find guides on how to open csv with pandas. If you can load the counts into pandas you should be able to follow along with the rest
Hey! Many thanks for this tutorial!
i got a questions regarding the design factors, am interested in DE to check effect of AGE, while i want to control for SEX and also want to test for there interaction , do you know how to implement the interaction term in the design factors?
design_factors. = ["Age","Sex","Age x Sex"],?
continuous_factors=["AGE"]
cool
Thanks for the great tutorial. How can I get a data frame with the normalized counts for each gene in each sample?
Without checking.. I think it is in one of the layers of the dds object. Something like dds.layers['norm_counts'}
where can I download count table? in the GSE accesion we only have normalized counts or raw data
thanks !
You're welcome!
I love your videos, and I am deeply grateful for all the content, especially python related that you put out! Hope to see more in the future!!
What exactly does baseMean tell us? How would you set the threshold for it, as it could vary from dataset to dataset?
Also when you use scanpy for PCA, you didnt normalize your data, or log transform it. Are those not necessary when using sc.tl.pca?
Another approach that I have tried and cross verified with R, is to use vst transformation using the dds.vst function, and then use this vst transformed data to make PCA, do you think this would make your PCA plot look better?
Sorry, one last question: For GSEA, What would be the difference using stat for ranking vs sign(LFC)*-log10(padj) vs LFC*-log10(padj)
Also, pre_res.res2d gives a nice dataframe with everything :)
Thank you! Basemean is just the mean of all the samples for that given gene. Setting a threshold is somewhat arbitrary. You could do it based on the distribution but most people just pick some number between 10-100, unfortunately I haven't seen standardized guidelines. Thanks for pointing that out.. it is actually a mistake on my part. Should have specified the normed_counts layer. Stat vs other metrics for GSEA is another one of those things that hasn't really been standardized. I like stat more now because it doesn't require an extra line of code and it doesn't seem to be inflated as badly as p-value can be sometimes. I would expect the actual results to not differ much though. Wow, thanks for pointing out pre_res.res2d xD
I'm going to pin this message because it has several important points. Thanks again for commenting!
@@sanbomics I also liked the content. First of all, thanks for it. Regarding PCA, when I apply vst and then PCA in R, I got more meaningful PCA plot than the scanpy PCA without any transformation beforehand for the same bulk RNA-seq data at hand.
@@sanbomics By the way, I got error for GSEA. gp.prerank(rnk=ranking,
gene_sets=['GO_Biological_Process_2023', my_gene_set],
seed = 6, permutation_num=100) givesTypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' on my side.
As a beginner in bioinformatics, I am watching the uploaded videos well. There are cases where TCGA data is used for cancer analysis at the single cell level, so if you have time, could you make a related video? Thank you always.
What analysis are you trying to do?
at the 'mapper' stage if your species is mouse do you change it to 'mouse'? What would be the correct format?
just make sure to specify species = mouse when you initialize the mapper the first time. Then you can use it over and over again without specifying
Thanks for this video. I was not getting the one in R. How do I get your contact?
You can send me a message on twitter!
Hey! Many thanks for this tutorial!
I don't know if something changed in the PyDeseq2 API but the line "stat_res.summary()" triggers this error:
"IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed"
I have spent about 1h now trying to troubleshoot it without any success.
I use the "count_table_for_deseq_example.csv" from your github, and I tried going throught the official "Getting Started" to end up with the same error...
Any idea what's going on ?
They have made a lot of changes since I made this video, to the point where I am thinking about doing a new one. I'm not sure I have seen that error at that specific point before, but I have had that same error somewhere else in the workflow. Off the top of my head I can't remember what I did to fix it. It may be that something is off in the dds object. Hard to say without taking a closer look. Let me know if you figure it out
@@sanbomics Thanks for your reply, and your videos in general!
I'm making a dds and I got a weird error of:
TypeError: DeseqDataSet.__init__() got an unexpected keyword argument 'clinical'. has this been updated?
I think it changed.I used it like this :
dds = DeseqDataSet(counts=counts,
metadata=metadata,
design_factors="Condition")
Yeah, sorry for the late reply. They changed clinical to metadata
thanks
No problem!
Thx a million for the video! I have a general question about Python vs. R. For people new to bioinformatics, R is usually the lang they start with. Many cool packages are written in Python (and more and more). Do you see benefits for bioinformatics newbies to adopt python? You have great videos on both langs, so I want to see your opinion.
You'll need to learn at least a little of both. But Python is the future in my opinion. I did a video on this a while back: czcams.com/video/vNH3HejAwSE/video.html
@@sanbomics Thanks for sharing! I rely heavily on R but have come across more and more often Python packages, especially in VDJ analysis. Got to seriously pick up the Python now!
Thank you for your great introduction. I have the questions in GSEA part. Though I could get ranking and wanna do all gene analysis in 9:40 min and always get the error like below.
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
How can I solve this problem...?
Sometimes the server is down. Were you able to get it working?
Hi, can i compare 4 groups(you use 2 groups)? Or DESeq2 only for two? Thanks for answer.
You can only do pairwise comparisons. But if you want to group them you can. for example, 3 vs 1. Or 2 vs 2. Very dependent on what you are trying to accomplish.
A quick simple question, do you know if pydeseq2 is capable of processing multiple groups? since I haven't being able to find that in its official docs.
What exactly are you trying to do?
@@sanbomics Thank you for your reply. Let's say I have RNA counts data for 9 samples, including control(3 sample), treatment1(3 samples), treatment2(3 samples). Is it possible to transform 9 columns of counts data to expression matix data using pydeseq2. Could be a blatant rookie question( so ignore this would be totally understandable). Thank you very much for your time.
@@lly6115 @sanbomics I have been able to convert it into a single condition column, by joining the strings. Then I can specify which contrast I want to run and get my results.
what would be the difference in the code it i have "normalized" paired data?
whick steps should i neglect , whick steps should be edited to fit my case
If the data are already normalized you may not be able to use Deseq. I think limma works for that
My dataset is just samples and doesn't have any control. I have made the metadata dataframe with all S in the conditions, as this is my design but then the prgram spits an error and says :
ValueError: Factors should take at least two values, but Condition takes the single value '['S']'.
I know that in R the program can be bypassed by writting " design =~1 " and I would like to get help on how to do so in python
Thanks in advance
You don't need a control, but you do need more than one group to test. You need to specify the groups in the condition column
Thank you for the video tutorial. I am not able to install pydeseq2 modules using jupyter notebook. Could you provide me any details on how to solve this? I am getting an error like this : ModuleNotFoundError: No module named 'scipy.stats._boost.nct_ufunc'.
Thank you
I would try making a new environment with python=3.9 and reinstall your packages. Try to install them all in one command so that pip can do a better job avoiding conflicts: pip install a b c d e f ...
Thanks for your video . Sometimes I get a value error when I run dds.deseq2() , ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported. how do I fix it?
Hard to say without looking more closely. See if you filter more lowly expressed genes out if that gets rid of the error. For example, get rid of genes that aren't expressed in at least two samples
@@sanbomics Thank you. when I filter (counts = counts[counts.sum(axis = 1) > 5] ), it helps. Sometimes I play with different counts sum value. However, if I get too high of counts sum value, I get another error which says perfect partition error. is there any parameters I could use in dds.deseq2() function to force to run that overcomes any error? Thanks
Can you install pyDESeq2 into your google drive. Then, later, you just need to re-mount your google drive so that the pyDESeq2 will be available next time you use it without re-installation?
No idea, I haven't tried this. Sorry!
I am using your sanbomics. Unfortunately, you only has human and mouse as the species. I work with chicken data. What is your suggestion for alternative solution here?
I would stick to ensembl/biomart for now
Thank you very much for this great video. How can I get gene symbols of "Cow"?
You are doing DE analysis from cow samples? Interesting. Check ensembl for the species and get the gtf. The gtf will have ensembl id and gene symbol, you just have to parse the file
@@sanbomics Thank you very much. I did it, and it worked.
When define condition colum you jave 8 samples for which yoù put C and RS manually what to do when we have more no of samples for exapmle first 50 are diseased and last 40 are normal for that how to define condition column
You can do ['disease'] * 50 + ['normal']*40 or you can import the metadata from csv
can your sanbomic script/package be used for cuffdiff results (cufflink tool)?
Which part? The volcano plot?
@@sanbomics sorry for the late reply, for heatmap and volcano plot. Thanks!
Hey! I got an error saying 'DeseqStats.__init__() got an unexpected keyword argument 'n_cpus''. Could it be that it's an old argument?
Thank you!
Yeah possibly, it was a brand new package and they have been making a lot of changes. For example, they changed "clinical" to "metadata" since i've made the video too
Hey, do you know what is the new argument for 'n_cups'? I have the same issue
Can you cover batch correction in python?
Can pyDESeq2 be use for single-cell RNAseq instead of bulk?
Yup! I actually just did a pseudobulk video czcams.com/video/Ee0PQUwVH8Q/video.html
do you have a link to your sample count_table.csv ?
thanks!
There is a counts table on my github, but I don't remember if it is the same one I used in this video
At 4:13 Name error: name ‘DeseqDataSet_init_’ is not definied
Hey man, I'm trying to use this video to help me with my own work but during the GSEA part where you are using the gp.prerank function, I keep getting a value error. ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements. I've done everything you have and my dataframe has two columns just like yours, I don't know why it's expecting 3.
It might be related to the service being down. It's an API call to their server and when it's down it returns not what is expected and you get a weird error. At least I know I have had to deal with this in the past
Hello , Can you please share the source of the dataset ? which article did you use pleaaaase
Ooops I forgot to put that in the description. Thanks! The samples include normal human cell control and replicative senescence cells from NCBI accession GSE171663
ValueError: The count matrix should only contain integers at 4:00 min
Try converting the whole matrix to int and see if you get an error