Guide to filtering and subsetting single-cell anndata and pandas objects | basic and advanced
- Added 19 Jul 2024
- Manipulating the anndata object is fundamental to single-cell analysis using scanpy in python. I show several basic and advanced methods to filter and subset your single-cell data based on different scenarios. These skills are necessary for single-cell and general data analysis in python.
0:00 Intro
0:31 Basic filtering
6:45 Custom filtering
9:50 Gene-based filtering - Science and technology
Very helpful, thanks!
You're welcome!
Thanks for your great video, very practical! I have a quick question. Can you help me with it?
When I subcluster the T cell cluster, how can the new labels be passed back to the original adata?
I thought about merging the original adata (minus the T cells) with the subclustered T cell adata, but that seemed clumsy. Do you have a better method?
Something like 'sce$celltype[match(colnames(TNK),colnames(sce))] = TNK$celltype' in Seurat.
I really like using dictionaries and the .map() function on a pandas column to return a new column
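A minimal sketch of that idea, using plain pandas DataFrames as stand-ins for adata.obs and for the obs table of the subclustered T cell object (adata.obs is a pandas DataFrame, so the same lines should carry over). The cell IDs, the column names "leiden", "subtype" and "celltype_fine", and the labels are all hypothetical.

```python
import pandas as pd

# Stand-in for adata.obs of the full object (hypothetical cell IDs and labels)
obs = pd.DataFrame(
    {"celltype": pd.Categorical(["T cell", "B cell", "T cell", "Monocyte"])},
    index=["c1", "c2", "c3", "c4"],
)

# Stand-in for the obs table of the subclustered T cell object
t_obs = pd.DataFrame({"leiden": ["0", "1"]}, index=["c1", "c3"])

# Dictionary + .map(): turn subcluster IDs into readable labels
t_obs["subtype"] = t_obs["leiden"].map({"0": "CD4 T", "1": "CD8 T"})

# Categorical columns reject labels they have not seen, so work on strings
fine = obs["celltype"].astype(str)

# Assignment aligns on the index (cell ID), so only the T cells are overwritten
fine.loc[t_obs.index] = t_obs["subtype"]
obs["celltype_fine"] = fine.astype("category")

print(obs)
```

Because the assignment is matched on cell ID, no merge of the two objects is needed; cells outside the T cell subset keep their original label.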
Awesome 👍
Thanks!
Hi! I have processed and clustered my cells (adata), then subsetted the clusters to remove one contaminating cell cluster (adata_subset). I would now like to reprocess and recluster the clusters I kept. This requires going back to the beginning with the raw counts, which I am unsure how to do.
Before processing the original dataset, I created a layer, adata.layers["counts"] = adata.X.copy(), to save the raw counts. During clustering I set adata.raw to the normalized, log-transformed version of adata (as suggested in the scanpy tutorial), so calling raw.to_adata() will not give me the raw counts for downstream analysis (this is mentioned in an answer to a similar question).
So my question is: how do I use the counts layer, which holds my raw counts in the adata_subset object, to reprocess and recluster?
Thank you!!
Hi! What you can do is reload the original data, then subset it by the obs.index (cell ID). It will be something like:
adata[adata.obs.index.isin(adata_subset.obs.index)]
Let me know if that doesn't work.
Hi, is there a way to subset adata on genes by using an external list of gene names?
Yup! Just get it in as a list somehow then you can subset it like adata = adata[:, the_list]
thank you for your great video!
I have a question.
I want to divide T cells (CD3E-positive) into CD4-positive (CD8-negative) and CD8-positive (CD4-negative) cells.
① First, I preprocessed my data (QC, dimensionality reduction, clustering).
② Next, I converted my data back to the raw data:
(code)
adata = adata.raw.to_adata()
gene_loc1=np.where(adata.var_names=='CD3E')[0][0]
gene_loc1
gene_loc2=np.where(adata.var_names=='CD4')[0][0]
gene_loc2
gene_loc3=np.where(adata.var_names=='CD8A')[0][0]
gene_loc3
adata[(adata.X[:,gene_loc1].toarray()>0)&(adata.X[:,gene_loc2].toarray()>0)&(adata.X[:,gene_loc3].toarray()==0)]
③ Finally, I get the CD3E-positive, CD4-positive, CD8-negative cells.
However, when I run the code below, a large number of CD4-negative, CD8-negative cells are detected.
Theoretically, there should be very few CD4-negative, CD8-negative cells.
Is there a problem somewhere?
(code)
adata[(adata.X[:,gene_loc1].toarray()>0)&(adata.X[:,gene_loc2].toarray()==0)&(adata.X[:,gene_loc3].toarray()==0)]
The problem is that you are going to get some expression in other cells. Because single-cell data is inherently noisy, it is better to annotate CD4/CD8 at the population level rather than at the individual cell level. I would reduce the resolution of your clustering until you see a clear split between CD4 and CD8, then annotate the clusters accordingly. If you don't get a clean separation with your data, you can instead annotate all T cells, subset them out, and apply what you did above to just that subset.
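One way to look at this at the population level is to summarise marker detection per cluster instead of gating single cells. The sketch below uses a small pandas table with made-up values; with a real object the "leiden" column would come from adata.obs and the marker columns from the expression matrix. The rule "label a cluster by the marker detected in more of its cells" is only an illustration, not a recommended threshold.

```python
import pandas as pd

# Hypothetical per-cell table: cluster label plus expression of two markers
df = pd.DataFrame(
    {
        "leiden": ["0", "0", "0", "1", "1", "1"],
        "CD4": [1.2, 0.0, 0.8, 0.0, 0.0, 0.1],
        "CD8A": [0.0, 0.0, 0.1, 1.5, 0.9, 0.0],
    }
)

# Fraction of cells in each cluster with non-zero expression of each marker
frac = (df[["CD4", "CD8A"]] > 0).groupby(df["leiden"]).mean()
print(frac)

# Label each cluster by the marker detected in more of its cells
cluster_label = frac.idxmax(axis=1).map({"CD4": "CD4 T", "CD8A": "CD8 T"})

# Pass the cluster-level label back to every cell in that cluster
df["t_subtype"] = df["leiden"].map(cluster_label)
```

Note that individual cells with zero counts for both markers still receive their cluster's label, which is the point of annotating populations rather than cells.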
@sanbomics
I will try again based on your advice.
I do single cell analysis with scanpy and I'm always looking forward to your videos.
Thank you!!
No problem! Let me know how it turns out
Hi! I'm memory-limited, so I can only load my dataset using the backed='r' option. How would I subset in this scenario?
I avoid backed at all costs haha. I know I have had to do this before, but it is so infrequent that I don't remember how off the top of my head, and I don't remember where to find an example. Hope you figured it out, and sorry for the slow response.