Guide to filtering and subsetting single-cell anndata and pandas objects | basic and advanced

  • Added 19 July 2024
  • Manipulating the anndata object is fundamental to single-cell analysis using scanpy in python. I show several basic and advanced methods to filter and subset your single-cell data based on different scenarios. These skills are necessary for single-cell and general data analysis in python.
    0:00 Intro
    0:31 Basic filtering
    6:45 Custom filtering
    9:50 Gene-based filtering
  • Science & Technology

Comments • 16

  • @user-lv6rj7mc8s 2 years ago

    Very helpful, thanks!

  • @dacyma3442 1 year ago

    Thanks for your great video, very practical! I have a quick question. Can you help me with it?
    When I subcluster the T cell cluster, how can the new labels be passed back to the initial adata?
    I considered merging the original adata (minus the T cells) with the subclustered T cell adata, but that seems clumsy. Is there a better method?
    Something like 'sce$celltype[match(colnames(TNK),colnames(sce))] = TNK$celltype' in Seurat.

    • @sanbomics 1 year ago

      I really like using dictionaries and the .map() function on a pandas column to return a new column.
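
One way to read the dictionary and .map() suggestion is sketched below. This is a minimal illustration, not the author's exact method: it uses a plain pandas DataFrame as a stand-in for adata.obs, and the barcodes, labels, and the celltype_fine column name are all hypothetical.

```python
import pandas as pd

# Stand-in for adata.obs: one row per cell, indexed by barcode (hypothetical data).
obs = pd.DataFrame(
    {"celltype": ["T cell", "B cell", "T cell", "Monocyte"]},
    index=["c1", "c2", "c3", "c4"],
)

# Stand-in for the finer labels from the subclustered T cell object.
t_sub_labels = pd.Series({"c1": "CD4 T", "c3": "CD8 T"})

# Map each barcode to its new label; barcodes that are not in the subset
# come back as NaN, so fall back to the original coarse label for those cells.
barcodes = obs.index.to_series()
obs["celltype_fine"] = barcodes.map(t_sub_labels.to_dict()).fillna(obs["celltype"])

print(obs)
```

Because the lookup is keyed on the cell barcode, the order of cells in the subset does not need to match the order in the parent object.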

  • @mst63th 2 years ago

    Awesome 👍

  • @oliviaringham8706 1 year ago

    Hi! I have processed and clustered my cells (adata), and subsetted the clusters to remove one contaminating cell cluster (adata_subset). However, I would now like to reprocess and recluster the clusters that I chose to keep. This requires going back to the beginning with the raw counts, which I am unsure how to do.
    Before processing the original dataset, I created a layer, adata.layers["counts"] = adata.X.copy(), so that I could save the raw counts there. I set adata.raw to the normalized and log-transformed version of the adata during my clustering (as suggested in the scanpy tutorial), so calling raw.to_adata() will not give me the raw counts for downstream analysis (this is mentioned in an answer to a similar question).
    I guess I am wondering how to use the counts layer that holds my raw counts in the adata_subset object for reprocessing and reclustering?
    Thank you!!

    • @sanbomics 1 year ago +1

      Hi! What you can do is reload the original data, then subset the original data by the obs.index (cell id). It will be something like:
      adata[adata.obs.index.isin(adata_subset.obs.index)]
      Let me know if that doesn't work.
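
The index-based subsetting in this reply can be sketched with plain numpy and pandas objects. The barcodes and count matrix below are made up, and raw_counts simply stands in for the freshly reloaded, unprocessed data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts from the reloaded original data: 5 cells x 3 genes.
orig_barcodes = pd.Index(["c1", "c2", "c3", "c4", "c5"])
raw_counts = np.arange(15).reshape(5, 3)

# Barcodes that survived the earlier filtering (stand-in for adata_subset.obs.index).
kept_barcodes = pd.Index(["c2", "c4", "c5"])

# Boolean mask over the original cells, in the original order.
mask = orig_barcodes.isin(kept_barcodes)

# Row-subset the raw counts; the same mask would go in adata[mask] on an AnnData object.
raw_subset = raw_counts[mask]

print(mask)
print(raw_subset)
```

Since the question mentions that the raw counts were saved in adata.layers["counts"], another route that may avoid reloading is to copy that layer back into X on the subset, for example adata_subset.X = adata_subset.layers["counts"].copy(), and rerun the preprocessing from there. That assumes the layer was kept through the subsetting and still holds unnormalized counts.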

  • @suryakoturan7832 1 year ago

    Hi, is there a way to subset adata on genes by using an external list of gene names?

    • @sanbomics 1 year ago

      Yup! Just get it in as a list somehow then you can subset it like adata = adata[:, the_list]
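
A small precaution that may help with external gene lists: as far as I know, indexing with names that are absent from adata.var_names raises an error, so it can be worth filtering the list first. The sketch below uses a pandas Index as a stand-in for adata.var_names, and the gene names are only examples.

```python
import pandas as pd

# Stand-in for adata.var_names (hypothetical gene names).
var_names = pd.Index(["CD3E", "CD4", "CD8A", "MS4A1"])

# External list, e.g. read from a text file; it may contain names absent from the data.
the_list = ["CD8A", "CD3E", "FOXP3"]

# Keep only names that are present, preserving the order of the external list.
present = [g for g in the_list if g in var_names]
missing = [g for g in the_list if g not in var_names]

print("kept:", present)
print("not found:", missing)

# On an AnnData object the subset would then be: adata = adata[:, present].copy()
```

Printing the missing names is a quick check for mismatched gene symbols or capitalization between the list and the dataset.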

  • @moni-uh7dy 1 year ago

    Thank you for your great video!
    I have a question.
    I want to divide T cells (CD3E-positive) into CD4-positive (CD8-negative) and CD8-positive (CD4-negative) cells.
    ① First, I preprocessed my data (QC, dimensionality reduction, clustering).
    ② Next, I converted my data back to the raw data.
    (code)
    adata = adata.raw.to_adata()
    gene_loc1=np.where(adata.var_names=='CD3E')[0][0]
    gene_loc1
    gene_loc2=np.where(adata.var_names=='CD4')[0][0]
    gene_loc2
    gene_loc3=np.where(adata.var_names=='CD8A')[0][0]
    gene_loc3
    adata[(adata.X[:,gene_loc1].toarray()>0)&(adata.X[:,gene_loc2].toarray()>0)&(adata.X[:,gene_loc3].toarray()==0)]
    ③ Finally, I get the CD3E-positive, CD4-positive, CD8-negative cells.
    However, when I run the code below, a large number of CD4-negative, CD8-negative cells are detected.
    Theoretically, there should be very few CD4-negative, CD8-negative cells.
    Is there a problem somewhere?
    (code)
    adata[(adata.X[:,gene_loc1].toarray()>0)&(adata.X[:,gene_loc2].toarray()==0)&(adata.X[:,gene_loc3].toarray()==0)]

    • @sanbomics 1 year ago

      The problem is you are going to get some expression in other cells. Because single-cell is inherently noisy, it would be better to annotate cd4/cd8 at the population level and not on an individual cell level. I would reduce the resolution of your clustering until you see a clear split between cd4/cd8 then annotate them as such. If you don't get nice separation with your data you can instead annotate all T cells then subset them out and apply what you did above just to that subset.
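
To illustrate what annotating at the population level could look like, here is a rough sketch on a toy per-cell table. The cluster labels, expression values, and the "fraction of cells with any detected expression" rule are all assumptions made for the example, not a recommended threshold.

```python
import pandas as pd

# Hypothetical per-cell table: cluster label plus expression of two marker genes.
cells = pd.DataFrame(
    {
        "cluster": ["0", "0", "0", "1", "1", "1"],
        "CD4": [1.2, 0.0, 0.8, 0.0, 0.0, 0.3],
        "CD8A": [0.0, 0.0, 0.1, 1.5, 0.9, 0.0],
    }
)

# Fraction of cells in each cluster with any detected expression of each marker.
frac_pos = (cells[["CD4", "CD8A"]] > 0).groupby(cells["cluster"]).mean()

# Label each cluster by the marker detected in the larger fraction of its cells.
cluster_label = frac_pos.idxmax(axis=1).map({"CD4": "CD4 T", "CD8A": "CD8 T"})

# Pass the cluster-level label back to every cell with .map().
cells["annotation"] = cells["cluster"].map(cluster_label)

print(frac_pos)
print(cells)
```

Note that individual cells with zero counts for their cluster's marker (dropouts) still receive the cluster's label, which is the point of working at the cluster level rather than thresholding each cell.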

    • @moni-uh7dy 1 year ago

      @sanbomics
      I will try again based on your advice.
      I do single cell analysis with scanpy and I'm always looking forward to your videos.
      Thank you!!

    • @sanbomics 1 year ago

      No problem! Let me know how it turns out

  • @qwerty11111122 2 months ago

    Hi! I'm memory limited, so I can only load my dataset using the backed = 'r' option. How would I subset in this scenario?

    • @sanbomics 1 month ago

      I avoid backed at all costs, haha. I know I have had to do this before, but it is so infrequent that I don't remember how off the top of my head, and I don't remember where to find an example. Hope you figured it out, and sorry for the slow response.