Simple guide to GSEA and plotting in python

Sdílet
Vložit
  • čas přidán 31. 07. 2022
  • I show you how to do and plot GSEA using predefined gene ontology gene sets as well as custom user input genes in python. This easily integrates into other differential expression or single-cell analysis pipelines in python and only requires a few lines of code.
    Note: It is better practice to not filter out the non-DE genes. In this case, you will also need to lower the default permutation_num in the gp.prerank function to something like: permutation_num = 100
    Notebook:
    github.com/mousepixels/sanbom...
  • Věda a technologie

Komentáře • 29

  • @sanbomics
    @sanbomics  Před rokem +3

    Note on which genes to include:
    In the video I include only significant DE genes. This is not best practice, but was faster and required less memory. You SHOULD include all genes that had reasonable expression. A good threshold might be genes with a base mean of greater than 100. If you do include all genes (which you should) you will likely need to decrease the default number of permutations in the gp.prerank command by adding this argument: permutation_num = 100
    See notebook for example: github.com/mousepixels/sanbomics/blob/main/GSEA_in_python.ipynb

    • @aelnabwy
      @aelnabwy Před rokem

      How did you managed to get the result file?

  • @jhpa0308
    @jhpa0308 Před 10 měsíci

    Thanks tonz for your effort making these informative videos. Without your help, I wouldn't be able to learn such nice and easy way of making GSEA plots. I really like every of your video and every single second of it. Really appreciate your smart and clear explaination.
    I have one question, is there anyway to plot GSEA with all the genesets listed at once?
    And per se rate them in order of significance?

    • @sanbomics
      @sanbomics  Před 10 měsíci +1

      You mean multiple lines on one chart? Or plot many individual plots?

    • @jhpa0308
      @jhpa0308 Před 10 měsíci

      @@sanbomics wow… i meant individuals originally but now also wanna know how to plot multiple, thanks for the response!

  • @katarinavalentincic9621

    HI! Thank you for another great video! Can you also use gene sets from msigdb for this? I have them downloaded in visual code in a separate folder, I am working on a server. I wanted to use msigdb package in R with rpy2 , but I have too many problems with setting up an environment, that's why I gave up on R.

    • @sanbomics
      @sanbomics  Před rokem +1

      you can run any set of genes you can get into a python dictionary/list. Check the available databases first and see if it already exists. But if it doesn't-- import your data into a dictionary format, eg, gene_sets = {'term 1':['gene1', 'gene2' ...] , 'term 2':['gene1', 'gene2' ...], etc } then put the dictionary into the gene set argument in the gseapy function

  • @charliewhittaker817
    @charliewhittaker817 Před rokem +2

    Thank you for this video and all of your other videos, they are very concise and educational.
    In this video, you pre-filter the genes in your ranked list. According to my understanding, this should not be done. I think that the ranked list should include all the genes in the experiment that have any evidence of expression. In other words, if it is possible to calculate a ranking metric, the gene should be included. Can you comment on the decision to use only deferentially expressed genes?
    I think this idea actually illustrates one of the strengths of GSEA in that it is possible to make biological interpretations in the absence of genes that pass thresholds that define differential expression.
    I was also wondering about your ranking metric itself. You are using log2FoldChange*-log10(adjp) which is certainly reasonable. I normally use the stat column for this which is something like log2FoldChange/StdError. Could you comment on your choice of ranking metric? Thanks!

    • @sanbomics
      @sanbomics  Před rokem +2

      Thanks for catching this. You raise very important points:
      1) rerunning the notebook I realized why I only used the DE genes. With 1000 default permutations my 128 gb computer runs out of memory and it is very slow. It is better to include all genes like you point out. The results likely aren't much different since the non-DE genes are weighted so little, but it is better to include them all. In this case you will likely have to reduce the number of permutations. "permutation_num" argmuent in gp.prerank. I have included a correction in the video description and I will push a new notebook.
      2) I'm not aware of a a definitive best metric to use to rank the genes. Stat is likely a very good choice. But p value and lfc are more universal. e.g., some single-cell programs do not give you a stat. I think both p and lfc individually may not be ideal, so I had learned to combine them through multiplication. Not sure if this is the BEST way, but seems to work well.
      Thanks again!

  • @harryliu1005
    @harryliu1005 Před 10 měsíci

    Great Video! Can you explain the formula for Rank? why the log10(df.padj) should be negtive?

    • @sanbomics
      @sanbomics  Před 10 měsíci

      Thanks :) probably just easier to rank on stat. It needs to be negative to make it positive.

  • @cherhan9817
    @cherhan9817 Před rokem +1

    Thanks, it is super useful. But when I run gseapy.prerank(....), got the error "No gene sets passed through filtering condition", even try different min_size and permutation_num. Do you know where is wrong?

    • @sanbomics
      @sanbomics  Před rokem

      There are several things that could have gone wrong. The two most likely are either your keys don't match the ones in the database (e.g, they are case sensitive) or you don't have any enriched pathways. Its hard to tell without looking at it more closely.

    • @cherhan9817
      @cherhan9817 Před rokem

      @@sanbomics Thanks for the reply. Yes, the reason is that my data has no enriched GO terms. (I was stupid to use a small test dataset, and misunderstood the returned error.)

  • @EamonCoughlan-rd5kk
    @EamonCoughlan-rd5kk Před 2 měsíci

    Is it possible to adjust the text font size etc., and/or remove the redundant FDR value when running on a single geneset of interest? My default text when plotting with gseapy is much blockier and often overlaps the graph in an ugly way.

    • @sanbomics
      @sanbomics  Před měsícem

      I haven't tried, but you may be able to control it be changing the matplotlib default settings. Im not sure gsea returns an axis so it may not be super easy to customize after. Were you able to find a solution?

    • @EamonCoughlan-rd5kk
      @EamonCoughlan-rd5kk Před měsícem

      @@sanbomics A fairly crude one, for now I just save the plots as pdf and then edit fonts etc. with a pdf editor.

  • @harryliu1005
    @harryliu1005 Před 10 měsíci

    I need your help! :), when I run this: gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph]), the error reports says TypeError: gseaplot() got multiple values for argument 'term'

    • @sanbomics
      @sanbomics  Před 10 měsíci +2

      print term_to_graph and see what it is

    • @harryliu1005
      @harryliu1005 Před 10 měsíci

      it prints 'mitotic spindle organization (GO:0007052)'@@sanbomics

    • @csalt3689
      @csalt3689 Před 10 měsíci +4

      Try this: gseaplot(rank_metric=pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])

    • @harryliu1005
      @harryliu1005 Před 10 měsíci +1

      wow! It works! Thank a lot !! @@csalt3689

    • @user-mg6nt5pg6x
      @user-mg6nt5pg6x Před 4 měsíci

      @@csalt3689 You are a lifesaver! Thank you.

  • @user-jq2di6cl8m
    @user-jq2di6cl8m Před 4 měsíci +1

    Thank you for the great presentation.
    I have an error in the very last code
    no differences in front of these code
    term_to_graph
    > 'GO_Biological_Process_2021__cellular response to DNA damage stimulus (GO:0006974)'
    gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
    > ---------------------------------------------------------------------------
    TypeError Traceback (most recent call last)
    Cell In[122], line 1
    ----> 1 gseaplot(pre_res.ranking, term = term_to_graph, **pre_res.results[term_to_graph])
    TypeError: gseaplot() got multiple values for argument 'term'
    I cannot solve this error. if possible, could you check this?

    • @user-wg3ch5uf2h
      @user-wg3ch5uf2h Před 3 měsíci

      I have the exact same error over and over again no matter what I correct! Did you maybe manage to solve it?

    • @user-jq2di6cl8m
      @user-jq2di6cl8m Před 3 měsíci +1

      @@user-wg3ch5uf2h yes, i solve the problem.
      could you explain you code more detail or send me your script?

    • @user-wg3ch5uf2h
      @user-wg3ch5uf2h Před 3 měsíci

      @@user-jq2di6cl8m Yes of course! How can I contact you?

    • @user-wg3ch5uf2h
      @user-wg3ch5uf2h Před 3 měsíci

      @@user-jq2di6cl8m Yes of course! How can I contact you?

    • @user-wg3ch5uf2h
      @user-wg3ch5uf2h Před 3 měsíci

      @@user-jq2di6cl8m Yes of course! How can I contact you?