Survival analysis with TCGA data in R | Create Kaplan-Meier Curves

  • Added 29 Jun 2024
  • In this video I talk about the concept of survival analysis, what questions it helps to answer, and what data we need to perform this analysis. I also discuss important concepts like censoring and how it is performed, and explain how to interpret Kaplan-Meier curves. Lastly, I demonstrate how to perform survival analysis in R using the survival and survminer packages.
    I hope you find this video helpful! Leave your thoughts in the comment section below!
    Link to Code:
    github.com/kpatel427/YouTubeT...
    How to download data from GDC portal?
    • Download data from GDC...
    How to convert gene IDs to symbols?
    • 3 ways to convert Ense...
    Chapters:
    0:00 Intro
    0:35 Intuition behind survival analysis
    2:21 Why do we perform survival analysis?
    3:57 What is Censoring and why is it important?
    6:14 What is considered as an event?
    6:35 Methods for survival analysis
    8:03 How to read a Kaplan-Meier curve?
    10:31 Question to answer using survival analysis
    10:53 3 things required for survival analysis
    12:08 Download clinical data from GDC portal
    15:57 Getting status information and censoring data
    17:31 Set up an “overall survival” (i.e. time) for each patient in the cohort
    19:01 For event/strata information for each patient, fetch gene expression data from GDC portal
    19:33 Build query using GDCquery()
    22:45 Download data using GDCdownload()
    23:14 Extract counts using GDCprepare()
    25:07 Perform variance stabilizing transformation (vst) on counts before further analysis
    27:38 Wrangle data to keep the relevant columns and get them into the right shape
    33:11 Approaches to divide cohort into 2 groups based on expression
    34:41 Bifurcating patients into low and high TP53 expression groups
    34:57 Define strata for each patient
    38:41 Compute a survival curve using survfit() and create a Kaplan-Meier curve using ggsurvplot()
    41:30 survfit() vs survdiff()
    You can show your support and encouragement by buying me a coffee:
    www.buymeacoffee.com/bioinfor...
    To get in touch:
    Website: bioinformagician.org/
    Github: github.com/kpatel427
    Email: khushbu_p@hotmail.com
    #bioinformagician #bioinformatics #survival #survminer #survivalanalysis #kaplanmeier #tcga #gdcportal #tcgaportal #nci #cran #bioconductor #funcotator #variantcalling #variants #gatk #vcf #gvcf #haplotype #alleles #geneticvariants #mutations #gff3 #gff #gtf #sam #bam #phred #fasta #fastq #singlecell #10X #ensembl #biomart #annotationdbi #annotables #affymetrix #microarray #affy #ncbi #genomics #beginners #tutorial #howto #omics #research #biology #GEO #rnaseq #ngs
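
The later chapters (status and censoring, high/low strata, survfit() vs survdiff()) can be sketched on a toy cohort. This is a minimal sketch, not the code from the video: the data frame, its column names (overall_survival, vital_status, expr, deceased, strata) and all values are made up for illustration, and a median split is assumed as the grouping rule. It uses only the survival package.

```r
library(survival)

# Toy cohort with made-up values (NOT TCGA data); column names are hypothetical
cohort <- data.frame(
  overall_survival = c(120, 340, 560, 800, 950, 1100, 1500, 2100),  # days
  vital_status     = c("Dead", "Alive", "Dead", "Dead",
                       "Alive", "Dead", "Dead", "Alive"),
  expr             = c(9.1, 5.2, 8.7, 9.5, 4.8, 8.9, 5.5, 4.9)      # e.g. transformed counts
)

# Event indicator: 1 = death observed, 0 = censored (still alive at last follow-up)
cohort$deceased <- as.integer(cohort$vital_status == "Dead")

# Assumed grouping rule: split patients at the median expression value
cohort$strata <- ifelse(cohort$expr >= median(cohort$expr), "HIGH", "LOW")

# survfit() estimates one Kaplan-Meier curve per stratum
fit <- survfit(Surv(overall_survival, deceased) ~ strata, data = cohort)
print(summary(fit))

# survdiff() runs the log-rank test comparing the strata
lr <- survdiff(Surv(overall_survival, deceased) ~ strata, data = cohort)
print(lr)
```

With survminer installed, passing the fit to ggsurvplot(fit, data = cohort, pval = TRUE, risk.table = TRUE) draws the curves. survfit() estimates the survival curves, while survdiff() tests whether they differ.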

Comments • 33

  • @shivanirai3626
    @shivanirai3626 10 days ago

    Best channel for any bioinformatician ❤❤

  • @preeti97rox
    @preeti97rox 1 year ago +4

    As someone who doesn't have a degree in Bioinformatics I am truly able to appreciate these things. Never stop making these videos!!

  • @jordanfredette5090
    @jordanfredette5090 1 year ago

    This is literally exactly the resource I was looking for several months ago. Glad to finally have it now. It's so nice to have example code and clear explanation.

  • @MsZhang666
    @MsZhang666 1 year ago +2

    I'm going to do survival analysis tomorrow, and I found you updated this video, it's so so so helpful! You're my Goddess😍😘

  • @codewithme_1988
    @codewithme_1988 1 year ago +1

    Hi, I appreciate your work. Thanks for making these videos

  • @amitrupani9898
    @amitrupani9898 1 year ago +1

    Thank you very much for this very informative tutorial. Very helpful indeed.

  • @PsycheSnacks657
    @PsycheSnacks657 1 year ago +3

    You are the best! Thanks

  • @prakrithi.p7033
    @prakrithi.p7033 10 months ago

    Thank you so much for your amazing content. I just wanted to know how I could extract the TCGA counts for some non-coding regions specified in a bed file. Suggestions would be really helpful. Thanks!

  • @user-mv7uw3dh5d
    @user-mv7uw3dh5d 1 year ago +1

    Thanks so much. This video is really useful. Besides, how can we prepare data to combine different factors to draw forest plot or to construct risk models? Could you please share this similar R code? Thanks again!

  • @ezra47986
    @ezra47986 10 days ago

    Thank you for your video! I just have a question: why did you extract the unstranded counts rather than any other count type?

  • @madushanfernando6495
    @madushanfernando6495 7 months ago

    Thank you very much for the excellent presentation. I am relatively new to TCGA-based R analysis. I was wondering if I can apply the same process to plot survival curves for a particular mutation using SNV data, such as the effect of BRCA1 mutation on the overall survival of ovarian cancer patients. Are there any significant changes that I need to make in the workflow to achieve this?

  • @BilalAhmad-gb7ui
    @BilalAhmad-gb7ui 1 year ago +6

    Could you please make a video on integration of Chip-seq and RNA-seq data?

  • @skim4901
    @skim4901 10 months ago

    Thank you for this very helpful video.
    If I want to know the correlation (Pearson R-value) between some genes in TCGA-Breast Cancer, do I have to use fpkm_unstrand? Could you make a video about this?
    Again, I really appreciate your effort!!

  • @AyrodsGamgam
    @AyrodsGamgam 1 year ago

    thanks. Could you please run a tut on combining Machine Learning in R and TCGA or cbioportal or Gdac or others? Thank you.

  • @stefanodidonato1284
    @stefanodidonato1284 8 months ago +1

    If you ever write a book, let me know cause I'll pay 2000 euro to get it hands down!

  • @reflections86
    @reflections86 1 year ago +1

    Greetings Miss Khushbu! Again a powerful video, and it was really comprehensive. I have one question and would appreciate your guidance on it.
    Suppose we perform survival analysis on RNA-seq data from TCGA, where the expression matrix has 20K genes and 200 patients. After survival analysis I found 30 genes that show a significant survival difference, so I want to go further and run a multivariate Cox regression on these 30 genes. My confusion is which expression matrix we should use in the multivariate Cox model. Should we reduce the initial expression matrix to only the 30 genes as variables (columns) and 200 patients (rows), or should we use the original expression matrix (20K genes and 200 patients) and put only the 30 genes in the Cox equation:
    coxph(Surv(time, event) ~ gene1 + gene2 + gene3 + ... + gene30, data)
    Will highly appreciate your comment on that.
    Thanks and keep doing the great work.

    • @Bioinformagician
      @Bioinformagician 1 year ago +1

      I don't recommend reducing the matrix to 30 genes. You should use the entire dataset and provide the 30 genes in the Cox equation. Also, check for multicollinearity among the 30 genes, as correlations between genes can cause instability in the model estimates. If collinearity is found, you should use feature selection methods to keep the most relevant and independent predictors in the model.

    • @reflections86
      @reflections86 1 year ago

      @Bioinformagician Many thanks. Highly appreciate your reply.
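
The approach discussed in this thread (keep the full data frame, name only the selected genes in the formula, and check them for collinearity first) might look roughly like the sketch below. It is an illustration under stated assumptions, not the channel's code: the data are simulated, three genes stand in for the 30, and the names dat, genes and the 0.8 correlation cutoff are hypothetical choices.

```r
library(survival)

set.seed(1)
n <- 200

# Simulated stand-in for a patients-by-genes table (NOT real TCGA data)
dat <- data.frame(
  time  = rexp(n, rate = 1 / 1000),   # follow-up in days
  event = rbinom(n, 1, 0.6),          # 1 = event observed, 0 = censored
  gene1 = rnorm(n),
  gene2 = rnorm(n),
  gene3 = rnorm(n),
  gene4 = rnorm(n)                    # stays in the data frame, not in the model
)

# Hypothetical list of genes selected by the earlier univariate analysis
genes <- c("gene1", "gene2", "gene3")

# Pairwise Pearson correlations among the selected genes;
# the 0.8 cutoff is an arbitrary illustrative threshold
cor_mat  <- cor(dat[, genes])
high_cor <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
print(high_cor)

# Build the formula from the gene list and fit the multivariable Cox model
form <- as.formula(paste("Surv(time, event) ~", paste(genes, collapse = " + ")))
fit  <- coxph(form, data = dat)
print(summary(fit))
```

coxph() only reads the columns named in the formula, so the remaining columns can stay in the data frame without affecting the fit.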

  • @ShubhamMaurya-ws5ly
    @ShubhamMaurya-ws5ly 1 year ago

    Can you please make video on top colleges of msc bioinformatics in India?

  • @mugomuiruri2313
    @mugomuiruri2313 7 months ago

    good

  • @saeedjaanz
    @saeedjaanz 1 year ago

    Have you ever heard or done MFA & mixOmics DIABLO analysis on TCGA data?

  • @user-yf4pn8bw9c
    @user-yf4pn8bw9c 1 year ago

    How do we change the number of days up to which follow-up is done? Say instead of 8000 days I want the data up to only 4000 days.
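
One common way to restrict follow-up to a fixed horizon is administrative censoring: cap every time at the cutoff and treat anything that happened later as censored. The sketch below assumes this is what the question is after; the vectors and the names cutoff, time_trunc and event_trunc are made up for illustration.

```r
# Hypothetical follow-up times (days) and event indicators (1 = event, 0 = censored)
time  <- c(500, 3900, 4500, 8000)
event <- c(1, 0, 1, 1)

cutoff <- 4000  # assumed follow-up horizon in days

# Cap follow-up at the cutoff
time_trunc <- pmin(time, cutoff)

# Events that occurred after the cutoff become censored at the cutoff
event_trunc <- ifelse(time > cutoff, 0, event)

print(data.frame(time, event, time_trunc, event_trunc))
```

The truncated columns can then be used in Surv(time_trunc, event_trunc). If the goal is only to zoom the plot without changing the analysis, ggsurvplot() also accepts an xlim argument such as xlim = c(0, 4000).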

  • @raresciencesimple5626

    risk.table is showing the following error: Error: 'yaml_body' is not an exported object from 'namespace:xfun'. Can you please help?

  • @shreyasharma8063
    @shreyasharma8063 1 year ago

    Hello ma'am, I am getting pvalue = 47.07. The results are not significant. How can I solve this, and what could be the reason for it?

  • @dwitiroy2700
    @dwitiroy2700 1 year ago

    Hello didi .. I need to talk to you .. can you pls send ur contact details .. it's about my current project .. i have some questions based on bioinformatics

  • @arpitmathur2933
    @arpitmathur2933 11 months ago

    Dividing into groups is not good practice. Regression should be used. I did my whole thesis on this debate.

  • @divyaagrawal6740
    @divyaagrawal6740 1 year ago

    Why do we usually choose “unstranded data” for analysis? @bioinformagician @khushbu. Please do answer this query.

    • @Bioinformagician
      @Bioinformagician 1 year ago +2

      I chose unstranded data for demonstration purposes. If your data is generated using a stranded protocol, you should choose stranded or reverse stranded accordingly.

    • @divyaagrawal6740
      @divyaagrawal6740 1 year ago

      @Bioinformagician thank you

    • @saeedjaanz
      @saeedjaanz 1 year ago +1

      @Bioinformagician I had the same question as @Divya and I got my answer.