Survival analysis with TCGA data in R | Create Kaplan-Meier Curves
- Added on 29. 06. 2024
- In this video I talk about the concept of survival analysis, what questions it helps to answer, and what data we need to perform it. I also discuss important concepts like censoring and how it is performed, and explain how to interpret Kaplan-Meier curves. Lastly, I demonstrate how to perform survival analysis in R using the survival and survminer packages.
I hope you find this video helpful! Leave your thoughts in the comment section below!
Link to Code:
github.com/kpatel427/CZcamsT...
How to download data from GDC portal?
• Download data from GDC...
How to convert gene IDs to symbols?
• 3 ways to convert Ense...
Chapters:
0:00 Intro
0:35 Intuition behind survival analysis
2:21 Why do we perform survival analysis?
3:57 What is Censoring and why is it important?
6:14 What is considered as an event?
6:35 Methods for survival analysis
8:03 How to read a Kaplan-Meier curve?
10:31 Question to answer using survival analysis
10:53 3 things required for survival analysis
12:08 Download clinical data from GDC portal
15:57 Getting status information and censoring data
17:31 Set up an “overall survival” (i.e. time) for each patient in the cohort
19:01 For event/strata information for each patient, fetch gene expression data from GDC portal
19:33 Build query using GDCquery()
22:45 Download data using GDCdownload()
23:14 Extract counts using GDCprepare()
25:07 Perform Variance Stabilizing Transformation (vst) on counts before further analysis
27:38 Wrangle data to get the relevant data and data in the right shape
33:11 Approaches to divide cohort into 2 groups based on expression
34:41 Bifurcating patients into low and high TP53 expression groups
34:57 Define strata for each patient
38:41 Compute a survival curve using survfit() and create a Kaplan-Meier curve using ggsurvplot()
41:30 survfit() vs survdiff()
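The last few chapters (median split, strata, survfit(), survdiff()) can be sketched roughly as below. This is a minimal illustration on simulated data, not the code from the video: the data frame `dat` and the column names `overall_survival`, `deceased`, `tp53_vst` and `strata` are assumptions made for the example, and only functions from the survival package are used.

```r
library(survival)

set.seed(42)
n <- 100

# Simulated stand-in for the merged clinical + expression table (hypothetical columns)
dat <- data.frame(
  overall_survival = round(rexp(n, rate = 1 / 1500)) + 1,  # follow-up time in days
  deceased         = rbinom(n, 1, 0.5) == 1,               # TRUE = event, FALSE = censored
  tp53_vst         = rnorm(n, mean = 10, sd = 1)           # made-up transformed expression
)

# Split the cohort into two groups at the median expression value
median_value <- median(dat$tp53_vst)
dat$strata <- ifelse(dat$tp53_vst >= median_value, "HIGH", "LOW")

# survfit() estimates the Kaplan-Meier curve for each group
fit <- survfit(Surv(overall_survival, deceased) ~ strata, data = dat)
print(fit)

# survdiff() runs the log-rank test comparing the two curves
lr <- survdiff(Surv(overall_survival, deceased) ~ strata, data = dat)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(p_value)
```

With survminer installed, the fitted object can then be plotted with `ggsurvplot(fit, data = dat, pval = TRUE, risk.table = TRUE)`. Because the data here are random, the p-value carries no biological meaning.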
You can show your support and encouragement by buying me a coffee:
www.buymeacoffee.com/bioinfor...
To get in touch:
Website: bioinformagician.org/
Github: github.com/kpatel427
Email: khushbu_p@hotmail.com
#bioinformagician #bioinformatics #survival #survminer #survivalanalysis #kaplanmeier #tcga #gdcportal #tcgaportal #nci #cran #bioconductor #funcotator #variantcalling #variants #gatk #vcf #gvcf #haplotype #alleles #geneticvariants #mutations #gff3 #gff #gtf #sam #bam #phred #fasta #fastq #singlecell #10X #ensembl #biomart #annotationdbi #annotables #affymetrix #microarray #affy #ncbi #genomics #beginners #tutorial #howto #omics #research #biology #GEO #rnaseq #ngs
Best channel for any bioinformatician ❤❤
As someone who doesn't have a degree in Bioinformatics I am truly able to appreciate these things. Never stop making these videos!!
This is literally exactly the resource I was looking for several months ago. Glad to finally have it now. It's so nice to have example code and clear explanation.
I'm going to do survival analysis tomorrow, and I found that you updated this video. It's so, so helpful! You're my goddess😍😘
Hi, I appreciate your work. Thanks for making these videos
Thank you very much for this very informative tutorial. Very helpful indeed.
You are the best! Thanks
I can't agree more
Thank you so much for your amazing content. I just wanted to know how I could extract the TCGA counts for some non-coding regions specified in a bed file. Suggestions would be really helpful. Thanks!
Thanks so much. This video is really useful. Also, how can we prepare the data to combine different factors for a forest plot or to construct risk models? Could you please share similar R code? Thanks again!
Thank you for your video! I just have a question: why did you extract the unstranded counts and not any other count type?
Thank you very much for the excellent presentation. I am relatively new to TCGA-based R analysis. I was wondering if I can apply the same process to plot survival curves for a particular mutation using SNV data, such as the effect of BRCA1 mutation on the overall survival of ovarian cancer patients. Are there any significant changes that I need to make in the workflow to achieve this?
Could you please make a video on integration of Chip-seq and RNA-seq data?
I definitely plan to! Please stay tuned :)
@Bioinformagician Thank you! I appreciate that.
Thank you for this very helpful video.
If I want to know the correlation (Pearson R-value) between some genes in TCGA-Breast Cancer, do I have to use fpkm_unstrand? Could you make a video about this?
Again, I really appreciate your effort!!
thanks. Could you please run a tut on combining Machine Learning in R and TCGA or cbioportal or Gdac or others? Thank you.
If you ever write a book, let me know cause I'll pay 2000 euro to get it hands down!
Greetings Miss Khushbu! Again a powerful video, and it was really comprehensive. I have one question and would appreciate your guidance on it.
Suppose we perform survival analysis on RNA-seq data from TCGA, and the expression matrix has 20K genes and 200 patients. After survival analysis I found 30 genes that have a significant survival difference, so I want to go further and fit a multivariate Cox regression on these 30 genes. My confusion is about which expression matrix to use in the multivariate Cox model. Should we reduce the initial expression matrix to only the 30 genes as variables (columns) and 200 patients (rows), or should we use the original expression matrix (20K genes and 200 patients) and only put the 30 genes in the Cox equation:
coxph(Surv(time, event) ~ gene1 + gene2 + gene3 + ... + gene30, data)?
Will highly appreciate your comment on that.
Thanks and keep doing the great work.
I don't recommend reducing the matrix to 30 genes. You should use the entire dataset and provide the 30 genes in the Cox equation. Also, check for multicollinearity among the 30 genes, as correlations between genes can make the model estimates unstable. If collinearity is found, you should use feature selection methods to keep the most relevant and independent predictors in the model.
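A minimal sketch of that advice, assuming simulated data in place of TCGA: the data frame `dat`, the gene names, and the 0.8 correlation cut-off are all made up for illustration, and three genes stand in for the 30. Columns of `dat` that are not named in the formula are simply ignored by coxph().

```r
library(survival)

set.seed(1)
n <- 200

# Hypothetical expression table: patients in rows, genes in columns
expr <- as.data.frame(matrix(rnorm(n * 5), nrow = n,
                             dimnames = list(NULL, paste0("gene", 1:5))))
# Make gene2 deliberately collinear with gene1 for the demonstration
expr$gene2 <- expr$gene1 + rnorm(n, sd = 0.1)

dat <- data.frame(
  time  = rexp(n, rate = 0.01),   # simulated follow-up time
  event = rbinom(n, 1, 0.6),      # 1 = event, 0 = censored
  expr
)

genes_of_interest <- c("gene1", "gene2", "gene3")

# Pairwise Pearson correlations; flag pairs above an arbitrary 0.8 threshold
cors <- cor(dat[, genes_of_interest])
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
collinear_pairs <- data.frame(
  gene_a = rownames(cors)[high[, "row"]],
  gene_b = colnames(cors)[high[, "col"]]
)
print(collinear_pairs)

# Fit the Cox model on the full data frame, naming only the genes of interest
cox_formula <- as.formula(
  paste("Surv(time, event) ~", paste(genes_of_interest, collapse = " + "))
)
fit <- coxph(cox_formula, data = dat)
print(summary(fit))
```

Building the formula from a character vector avoids typing 30 gene names by hand. A flagged pair such as gene1/gene2 here would be a candidate for dropping one member or for a feature selection step before interpreting the coefficients.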
@Bioinformagician Many thanks. Highly appreciate your reply.
Can you please make video on top colleges of msc bioinformatics in India?
good
Have you ever heard or done MFA & mixOmics DIABLO analysis on TCGA data?
How do we change the number of days up to which follow-up is done? Say, instead of 8000 days, I want the data only up to 4000 days.
risk.table is showing the following error: Error: 'yaml_body' is not an exported object from 'namespace:xfun'. Can you please help?
Hello ma'am, I am getting pvalue = 47.07. Results are not significant. How do I solve this? What could be the reason for this?
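One possible way to restrict follow-up, sketched under assumptions: censor every patient at the chosen cut-off, so that anyone followed beyond 4000 days is treated as event-free at day 4000. The toy data frame and the column names `time`, `event`, `time_trunc` and `event_trunc` are made up for illustration and are not from the video.

```r
library(survival)

# Toy data: follow-up time in days and event indicator (1 = event, 0 = censored)
dat <- data.frame(
  time  = c(500, 3900, 4000, 4500, 8000),
  event = c(1, 0, 1, 1, 0)
)

cutoff <- 4000

# Events observed after the cut-off no longer count; follow-up is capped at the cut-off
dat$event_trunc <- ifelse(dat$time > cutoff, 0, dat$event)
dat$time_trunc  <- pmin(dat$time, cutoff)
print(dat)

# Kaplan-Meier estimate on the truncated follow-up
fit <- survfit(Surv(time_trunc, event_trunc) ~ 1, data = dat)
print(summary(fit))
```

If the goal is only to zoom the plot rather than change the analysis, passing `xlim = c(0, 4000)` to ggsurvplot() limits the displayed axis without altering the underlying data or the p-value.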
Hello didi .. I need to talk to you .. can you pls send ur contact details .. it's about my current project .. i have some questions based on bioinformatics
Dividing into groups is not good practice. Regression should be used. I did my whole thesis on this debate.
Why do we usually choose “unstranded” data for analysis? @bioinformagician @khushbu, could you please answer this query?
I chose unstranded data for demonstration purposes. If your data is generated using a stranded protocol, you should choose stranded or reverse stranded accordingly.
@Bioinformagician thank you
@Bioinformagician I had the same question as @Divya and I got my answer.