I am trying to determine if the RNA expression of my favorite gene is associated with reduced survival in patients.
I am particularly interested in how to incorporate better quality control into my workflow. I have searched but found little information on quality control for TCGA-derived survival analysis.
A brief summary of what I am doing:
1. Get raw gene counts from XENA
2. Use only samples corresponding to tumor tissue
3. Remove lowly expressed genes
4. Transform/normalize counts using the variance stabilizing transformation in DESeq2
5. Code samples as “high expression” for my favorite gene if they are above the median expression level for that gene; otherwise code them as “low expression”
6. Parse the clinical data and do Cox regression to determine whether high expression is associated with reduced survival (I don’t really have questions about the survival analysis itself)
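To make steps 2, 3 and 5 concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not my actual R workflow: the function names (`is_tumor`, `filter_low_genes`, `median_split`), the toy sample IDs and counts, and the filtering thresholds are all made up for the example, and a plain log2(x + 1) stands in for the DESeq2 variance stabilizing transformation, which it does not reproduce. The tumor filter assumes standard TCGA barcodes, where the fourth field starts with a two-digit sample-type code and 01 to 09 denote tumor.

```python
import math
from statistics import median

def is_tumor(sample_id):
    """True if the TCGA barcode's sample-type code (4th field) is 01-09, i.e. tumor."""
    fields = sample_id.split("-")
    if len(fields) < 4:
        return False
    code = fields[3][:2]
    return code.isdigit() and 1 <= int(code) <= 9

def filter_low_genes(counts, min_count=10, min_fraction=0.5):
    """Keep genes with >= min_count reads in >= min_fraction of samples.
    Both thresholds are arbitrary assumptions for this sketch."""
    kept = {}
    for gene, row in counts.items():
        n_ok = sum(1 for c in row if c >= min_count)
        if n_ok >= min_fraction * len(row):
            kept[gene] = row
    return kept

def median_split(values):
    """Label each value 'high' if strictly above the median, else 'low'."""
    cutoff = median(values)
    return ["high" if v > cutoff else "low" for v in values]

# Toy data (hypothetical): genes -> raw counts, one column per sample.
samples = ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-11",
           "TCGA-AA-0004-01", "TCGA-AA-0005-01"]
counts = {
    "GENE_A": [120, 30, 500, 80, 200],
    "GENE_B": [0, 2, 1, 0, 3],
}

# Step 2: keep tumor samples only (the "-11" sample is solid tissue normal).
tumor_idx = [i for i, s in enumerate(samples) if is_tumor(s)]
tumor_counts = {g: [row[i] for i in tumor_idx] for g, row in counts.items()}

# Step 3: drop lowly expressed genes.
filtered = filter_low_genes(tumor_counts)

# Step 4 stand-in: log2(x + 1), NOT the DESeq2 VST.
transformed = {g: [math.log2(c + 1) for c in row] for g, row in filtered.items()}

# Step 5: median split on the gene of interest.
groups = median_split(transformed["GENE_A"])
print(dict(zip([samples[i] for i in tumor_idx], groups)))
```

Because the split uses a strict "above the median" rule, samples that sit exactly at the median land in the "low" group, which is worth keeping in mind when many samples share the same value.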
My thoughts on quality control so far are to do a PCA plot after step 4 and remove any samples that don’t group with the population.
When I do that, I get this:
I think this is where the art kicks in: there is a group separate from the main population in the upper right. But it's a group, not a single sample, so I am hesitant to shave those samples away.
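One way I could imagine making the "doesn't group with the population" call less subjective is to score each sample on the first principal component and flag those with an extreme robust z-score (median and MAD based). The sketch below is a rough Python illustration of that idea, not an established TCGA QC procedure: the helper names (`first_pc_scores`, `robust_z`), the toy matrix, and the cutoff of 3.5 are assumptions, and it looks only at PC1, computed by power iteration on the sample-by-sample Gram matrix of gene-centered values.

```python
import math
from statistics import median

def first_pc_scores(matrix, n_iter=200):
    """PC1 score per sample for a genes x samples matrix (genes are the features).
    Uses power iteration on the samples x samples Gram matrix."""
    # Center each gene across samples.
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    n = len(centered[0])
    # gram[i][j] = sum over genes of x[g][i] * x[g][j]
    gram = [[sum(r[i] * r[j] for r in centered) for j in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(n_iter):
        new = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0:
            break
        vec = [v / norm for v in new]
    # Rayleigh quotient gives the top eigenvalue (squared singular value).
    eig = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n))
    return [v * math.sqrt(max(eig, 0.0)) for v in vec]

def robust_z(values):
    """Median/MAD z-scores; 1.4826 rescales the MAD to match a normal SD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [(v - med) / (1.4826 * mad) for v in values]

# Toy transformed expression (hypothetical): 3 genes x 6 samples,
# where the last sample is deliberately different from the rest.
matrix = [
    [5.0, 5.1, 4.9, 5.2, 5.0, 9.0],
    [3.0, 3.1, 2.9, 3.0, 3.2, 7.5],
    [8.0, 7.9, 8.1, 8.0, 7.8, 2.0],
]

scores = first_pc_scores(matrix)
z = robust_z(scores)
cutoff = 3.5  # arbitrary assumption for this sketch
flagged = [i for i, value in enumerate(z) if abs(value) > cutoff]
print("flagged sample indices:", flagged)
```

A rule like this only says which samples are far from the bulk on PC1. It cannot tell me whether a flagged group is a technical artifact or a real subtype, which is exactly the part I am unsure about.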
Any recommendations on my PCA approach or any other ways to incorporate quality control would be greatly appreciated. Thank you!