Question about implementing quality control of TCGA data for survival analysis

4.7 years ago

curious ▴ 750

I am trying to determine if the RNA expression of my favorite gene is associated with reduced survival in patients.

I am really interested in how to incorporate better quality control into my workflow. I have searched but found little information on quality control for TCGA-derived survival analysis.

A super summarized approach of what I am doing:

1. Get raw gene counts from XENA.
2. Use only samples corresponding to tumor tissue.
3. Remove lowly expressed genes.
4. Transform/normalize counts using the variance stabilizing transformation in DESeq2.
5. Code samples as “high expression” for my favorite gene if they are above the median expression level for that gene; otherwise code them “low expression”.
6. Parse clinical data and do Cox regression to determine if high expression leads to reduced survival (I don’t really have questions about the survival analysis itself).
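For illustration, the median split in step 5 could be sketched like this in plain Python. This is only a sketch under assumptions: the sample IDs and values are made up, and `dichotomize_by_median` is a hypothetical helper, not part of DESeq2 or XENA. Samples exactly at the median are coded "low", matching the "above the median" rule.

```python
from statistics import median


def dichotomize_by_median(expression):
    """Label each sample 'high' if strictly above the median, else 'low'."""
    cutoff = median(expression.values())
    return {
        sample: ("high" if value > cutoff else "low")
        for sample, value in expression.items()
    }


# Made-up transformed expression values for one gene (hypothetical sample IDs).
vst_values = {"S1": 8.2, "S2": 9.7, "S3": 7.5, "S4": 10.1, "S5": 9.0}

groups = dichotomize_by_median(vst_values)
for sample, label in groups.items():
    print(sample, label)
```

With an odd number of samples the median sample itself lands in the "low" group, so the two groups will not be exactly equal in size.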

My thoughts on quality control so far are to do a PCA plot after step 4 and remove any samples that don’t group with the population.

When I do that I get this

I think this is where the art kicks in: there is a group separate from the population in the upper right. But it's a grouping, not one sample, so I am hesitant to shave those away.

Any recommendations on my PCA approach or any other ways to incorporate quality control would be greatly appreciated. Thank you!

survival cox regression tcga • 1.8k views

ADD COMMENT • link updated 4.7 years ago by sim.j.baum ▴ 140 • written 4.7 years ago by curious ▴ 750

I would try to find out whether this comes from a technical confounder, such as a different library prep kit, or whether it reflects biological variation such as cancer subgroups. In diffuse large B-cell lymphoma, for example, there are at least two major subgroups that arise from different cells of origin. In my limited experience with batch effects, different kits were a major confounding factor, e.g. a low-input RNA-seq kit vs. standard TruSeq from Illumina. Is this information available for each of the samples? They were probably all produced in the same fashion, but one never knows. Are they all paired-end or all single-end (or at least the same for all samples)? Without additional information it is hard to decide whether this is a technical or a biological issue.

ADD REPLY • link 4.7 years ago by ATpoint 81k

I was thinking about that a little bit too. I don't know if you are familiar with TCGA, but there is some batch information in the barcodes for each sample, including sequencing center, which has been a suspected source of pretty significant batch effects in TCGA data:

https://www.biorxiv.org/content/10.1101/445049v1.full

If I think multiple things (e.g. sequencing center or processing order) may be affecting this plot, what is a good way to see which should be removed? Right now I would just remove them one by one and visually check the PCA plot, maybe starting with whichever factor accounts for the highest percentage of samples forming that subgroup.
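As an alternative to removing factors one by one, each barcode field could be cross-tabulated against membership in the separate PCA group. Below is a rough sketch, assuming full aliquot-level barcodes in the standard TCGA layout (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, center). The barcodes and the outlier set are made up, and `parse_barcode` and `tabulate` are hypothetical helpers.

```python
from collections import Counter


def parse_barcode(barcode):
    """Split a full TCGA aliquot barcode into its fields."""
    _project, tss, _participant, sample_vial, portion_analyte, plate, center = (
        barcode.split("-")
    )
    return {
        "tss": tss,                      # tissue source site
        "sample_type": sample_vial[:2],  # 01-09 are tumor codes
        "analyte": portion_analyte[2:],  # e.g. R for RNA
        "plate": plate,
        "center": center,
    }


def tabulate(barcodes, outlier_group, field):
    """Count samples per (field value, is-in-outlier-group) pair."""
    counts = Counter()
    for bc in barcodes:
        counts[(parse_barcode(bc)[field], bc in outlier_group)] += 1
    return counts


# Made-up barcodes; the last two digits/letters are NOT real TCGA identifiers.
barcodes = [
    "TCGA-AA-0001-01A-11R-P001-07",
    "TCGA-AA-0002-01A-11R-P001-07",
    "TCGA-BB-0003-01A-11R-P002-07",
    "TCGA-BB-0004-01A-11R-P002-07",
    "TCGA-AA-0005-01A-11R-P002-07",
]
# Hypothetical: samples that sit in the separate group on the PCA plot.
outliers = {barcodes[2], barcodes[3]}

by_plate = tabulate(barcodes, outliers, "plate")
by_tss = tabulate(barcodes, outliers, "tss")
for key, n in sorted(by_tss.items()):
    print("tss", key, n)
```

A field whose values line up almost perfectly with the separate group (here the made-up site `BB`) would be the first candidate to investigate; a field spread evenly across both groups is unlikely to explain the separation.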

These are pancreatic cancer samples, which have quite a few different subtypes, so I would not be surprised at all if that explains the data. Unfortunately, annotation of the different subtypes is not great for TCGA data.

ADD REPLY • link 4.7 years ago by curious ▴ 750

I played around with this more this morning, because it is interesting. I separated samples by sequencing center code and that does not seem to be it. The search continues!

ADD REPLY • link 4.7 years ago by curious ▴ 750

Are those samples also popping up as a separate group in the next principal components? Consider that you are looking at a component that explains 16% of the variance in the data. What kind of tumor types are you looking at? In prostate cancer you can have several subtypes (e.g. TMPRSS2_ERG fusion). Or are the samples metastatic vs. non-metastatic? You might also want to look at which genes are the main drivers of this drift. It depends on your final question, but I would not remove this group.
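One way to look at the driver genes is to rank genes by the absolute value of their loading on the component that separates the group. A minimal sketch, assuming you already have per-gene loadings for that component (the gene names and numbers below are made up, and `top_drivers` is a hypothetical helper):

```python
def top_drivers(loadings, n=3):
    """Return the n genes with the largest absolute loading, largest first."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Made-up loadings of five placeholder genes on one principal component.
pc_loadings = {
    "GENE_A": 0.02,
    "GENE_B": -0.61,
    "GENE_C": 0.44,
    "GENE_D": -0.05,
    "GENE_E": 0.30,
}

drivers = top_drivers(pc_loadings)
for gene, loading in drivers:
    print(gene, loading)
```

The sign only tells you the direction along the component; genes with large loadings of either sign are the ones worth checking against known subtype markers.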

ADD REPLY • link 4.7 years ago by sim.j.baum ▴ 140

I am working with internal DESeq2 functions, which do not make it easy to get the other PCs, but I am looking at that right now. What would it mean if they stay separated in PC2, PC3 for example? I am looking at pancreatic cancer tumors, which have many subtypes so that is what I am starting to think may be the cause. I have been coding samples by technical handling, but it's not clear that this is the issue. My guess so far is subtype. Also, is the second component being responsible for 16% of the variance substantial? I am kind of new to principal component analysis.

ADD REPLY • link 4.7 years ago by curious ▴ 750

"What would it mean if they stay separated in PC2, PC3 for example?" That there is higher variance between this group of samples and the others, which argues for broader diversity and maybe multiple factors playing a role.
"many subtypes so that is what I am starting to think may be the cause" - Yes, this will probably play a major role here.
You can look at a scree plot, which shows how much of the variance each PC explains. I find this explanation quite helpful:
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/
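The numbers behind a scree plot are just each component's variance divided by the total variance. A small sketch, assuming you have the per-component standard deviations from whatever PCA routine you use (the values below are made up and `variance_explained` is a hypothetical helper):

```python
def variance_explained(sdev):
    """Convert per-component standard deviations into variance fractions."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]


# Made-up standard deviations for four principal components.
sdev = [4.0, 2.0, 2.0, 1.0]

ratios = variance_explained(sdev)

# Text version of a scree plot: one bar per component.
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:6.1%} {'#' * round(r * 50)}")
```

Whether a given percentage is "substantial" depends on how quickly the bars fall off: a component that clearly stands above the long tail of later components is worth interpreting, while one that sits level with its neighbours is less so.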

ADD REPLY • link 4.7 years ago by sim.j.baum ▴ 140