Question: Question about implementing quality control of TCGA data for survival analysis
16 months ago by
curious460 wrote:

I am trying to determine if the RNA expression of my favorite gene is associated with reduced survival in patients.

I am really interested in how to incorporate better quality control into my workflow. I have searched but found little information on quality control for TCGA-derived survival analysis.

A super summarized approach of what I am doing:

  1. get raw gene counts from XENA

  2. use only samples corresponding to tumor tissue

  3. remove lowly expressed genes

  4. transform/normalize counts using variance stabilizing transformation in DESeq2

  5. code samples as “high expression” for my favorite gene if they are above the median expression level for that gene, otherwise code them “low expression”

  6. parse clinical data and do Cox regression to determine if high expression is associated with reduced survival (I don’t really have questions about the survival analysis itself)
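Steps 2 and 5 above can be sketched in a few lines. This is only an illustration in Python, not the R/DESeq2 workflow described in the question: the barcodes and expression values are made up, and `is_tumor` and `median_split` are hypothetical helper names. It relies on the TCGA barcode convention that the fourth dash-separated field starts with a two-digit sample type code, where 01 to 09 denote tumor samples and 10 to 19 denote normals.

```python
from statistics import median

def is_tumor(barcode):
    """Return True if the barcode's sample type code is 01-09 (tumor)."""
    sample_field = barcode.split("-")[3]   # e.g. "01A" or "01"
    return 1 <= int(sample_field[:2]) <= 9

def median_split(expr_by_sample):
    """Label samples 'high' if strictly above the median, else 'low'."""
    cutoff = median(expr_by_sample.values())
    return {s: ("high" if v > cutoff else "low")
            for s, v in expr_by_sample.items()}

# Made-up normalized expression values for one gene (illustration only)
expr = {
    "TCGA-AA-0001-01A": 8.2,
    "TCGA-AA-0002-01A": 6.1,
    "TCGA-AA-0003-11A": 5.0,   # 11 = solid tissue normal, filtered out
    "TCGA-AA-0004-01A": 7.4,
    "TCGA-AA-0005-01A": 9.0,
}

# Step 2: keep tumor samples only; step 5: code high/low around the median
tumor_expr = {s: v for s, v in expr.items() if is_tumor(s)}
groups = median_split(tumor_expr)
print(groups)
```

Note that the median is computed after the normal sample is dropped, so the cutoff reflects the tumor population only.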

My thoughts on quality control so far are to do a PCA plot after step 4 and remove any samples that don’t group with the population.

When I do that I get this


I think this is where the art kicks in: there is a group separate from the rest of the population in the upper right. But it's a group, not a single sample, so I am hesitant to shave those samples away.

Any recommendations on my PCA approach or any other ways to incorporate quality control would be greatly appreciated. Thank you!

regression survival cox tcga • 636 views
ADD COMMENTlink modified 16 months ago by sim.j.baum50 • written 16 months ago by curious460

I would try to find out whether this comes from a technical confounding effect, such as a different library prep kit, or whether it is biological variation reflecting cancer subgroups. In Diffuse Large B-Cell Lymphoma, for example, you have at least two major subgroups that arise from different cells of origin. In my limited experience with batch effects, I found different kits to be a major confounding factor, e.g. a low-input RNA-seq kit vs. standard TruSeq from Illumina. Is this information available for every sample? They should probably all have been produced in the same fashion, but one never knows. Are they all paired-end or all single-end (at least the same for all samples)? Without additional information it is hard to decide whether this is a technical or a biological issue.

ADD REPLYlink modified 16 months ago • written 16 months ago by ATpoint41k

I was thinking about that a little bit too. I don't know if you are familiar with TCGA, but there is some batch information in the barcode of each sample, including the sequencing center, which has been a suspected source of pretty significant batch effects in TCGA data:
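For reference, a minimal sketch of pulling those batch fields out of a barcode. It assumes full aliquot-level barcodes of the form `TCGA-TSS-participant-sample-portion-plate-center`; sample-level IDs (as often used for expression matrices) are truncated and do not carry the plate or center fields. The barcodes below are made up and `parse_barcode` is a hypothetical helper name.

```python
from collections import Counter

def parse_barcode(barcode):
    """Split an aliquot-level TCGA barcode into its batch-related fields."""
    parts = barcode.split("-")
    if len(parts) < 7:
        raise ValueError(f"not an aliquot-level barcode: {barcode}")
    return {
        "tss": parts[1],              # tissue source site
        "participant": parts[2],
        "sample_type": parts[3][:2],  # 01-09 tumor, 10-19 normal
        "plate": parts[5],
        "center": parts[6],           # sequencing/characterization center
    }

# Made-up barcodes (illustration only)
barcodes = [
    "TCGA-AA-0001-01A-01R-0100-07",
    "TCGA-AA-0002-01A-01R-0100-07",
    "TCGA-BB-0003-01A-01R-0200-13",
]

fields = [parse_barcode(b) for b in barcodes]
center_counts = Counter(f["center"] for f in fields)
plate_counts = Counter(f["plate"] for f in fields)
print(center_counts, plate_counts)
```

Tabulating each field like this gives the candidate batch variables (tissue source site, plate, center) that can then be used to color the PCA plot.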

If I think multiple things (i.e. sequencing center or processing order) may be affecting this plot, what is a good way to see which should be removed? Right now I would just remove them one by one and visually inspect the PCA plot, maybe starting with whichever factor accounts for the highest percentage of samples in that subgroup.
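One alternative to removing factors one at a time would be to score how strongly each candidate factor is associated with the component that separates the subgroup. The sketch below is an assumption on my part rather than something established in this thread: it computes eta squared (between-group sum of squares divided by total sum of squares, as in a one-way ANOVA) for each factor against made-up PC scores. The factor names and values are hypothetical.

```python
from statistics import mean

def eta_squared(scores, labels):
    """Share of variance in `scores` explained by the grouping `labels`."""
    grand = mean(scores)
    total_ss = sum((x - grand) ** 2 for x in scores)
    if total_ss == 0:
        return 0.0
    groups = {}
    for x, lab in zip(scores, labels):
        groups.setdefault(lab, []).append(x)
    between_ss = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    return between_ss / total_ss

# Made-up PC scores for 8 samples: the last 4 form the separate cluster
pc_scores = [-1.0, -1.2, -0.8, -1.0, 3.0, 3.2, 2.8, 3.0]

# Hypothetical technical annotations for the same 8 samples
factors = {
    "center": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "plate":  ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
}

# Rank factors from most to least associated with the PC scores
ranking = sorted(
    ((eta_squared(pc_scores, labs), name) for name, labs in factors.items()),
    reverse=True,
)
for eta, name in ranking:
    print(f"{name}: eta^2 = {eta:.3f}")
```

A factor with eta squared near 1 lines up almost perfectly with the split, while one near 0 is unlikely to be driving it. A high value shows association only: a technical factor can be confounded with a biological subtype.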

These are pancreatic cancer samples, which have quite a few different subtypes, so I would not be surprised at all if that explains the data. Unfortunately, subtype annotation is not great for TCGA data.

ADD REPLYlink written 16 months ago by curious460

I played around with this more this morning, because it is interesting. I separated samples by sequencing center code and that does not seem to be it. The search continues!


ADD REPLYlink modified 16 months ago • written 16 months ago by curious460

Are those samples also popping up as a separate group in the next principal components? Consider that you are looking at only 16% of the variance in the data. What kind of tumor types are you looking at? In prostate cancer you can have several subtypes (e.g. TMPRSS2_ERG fusion). Or are the samples metastatic vs. non-metastatic? You might also want to look at which genes are the main drivers of this separation. It depends on your final question, but I would not remove this group.

ADD REPLYlink written 16 months ago by sim.j.baum50

I am working with internal DESeq2 functions, which do not make it easy to get the other PCs, but I am looking at that right now. What would it mean if they stay separated in PC2, PC3 for example? I am looking at pancreatic cancer tumors, which have many subtypes, so that is what I am starting to think may be the cause. I have been looking at coding samples by technical handling, but it's not clear that is the issue. My guess so far is subtype. Also, is the second component being responsible for 16% of the variance substantial? I am kind of new to principal component analysis.
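One way to put a figure like 16% in context is to look at the share of variance carried by every component, not just the first two. Each share is that component's variance divided by the total variance across all components. The sketch below assumes you already have the per-component standard deviations (for example the `sdev` element returned by R's `prcomp`); the numbers here are made up.

```python
from itertools import accumulate

def variance_explained(sdev):
    """Convert per-PC standard deviations into proportions of variance."""
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    return [v / total for v in variances]

# Made-up standard deviations for five components (illustration only)
sdev = [5.0, 4.0, 2.0, 2.0, 1.0]

prop = variance_explained(sdev)
cumulative = list(accumulate(prop))

for i, (p, c) in enumerate(zip(prop, cumulative), start=1):
    print(f"PC{i}: {p:.1%} of variance (cumulative {c:.1%})")
```

If the shares drop off sharply after the first few components, those components capture most of the structure; if they decline slowly, no single component is very informative on its own.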

ADD REPLYlink modified 16 months ago • written 16 months ago by curious460

"What would it mean if they stay separated in PC2, PC3 for example?" That there is a higher variance between this group of samples and the others - Arguing for a broader diversity and maybe multiple factors playing a role.
"many subtypes so that is what I am starting to think may be the cause" - Yes this will probably play a major role here.
You can look up the "scree" plot, which shows how much of the variance each PC explains. I find this explanation quite helpful:

ADD REPLYlink written 15 months ago by sim.j.baum50