How to deal with batch effects in a TCGA RNA-seq dataset
5.2 years ago
tujuchuanli ▴ 100


I did a differential expression (DE) analysis on TCGA datasets and identified DE genes between cancer and normal samples with edgeR. During the review of my paper, one of the reviewers asked how I dealt with batch effects.

I checked the edgeR manual. It can handle batch effects by adding the batch to the design matrix, for example “design <- model.matrix(~Batch+Treatment)” in section 3.4.3. However, the batch has to be specified by the user, for example “Batch <- factor(c(1,3,4,1,3,4))”. To do that, I must know which sample belongs to which batch, and that is unknown to me for the TCGA datasets. edgeR also provides a function called “plotMDS” to check for batch effects in the data, but I don't know how to interpret this plot properly.

Do you know how to deal with batch effects in TCGA RNA-seq datasets? Can you teach me how to identify a batch effect in an MDS plot?

Thanks in advance.

TCGA RNA-Seq batch-effect • 4.6k views

Can you show the plot?

How to add images to a Biostars post

5.2 years ago
dario.garvan ▴ 520

TCGA doesn't provide much useful information for doing quality control, so you won't be able to input known batches. A good alternative is to use housekeeping genes as controls whose expression should be made more similar between samples. RUVSeq has functions for estimating batch effects from such genes, or from spiked-in molecules, and it integrates seamlessly with edgeR. See its vignette for examples of how to use it.
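Short of a documented batch variable, the sample barcodes themselves carry some processing metadata. A full TCGA aliquot barcode has the layout project-TSS-participant-sample/vial-portion/analyte-plate-center, and the plate or tissue source site (TSS) field is sometimes used as a stand-in for batch. The sketch below only pulls those fields out. Whether they capture the batch structure that matters in your data is an assumption you would have to check, and the barcodes and function names are made up for illustration.

```python
from collections import Counter

def barcode_fields(barcode):
    """Split a full TCGA aliquot barcode into named fields."""
    parts = barcode.split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a full aliquot barcode: {barcode!r}")
    return {
        "tss": parts[1],               # tissue source site
        "participant": parts[2],
        "sample_type": parts[3][:2],   # e.g. 01 = primary tumor, 11 = solid tissue normal
        "plate": parts[5],             # candidate batch proxy (assumption)
        "center": parts[6],
    }

def batch_labels(barcodes, field="plate"):
    """Return one proxy batch label per barcode, in input order."""
    return [barcode_fields(b)[field] for b in barcodes]

# Made-up barcodes, for illustration only.
barcodes = [
    "TCGA-A1-0001-01A-11R-A001-07",
    "TCGA-A1-0002-01A-11R-A001-07",
    "TCGA-B2-0003-11A-11R-A002-07",
]
labels = batch_labels(barcodes)
print(Counter(labels))  # how many samples fall on each plate
```

The resulting labels could be passed to R as the Batch factor in the edgeR design matrix, or used to colour the points in plotMDS to see whether samples group by plate rather than by condition.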

5.2 years ago

Unfortunately there are no spike-ins in TCGA data, so I would be careful using RUVSeq as Dario suggests. There are, however, two other options:

  1. You can subset the TCGA data to contain only paired samples (healthy and tumor sample from the same patient) and do a paired analysis. Since the paired samples are processed simultaneously, it is quite difficult to imagine a batch effect between them. If you want to be really strict, you can intersect the paired and unpaired results (the unpaired analysis is the one you already have) and only call the genes identified in both analyses significant. Due to the number of samples you will probably need to use voom + limma and its "duplicateCorrelation()"; see section 9.7 of the limma vignette.
  2. You can use SVA, which is meant for "identifying and estimating surrogate variables for unknown sources of variation", i.e. the batch effects in the TCGA data, which you can then take into account in your model.
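The bookkeeping for option 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on the TCGA barcode convention that the first three fields identify the patient and that sample type codes 01-09 are tumor and 10-19 are normal. The barcodes, gene names and function names are made up, and the DE testing itself would still be done in edgeR or limma.

```python
from collections import defaultdict

def sample_group(barcode):
    """Return (patient_id, 'tumor' | 'normal' | None) for a TCGA barcode."""
    parts = barcode.split("-")
    patient = "-".join(parts[:3])   # e.g. TCGA-A1-0001
    code = int(parts[3][:2])        # sample type code
    if 1 <= code <= 9:
        return patient, "tumor"
    if 10 <= code <= 19:
        return patient, "normal"
    return patient, None            # controls and other codes are ignored

def paired_patients(barcodes):
    """Patients that have at least one tumor and one normal sample."""
    groups = defaultdict(set)
    for barcode in barcodes:
        patient, group = sample_group(barcode)
        if group is not None:
            groups[patient].add(group)
    return sorted(p for p, g in groups.items() if g == {"tumor", "normal"})

def consensus_genes(paired_de, unpaired_de):
    """Genes called DE in both the paired and the unpaired analysis."""
    return sorted(set(paired_de) & set(unpaired_de))

# Made-up inputs, for illustration only.
barcodes = [
    "TCGA-A1-0001-01A-11R-A001-07",
    "TCGA-A1-0001-11A-11R-A001-07",
    "TCGA-A1-0002-01A-11R-A002-07",
    "TCGA-B2-0003-11A-11R-A002-07",
]
print(paired_patients(barcodes))
print(consensus_genes(["GENE_A", "GENE_B", "GENE_C"],
                      ["GENE_B", "GENE_C", "GENE_D"]))
```

The list of paired patients would define the subset for the paired analysis, and the intersection step is the strict variant that keeps only genes found in both analyses.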
