Question: How to deal with batch effects in a TCGA RNA-seq dataset
14 months ago by
tujuchuanli60 wrote:


I performed a differential expression (DE) analysis on TCGA datasets and identified DE genes between cancer and normal samples with edgeR. During the review of my paper, one of the reviewers asked how I dealt with batch effects.

I checked the edgeR manual. It can handle batch effects by adding the batch to the design matrix, for example “design <- model.matrix(~Batch+Treatment)” in section 3.4.3. However, the batch must be specified by the user, for example “Batch <- factor(c(1,3,4,1,3,4))”. To do that, I need to know which sample belongs to which batch, and I do not know this for the TCGA datasets. edgeR also provides a function called “plotMDS” to check for batch effects in a dataset, but I don't know how to interpret this plot properly.
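For readers following along, here is a minimal sketch of what the manual's additive model looks like end to end. It assumes the batch labels are known, which is exactly what is missing for TCGA, and it uses simulated counts rather than real data. The sample layout, the gene and sample names, and the "normal"/"tumor" labels are illustrative assumptions, not part of the original question.

```r
library(edgeR)

set.seed(1)
# Simulated counts: 200 genes x 6 samples (illustrative only, not TCGA data)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))

# Batch labels as in the manual's example; Treatment labels are assumed here
Batch     <- factor(c(1, 3, 4, 1, 3, 4))
Treatment <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)

# MDS plot: label points by treatment and colour them by batch.
# If samples separate by colour (batch) along a leading dimension
# rather than by label (treatment), that points to a batch effect.
plotMDS(y, labels = Treatment, col = as.integer(Batch))

# Additive model: batch is adjusted for, treatment is the effect of interest
design <- model.matrix(~ Batch + Treatment)
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = "Treatmenttumor")
topTags(res)
```

Because the simulated data contain no real batch or treatment signal, the MDS plot and the top table here only demonstrate the mechanics, not a meaningful result.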

Do you know how to deal with batch effects in TCGA RNA-seq datasets? Can you explain how to identify a batch effect in an MDS plot?

Thanks in advance.

batch effect rna-seq tcga • 1.2k views

Can you show the plot?

How to add images to a Biostars post

written 14 months ago by ATpoint34k
14 months ago by
dario.garvan460 wrote:

TCGA doesn't provide much useful information for quality control, so you won't be able to supply known batches. An alternative is to use housekeeping genes as controls whose expression should be made more similar between samples. RUVSeq has functions for estimating batch effects from such genes or from spiked-in molecules, and it integrates seamlessly with edgeR. See its vignette for examples of how to use it.
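A rough sketch of the housekeeping-gene approach, assuming RUVSeq's `RUVg` function and simulated counts. The control gene set below is a hypothetical placeholder; in a real analysis it would be a curated list of housekeeping genes, and the number of factors `k` would need to be chosen with care.

```r
library(edgeR)
library(RUVSeq)

set.seed(1)
# Simulated counts: 500 genes x 6 samples (illustrative only, not TCGA data)
counts <- matrix(rnbinom(500 * 6, mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
Treatment <- factor(rep(c("normal", "tumor"), each = 3))

# HYPOTHETICAL control set: stands in for a curated housekeeping gene list
housekeeping <- paste0("gene", 1:50)

# Estimate k = 1 factor of unwanted variation from the control genes
ruv <- RUVg(counts, cIdx = housekeeping, k = 1)
W   <- ruv$W

# Include the estimated factor as a covariate in the edgeR design
design <- model.matrix(~ W + Treatment)
y   <- calcNormFactors(DGEList(counts))
y   <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = ncol(design))  # last column is the treatment effect
topTags(res)
```

The estimated factor goes into the design matrix and the original counts are still used for testing, rather than the adjusted counts that `RUVg` also returns.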

14 months ago by
European Union
kristoffer.vittingseerup3.2k wrote:

Unfortunately, there are no spike-ins in TCGA data, so I would be careful about using RUVSeq as Dario suggests. There are, however, two other options:

  1. You can subset the TCGA data to contain only paired samples (healthy and tumor sample from the same patient) and do a paired analysis. Since the paired samples are processed simultaneously, it is quite difficult to imagine a batch effect between them. If you want to be really strict, you can intersect the results of the paired and unpaired analyses (the unpaired one being the analysis you already have) and only call genes significant if they are identified in both. Due to the number of samples you will probably need to use voom + limma and its "duplicateCorrelation()"; see section 9.7 of the limma vignette.
  2. You can use SVA, which is designed for "identifying and estimating surrogate variables for unknown sources of variation", i.e. the batch effects in the TCGA data, which you can then take into account in your model.
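Both options can be sketched on simulated data, as below. Everything about the data is an assumption made for illustration: the paired layout, the hidden batch that inflates the first 100 genes in half of the patients, and the choice of a single surrogate variable (`n.sv = 1`). In a real analysis the number of surrogate variables would normally be estimated rather than fixed.

```r
library(edgeR)  # also attaches limma
library(sva)

set.seed(1)
n_pat  <- 8
n_gene <- 500
# Hypothetical paired layout: one normal and one tumor sample per patient
Patient <- factor(rep(paste0("P", seq_len(n_pat)), each = 2))
Tissue  <- factor(rep(c("normal", "tumor"), times = n_pat))

# Simulated hidden batch (unknown in practice): the last 8 samples
# (patients 5 to 8) have 3x higher means for the first 100 genes
hidden <- rep(c(1, 3), each = n_pat)
mu <- matrix(100, n_gene, 2 * n_pat)
mu[1:100, ] <- sweep(mu[1:100, ], 2, hidden, "*")
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = n_gene,
                 dimnames = list(paste0("gene", seq_len(n_gene)),
                                 paste0(Patient, "_", Tissue)))
y <- calcNormFactors(DGEList(counts))

# Option 1: paired analysis with voom + duplicateCorrelation, patient as block
design <- model.matrix(~ Tissue)
v      <- voom(y, design)
corfit <- duplicateCorrelation(v, design, block = Patient)
fit    <- lmFit(v, design, block = Patient,
                correlation = corfit$consensus.correlation)
paired_res <- topTable(eBayes(fit), coef = "Tissuetumor", number = Inf)

# Option 2: estimate a surrogate variable with svaseq and add it to the model
mod0  <- model.matrix(~ 1, data = data.frame(Tissue))
svobj <- svaseq(counts, design, mod0, n.sv = 1)
sv    <- as.matrix(svobj$sv)
colnames(sv) <- paste0("SV", seq_len(ncol(sv)))
design_sv <- cbind(design, sv)

y_sv    <- estimateDisp(y, design_sv)
sva_res <- topTags(glmQLFTest(glmQLFit(y_sv, design_sv), coef = "Tissuetumor"),
                   n = Inf)
```

For the strict variant mentioned in option 1, the significant genes from `paired_res` could be intersected with those from the existing unpaired analysis.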