Question: What is an optimal normalization approach for comparing GTEx RNASeq normal vs TCGA tumor samples?
Samir (United States) wrote, 5.4 years ago:

I am trying to compare RNA-seq read counts from GTEx normal tissue samples with tissue-matched TCGA tumor samples. Since TCGA also has a few normals, I am trying to merge those adjacent-tissue normals (for solid tumors) with the matched tissue from GTEx. While merging the GTEx normals with the TCGA normals, I want to remove batch effects (presumably differences in library preparation, protocol, and sequencing platform between GTEx and TCGA, assuming intra-group variation is minimal) and make the GTEx samples serve as normals for the matching TCGA tumor types.

For the breast cancer set, I took read-count-level data for a total of 72 normals (6 from TCGA BRCA, the rest from GTEx) and 100 BRCA tumor samples, and ran DESeq2. In brief:

  1. Merge count-level data from GTEx and TCGA, keeping only genes present in both sets (GENCODE v19).

  2. The sample info has a group factor with two levels, gtex (66) and pcawg (106), and a sample_type factor with two levels, normal (72) and tumor (100).

  3. My DESeqDataSet is built like this.

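library(DESeq2)

# counts: genes x samples; rows of colData must match the count columns
dds <- DESeqDataSetFromMatrix(countData = pcawg_data,
                              colData = pcawg_id,
                              design = ~ group + sample_type)

# drop genes with at most one read across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]

# rlog transform for PCA; blind = FALSE lets the design inform the dispersion estimates
rld <- rlog(dds, blind = FALSE, fast = TRUE)
head(assay(rld), 3)

# parallel = TRUE uses the registered BiocParallel backend
dds <- DESeq(dds, parallel = TRUE)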
  4. The PCA plot after the rlog transform shows that the batch effect is not fixed: the TCGA normals are still far from the GTEx normals.

  5. I also tried correcting the batch effect with surrogate variable analysis (the sva package), but with no luck.

mod <- model.matrix(~ sample_type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0,

[Figure: surrogate variables]

This is true for a few other tumor types too.

[Figure: PCA plots using rlog-transformed data for five tumor types]

I have read a few of the recently published best practices for RNA-seq normalization and will run RUVSeq on FPKM-level data. However, this seems challenging given that the normal samples originate from two different studies with no samples in common. It would be good to get comments from experts here on an optimal normalization approach, if any, for these datasets.
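Steps 1 and 2 of the question can be sketched in base R as follows. This is a minimal illustration, not the original code: the matrices `gtex_counts` and `tcga_counts`, the helper `make_counts`, and the gene IDs are made-up toy data, and only the sample numbers (66 GTEx normals, 6 TCGA normals, 100 TCGA tumors) come from the post. The cross-tabulation makes the layout explicit: there are no GTEx tumors, so in the additive model `~ group + sample_type` the batch coefficient rests on the 6 TCGA normals alone.

```r
# Toy count matrices (hypothetical data); rows are genes, columns are samples.
set.seed(1)
make_counts <- function(genes, n, prefix) {
  matrix(rpois(length(genes) * n, lambda = 50), nrow = length(genes),
         dimnames = list(genes, paste0(prefix, seq_len(n))))
}
gtex_counts <- make_counts(c("ENSG1", "ENSG2", "ENSG3", "ENSG4"), 66, "gtex_")
tcga_counts <- make_counts(c("ENSG2", "ENSG3", "ENSG4", "ENSG5"), 106, "tcga_")

# Step 1: keep only genes present in both sets, in the same row order.
shared <- intersect(rownames(gtex_counts), rownames(tcga_counts))
merged <- cbind(gtex_counts[shared, ], tcga_counts[shared, ])

# Step 2: sample table with the two factors described in the post.
sample_info <- data.frame(
  row.names   = colnames(merged),
  group       = factor(rep(c("gtex", "pcawg"), times = c(66, 106))),
  sample_type = factor(c(rep("normal", 66), rep("normal", 6), rep("tumor", 100)))
)

# Cross-tabulate batch against condition to see how much they overlap.
design_table <- table(sample_info$group, sample_info$sample_type)
print(design_table)

# The additive design is still full rank, so the model can be fitted.
design <- model.matrix(~ group + sample_type, data = sample_info)
stopifnot(qr(design)$rank == ncol(design))
```

A full-rank design only means the coefficients are estimable. With three non-empty cells and three coefficients, the quality of the batch estimate still depends on how representative those few TCGA normals are.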



rna-seq deseq2 sva normalization
syrttgump (USA/Newark/New Jersey Institute of Technology) wrote, 3.6 years ago:

I found this, which is a dataset of normalized TCGA and GTEx RNA-Seq data.

max.amicone wrote, 4.0 years ago:

Hello Samir! I'm facing the same problem you posted here. Did you find any optimal solution to yours? Thank you in advance, Massimo.

mforde84 wrote, 3.6 years ago:

I recently did some normalization work on the harmonized COAD dataset. Granted, it's just TCGA data, but the principles should be the same when adding samples from other studies.

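library(edgeR)  # also attaches limma (voomWithQualityWeights, removeBatchEffect)

normalize_counts <- function(raw_counts, batch) {
    y <- DGEList(counts = raw_counts)
    # drop genes with zero counts in every sample
    y <- y[rowSums(y$counts == 0) != ncol(raw_counts), ]
    # keep genes with an average log2 CPM above 1
    A <- aveLogCPM(y)
    y2 <- y[A > 1, ]
    y3 <- calcNormFactors(y2, method = "TMM")
    dge <- voomWithQualityWeights(y3, normalize.method = "quantile", plot = FALSE)
    # batch-adjusted log2 CPM matrix, intended for plots rather than for testing
    rbe <- removeBatchEffect(dge, batch)
    rbe
}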

Maybe this might work.
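To make the idea behind the last step of that function concrete, here is a base-R sketch of regressing out a batch term while keeping the tumor/normal term in the model. It is a simplified illustration on simulated log-expression values, not limma's actual implementation; the helper `remove_batch_lm`, the sample sizes, and all the data are hypothetical. In limma itself, the comparable protection comes from passing the condition through the `design` argument of `removeBatchEffect`.

```r
# Hypothetical helper: fit expression ~ condition + batch for every gene
# and subtract only the fitted batch component.
remove_batch_lm <- function(logexpr, batch, condition) {
  batch <- factor(batch)
  # Sum-to-zero coding so the batch term is centred across batches.
  contrasts(batch) <- contr.sum(nlevels(batch))
  X_cond  <- model.matrix(~ condition)
  X_batch <- model.matrix(~ batch)[, -1, drop = FALSE]
  # lm.fit handles all genes at once: the response is a samples x genes matrix.
  fit  <- lm.fit(cbind(X_cond, X_batch), t(logexpr))
  beta <- fit$coefficients[-seq_len(ncol(X_cond)), , drop = FALSE]
  logexpr - t(X_batch %*% beta)
}

# Simulated data: 20 "gtex" normals, 6 "tcga" normals, 24 "tcga" tumors.
set.seed(42)
batch     <- rep(c("gtex", "tcga"), times = c(20, 30))
condition <- factor(c(rep("normal", 20), rep("normal", 6), rep("tumor", 24)))
n_genes   <- 200
batch_shift <- rnorm(n_genes, sd = 2)  # per-gene batch offset
tumor_shift <- rnorm(n_genes, sd = 1)  # per-gene tumor effect
logexpr <- matrix(rnorm(n_genes * 50, mean = 8, sd = 0.3), nrow = n_genes) +
  outer(batch_shift, as.numeric(batch == "tcga")) +
  outer(tumor_shift, as.numeric(condition == "tumor"))

adjusted <- remove_batch_lm(logexpr, batch, condition)

# Gap between the normals of the two batches, before and after adjustment.
normals_gtex <- batch == "gtex"
normals_tcga <- batch == "tcga" & condition == "normal"
tumors       <- condition == "tumor"
gap <- function(m) {
  mean(abs(rowMeans(m[, normals_gtex]) - rowMeans(m[, normals_tcga])))
}
gap_before <- gap(logexpr)
gap_after  <- gap(adjusted)

# The tumor-vs-normal difference within the second batch should survive.
tumor_effect_after <- rowMeans(adjusted[, tumors]) -
  rowMeans(adjusted[, normals_tcga])
```

In this sketch the normals from the two batches line up after adjustment while the tumor shift is retained. The adjusted values are best treated as input for PCA or clustering; for differential expression, keeping the batch term in the model is the more usual route.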
