Question: What is an optimal normalization approach for comparing GTEx RNASeq normal vs TCGA tumor samples?
Samir (United States) wrote 3.5 years ago:

I am trying to compare RNA-seq read counts from GTEx normal tissue samples with tissue-matched TCGA tumor samples. Since TCGA also has a few normals, I am trying to merge those adjacent-tissue normals (for solid tumors) with the matched tissue from GTEx. While merging the GTEx normals with the TCGA normals, I want to remove batch effects (presumably differences in library preparation, protocol and sequencing platform between GTEx and TCGA, assuming intra-group variation is minimal) and make the GTEx samples serve as normals for the matching TCGA tumor types.

For the breast cancer set, I took read-count-level data for a total of 72 normals (6 from TCGA BRCA, the remaining 66 from GTEx) and 100 BRCA tumor samples, and ran DESeq2. In brief:

  1. Merge count-level data from GTEx and TCGA, keeping only genes (GENCODE v19) present in both sets.

  2. The sample info has a group factor with two levels, gtex (66) and pcawg (106), and a sample_type factor with two levels, normal (72) and tumor (100).

  3. My DESeqDataSet is set up like this:

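library(DESeq2)

# Counts are genes x samples; colData has one row per sample with 'group' and 'sample_type'
dds <- DESeqDataSetFromMatrix(countData = pcawg_data,
                              colData = pcawg_id,
                              design = ~ group + sample_type)

# Drop genes with at most one read in total across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]

# rlog transform for visualisation (PCA); blind = FALSE lets the transform use the design
rld <- rlog(dds, blind = FALSE, fast = TRUE)
head(assay(rld), 3)

# Fit the model; parallel = TRUE uses the registered BiocParallel backend
dds <- DESeq(dds, parallel = TRUE)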
  4. The PCA plot after the rlog transform shows that the batch effect is still there: the TCGA normals are far from the GTEx normals.

  5. I also tried correcting the batch effect with surrogate variable analysis (the sva package), but no luck.

mod <- model.matrix(~ sample_type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0,

[Figure: surrogate variables]

This is true for a few other tumor types too.

[Figure: PCA plots using rlog-transformed data for five tumor types]
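A check of this kind can be reproduced without any Bioconductor dependency by running a PCA on log-transformed counts and comparing the first component between studies. The sketch below uses simulated counts, not the GTEx/TCGA data; `sim_counts`, `study`, the sample sizes and the threefold shift are all illustrative assumptions.

```r
set.seed(1)
n_genes <- 200
# Hypothetical study labels: 12 samples from one study, 8 from the other
study <- factor(rep(c("gtex", "tcga"), times = c(12, 8)))

# Simulated counts: Poisson noise around a gene-specific mean
base_mean <- rexp(n_genes, rate = 1 / 200) + 5
sim_counts <- sapply(seq_along(study), function(j) rpois(n_genes, base_mean))

# Inflate half of the genes in one study to mimic a study-level batch effect
shifted <- seq_len(n_genes / 2)
sim_counts[shifted, study == "tcga"] <- sim_counts[shifted, study == "tcga"] * 3

# PCA on log2(count + 1), samples as rows
log_expr <- log2(sim_counts + 1)
pca <- prcomp(t(log_expr), center = TRUE, scale. = FALSE)

# Fraction of variance per component and mean PC1 score per study
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pc1_by_study <- tapply(pca$x[, 1], study, mean)
print(round(var_explained[1:2], 2))
print(pc1_by_study)
```

If PC1 splits the samples by study rather than by sample type, as it does for the normals described above, the study effect is dominating the transformed data.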

I have read a few of the recently published best practices for RNA-seq normalization and will run RUVSeq on FPKM-level data. However, this seems challenging given that the normal samples originate from two different studies with no samples in common. It would be good to get comments from experts here on an optimal normalization approach, if any, for these datasets.
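Part of the difficulty can be seen from the sample layout alone. The base-R sketch below rebuilds a sample table from the counts given in the post (66 GTEx normals, 6 TCGA normals, 100 TCGA tumors) and checks the `~ group + sample_type` design; the name `sample_info` is an assumption standing in for the poster's `pcawg_id`.

```r
# Sample table mirroring the numbers in the post
sample_info <- data.frame(
  group = factor(rep(c("gtex", "pcawg", "pcawg"), times = c(66, 6, 100))),
  sample_type = factor(rep(c("normal", "normal", "tumor"), times = c(66, 6, 100)))
)

# Cross-tabulate study against sample type to expose empty cells
design_counts <- table(sample_info$group, sample_info$sample_type)
print(design_counts)

# The additive design is estimable only if the model matrix has full column rank
design <- model.matrix(~ group + sample_type, data = sample_info)
full_rank <- qr(design)$rank == ncol(design)

# Samples that let the model separate the study effect from the tumor effect
n_bridge <- design_counts["pcawg", "normal"]
```

The design is full rank, but there are no GTEx tumors, so only three of the four cells are populated. In that layout the study term is pinned down by comparing the 6 TCGA normals with the GTEx normals, and the tumor-versus-normal coefficient is effectively driven by the TCGA tumors against those same 6 normals. Any correction therefore rests on a small number of bridging samples.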



Tags: rna-seq, deseq2, sva, normalization
syrttgump (USA/Newark/New Jersey Institute of Technology) wrote 20 months ago:

I found this: a dataset of normalized TCGA and GTEx RNA-seq data.

max.amicone wrote 2.1 years ago:

Hello Samir! I'm facing the same problem you posted here. Did you find a good solution? Thank you in advance, Massimo.

mforde84 wrote 20 months ago:

I recently did some normalization work on the harmonized COAD dataset. Granted, it's just TCGA data, but the principles should be the same when adding samples from other studies.

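library(edgeR)  # also attaches limma

normalize_counts <- function(raw_counts, batch) {
    y <- DGEList(counts = raw_counts)
    # Drop genes with zero counts in every sample
    y <- y[rowSums(y$counts == 0) != ncol(raw_counts), ]
    # Keep genes with average log2-CPM above 1
    A <- aveLogCPM(y)
    y2 <- y[A > 1, ]
    y3 <- calcNormFactors(y2, method = "TMM")
    # voom with sample quality weights; the argument name is 'normalize.method'
    dge <- voomWithQualityWeights(y3, normalize.method = "quantile", plot = FALSE)
    # Return a matrix of batch-adjusted log2-CPM values
    removeBatchEffect(dge, batch = batch)
}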

Maybe this might work.
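One caveat when the batch is partly confounded with tumor status, as it is for GTEx versus TCGA: removing the batch term without telling the model about the condition can also remove tumor signal. The base-R sketch below illustrates regression-based batch removal that keeps a condition term in the model, which is the general idea behind supplying a design matrix to a function such as limma's removeBatchEffect. It is not limma's code; `remove_batch_lm`, the simulated data and the effect sizes are all hypothetical.

```r
set.seed(42)
n_genes <- 50
# Two batches with unbalanced condition: batch and tumor status are partly confounded
batch <- factor(rep(c("A", "B"), times = c(10, 10)))
condition <- factor(rep(c("normal", "tumor", "normal", "tumor"), times = c(7, 3, 3, 7)))

# Simulated log-expression: batch B shifted by 2, tumors shifted by 1
log_expr <- matrix(rnorm(n_genes * 20, mean = 8, sd = 0.3), nrow = n_genes)
log_expr <- sweep(log_expr, 2, 2 * (batch == "B") + 1 * (condition == "tumor"), "+")

remove_batch_lm <- function(x, batch, design) {
  # Sum-to-zero coding so that removing the batch term leaves the overall mean unchanged
  batch_mat <- model.matrix(~ batch, contrasts.arg = list(batch = "contr.sum"))[, -1, drop = FALSE]
  # Fit design and batch jointly for every gene (genes become columns of the response)
  fit <- lm.fit(cbind(design, batch_mat), t(x))
  beta_batch <- fit$coefficients[-seq_len(ncol(design)), , drop = FALSE]
  # Subtract only the fitted batch component
  x - t(batch_mat %*% beta_batch)
}

design <- model.matrix(~ condition)
corrected <- remove_batch_lm(log_expr, batch, design)

# Refit on the corrected data: batch coefficients vanish, tumor effect is kept
refit <- lm.fit(cbind(design, as.numeric(batch == "B")), t(corrected))
print(round(mean(refit$coefficients[2, ]), 2))  # average tumor effect
print(max(abs(refit$coefficients[3, ])))        # largest residual batch coefficient
```

With real GTEx/TCGA data the separation depends entirely on the few TCGA normals, so the corrected values should be treated as suitable for visualisation rather than as input to a differential expression test.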
