What is an optimal normalization approach for comparing GTEx RNASeq normal vs TCGA tumor samples?
8.2 years ago
Samir ▴ 200

I am trying to compare RNA-seq read counts from GTEx normal tissue samples with tissue-matched TCGA tumor samples. Since TCGA also has a few normals, I am trying to merge those adjacent-tissue normals (for solid tumors) with the matched tissue from GTEx. While merging the GTEx normals with the TCGA normals, I want to remove batch effects (presumably from differences in library preparation, protocol and sequencing platform between GTEx and TCGA, assuming intra-group variation is minimal) so that GTEx samples can serve as normals for the matching TCGA tumor types.

For the breast cancer set, I took read-count-level data for a total of 72 normals (6 from TCGA BRCA, the remaining 66 from GTEx) and 100 BRCA tumor samples, and ran DESeq2 as per http://www.bioconductor.org/help/workflows/rnaseqGene/#construct In brief:

  1. Merge count-level data from GTEx and TCGA, keeping only genes (gencode v19) present in both sets.

  2. The sample info has a group factor with two levels, gtex (66) and pcawg (106), and a sample_type factor with two levels, normal (72) and tumor (100).

  3. My DESeqDataSet is built like this.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = pcawg_data,
               colData = pcawg_id,
               design = ~ group + sample_type)

# drop genes with almost no reads across all samples
dds <- dds[ rowSums(counts(dds)) > 1, ]

rld <- rlog(dds, blind = FALSE, fast = TRUE)
head(assay(rld), 3)

dds <- DESeq(dds, parallel = TRUE)
  4. PCA plots after the rlog transform show that the batch effect is still there: TCGA normals are far from GTEx normals.

  5. I also tried correcting the batch effect with the surrogate variable analysis package, SVA, as per http://www.bioconductor.org/help/workflows/rnaseqGene/#batch but had no luck.

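library(sva)
dat <- counts(dds, normalized = TRUE)  # normalized counts, as in the linked workflow
dat <- dat[rowMeans(dat) > 1, ]        # drop very low-count genes
mod <- model.matrix(~ sample_type, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)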

Surrogate Variables

This is true for a few other tumor types too.

PCA plots using rlog-transformed data for five tumor types

I have read a few of the recently published best practices for RNA-seq normalization and will run RUVSeq on FPKM-level data. However, this seems challenging given that the normal samples originate from two different studies with no samples in common. It would be good to get comments from experts here on an optimal normalization approach, if any, for these datasets.
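Part of the difficulty can be seen by cross-tabulating the two factors described in the question. The sketch below rebuilds a sample sheet from the stated counts only (66 GTEx normals, 6 TCGA normals, 100 TCGA tumors); `sample_info` and the other object names are hypothetical and no expression data is involved. It suggests that the `~ group + sample_type` design is estimable, but that the study effect rests entirely on the 6 TCGA normals.

```r
# Hypothetical sample sheet rebuilt from the counts stated in the question.
sample_info <- data.frame(
  group       = factor(c(rep("gtex", 66), rep("pcawg", 106))),
  sample_type = factor(c(rep("normal", 66), rep("normal", 6), rep("tumor", 100)))
)

# Cross-tabulate study (batch) against condition.
design_table <- table(sample_info$group, sample_info$sample_type)
print(design_table)

# GTEx contributes no tumors, so study and condition are heavily entangled.
n_gtex_tumor <- design_table["gtex", "tumor"]

# The additive design is still full rank, but only because a handful of
# TCGA normals share a study with the tumors.
X <- model.matrix(~ group + sample_type, data = sample_info)
full_rank <- qr(X)$rank == ncol(X)
```

If the TCGA normals were dropped, `group` and `sample_type` would be perfectly confounded and no method could separate the study effect from the tumor effect.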



RNA-Seq normalization DESeq2 SVA • 7.4k views
6.4 years ago
syrttgump ▴ 50

I found this: https://github.com/mskcc/RNAseqDB, which is a dataset of normalized TCGA and GTEx RNA-Seq data.

6.8 years ago

Hello Samir! I'm facing the same problem you posted here. Did you find any optimal solution to yours? Thank you in advance, Massimo.

6.4 years ago
mforde84 ★ 1.4k

I recently did some normalization work on the harmonized COAD dataset. Granted, it's just TCGA data, but the principles should be the same when adding samples from other studies.

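library(edgeR)  # DGEList, aveLogCPM, calcNormFactors
library(limma)  # voomWithQualityWeights, removeBatchEffect

normalize_counts <- function(raw_counts, batch){
    y <- DGEList(counts = raw_counts)
    # drop genes with zero counts in every sample
    y <- y[rowSums(y$counts == 0) != ncol(raw_counts), ]
    # keep genes whose average log2 CPM is above 1
    A <- aveLogCPM(y)
    y2 <- y[A > 1, ]
    y3 <- calcNormFactors(y2, method = "TMM")
    dge <- voomWithQualityWeights(y3, normalize.method = "quantile", plot = FALSE)
    # batch-adjusted log2 CPM matrix
    rbe <- removeBatchEffect(dge, batch = batch)
    rbe
}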

Maybe this might work.
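To give an intuition for what removing a batch effect from log-expression values means, here is a minimal base-R sketch on simulated data. It is a simplified stand-in, not limma's implementation: it just re-centers each gene's per-batch mean onto the gene's overall mean. The function name `center_batches` and the toy matrix are assumptions made for illustration.

```r
set.seed(1)

# Toy log-expression matrix: 5 genes x 8 samples (simulated, not real data).
logexpr <- matrix(rnorm(40, mean = 5), nrow = 5,
                  dimnames = list(paste0("gene", 1:5), paste0("s", 1:8)))
batch <- factor(rep(c("A", "B"), each = 4))

# Add an artificial shift of +2 to every gene in batch B.
logexpr[, batch == "B"] <- logexpr[, batch == "B"] + 2

# Hypothetical helper: move each batch's per-gene mean onto the overall mean.
center_batches <- function(x, batch) {
  grand <- rowMeans(x)
  for (b in levels(batch)) {
    idx <- batch == b
    x[, idx] <- x[, idx] - rowMeans(x[, idx, drop = FALSE]) + grand
  }
  x
}

corrected <- center_batches(logexpr, batch)

# Per-gene batch means now agree between A and B.
mean_a <- rowMeans(corrected[, batch == "A"])
mean_b <- rowMeans(corrected[, batch == "B"])
```

Note that this kind of centering removes any difference between batches, biological or technical. When batch and condition are confounded, as with GTEx normals versus TCGA tumors, it would also remove the tumor-versus-normal signal unless the condition is accounted for in the model.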

