What batch correction was applied to pan-Cancer mRNA expression data?
2
1
2.8 years ago
user31888 ▴ 100

I need to find out which normalisation (and possibly which batch correction method) was used to produce the pan-Cancer Atlas mRNA expression matrix (file called 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv' found here).

Starting from the raw read counts obtained from the GDC and the same gene panel, I tried FPKM and FPKM-UQ normalisation as described here, but the resulting expression values are nowhere near the range of those in the pan-Cancer mRNA matrix. That may suggest a cross-sample batch correction was applied.

My goal is, starting from raw read counts, to normalise expression data from new samples together with the pan-Cancer mRNA data, in order to get a unified expression matrix and to be able to compare apples to apples basically.

Any information or alternative method would be greatly appreciated.

TCGA pan-Cancer Atlas mRNA normalisation • 3.1k views
0

Is the expression matrix log-transformed?

4
2.8 years ago

My guess (and it is only a guess), given the name of the file, is that this is built from the RSEM quantification results that are present in the Broad Institute's Firehose portal, rather than from read counts.

RSEM uses an EM algorithm to estimate isoform expression values. A length-weighted sum of these values is then used to create gene expression values.

The firehose documentation states that these are normalised like so:

RSEM expression estimates are normalized to set the upper quartile count at 1000 for gene level and 300 for isoform level estimates.
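To make that scaling concrete, here is a minimal Python sketch of upper-quartile normalisation to a target of 1000 for a single sample. It is not the Firehose code: the function name `upper_quartile_normalize`, the toy values, the choice to compute the quartile over non-zero genes only, and the interpolation method are all my assumptions.

```python
from statistics import quantiles


def upper_quartile_normalize(values, target=1000.0, skip_zeros=True):
    """Rescale one sample so that its upper quartile equals `target`.

    Hypothetical helper, not the Firehose implementation.
    skip_zeros=True computes the quartile over non-zero genes only,
    which is an assumption about how the pipeline behaves.
    """
    reference = [v for v in values if v > 0] if skip_zeros else list(values)
    if len(reference) < 2:
        raise ValueError("need at least two usable values to compute a quartile")
    # quantiles(..., n=4) returns [Q1, median, Q3]; index 2 is the upper quartile.
    upper_quartile = quantiles(reference, n=4, method="inclusive")[2]
    if upper_quartile == 0:
        raise ValueError("upper quartile is zero; cannot rescale")
    scale = target / upper_quartile
    return [v * scale for v in values]


# Toy RSEM-like estimates for one sample (made-up numbers).
sample = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
normalized = upper_quartile_normalize(sample)
```

Note that this only rescales each sample independently; it does nothing about differences between batches or platforms.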

Please note that this is definitely NOT a batch correction, and that batch effects have been shown to be a serious problem in PanCancer analyses (although that was shown at the level of somatic variants).

0

Thanks @i.sudbery !

You are right, the expression data from the different TCGA cancer types have been obtained from Firehose pipelines and merged together to form the pan-Cancer Atlas expression matrix 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv'.

Looking at the pipelines used on Firehose ('MapspliceRSEM' here), it seems that RSEM was used for read quantification, and the estimates were then normalised by setting the upper quartile count to 1,000, as you mentioned.

However, when starting from read counts, I still cannot retrieve similar expression values using GetNormalizedMat along with the MedianNorm or QuantileNorm functions from the EBSeq package (manual here).

0

You will not be able to retrieve similar quantifications starting from read counts, because RSEM uses a fundamentally different model to estimate expression compared to a read-counting model.

0

@i.sudbery: That's right.

0
2.2 years ago
igor 12k

There is an explanation of EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv (should be the same as EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv) on TCGA PancanAtlas Synapse:

Contains batch normalized RNASeqV2 mRNA data.

20531 genes (rows) x 11069 samples (columns). ~1.6 GB file size.

1. All Hi-Seq data from UNC were unchanged, with the exception of PRAD (prostate)

2. All data from BCGSC, whether Hi-Seq or GA, were unchanged

3. PRAD batch IDs 312 and 320 were adjusted to remove batch effects. Remaining PRAD data were unchanged.

4. All GA samples from UNC were adjusted to remove platform effects between UNC Hi-Seq and GA samples. The tumor types containing UNC GA samples that were adjusted are UCEC, COAD, and READ.

5. Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.

6. Genes were adjusted using a novel algorithm called EB++; a variant of Empirical Bayes/ComBat algorithm with training/testing features added.
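Going by point 5 above, samples that were adjusted should carry NAs for the removed genes, while "No Change" samples should not. A rough Python sketch of how one might flag likely-adjusted samples by their NA fraction follows. The helper names, sample IDs, gene names and values are all made up for illustration, and this is only a heuristic, not part of EB++ itself.

```python
import math


def is_missing(value):
    """True for None or NaN, the two ways NA is represented in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_fraction_per_sample(rows, sample_ids):
    """Return {sample_id: fraction of genes that are NA}.

    `rows` maps a gene name to a list of values, one per sample,
    in the same order as `sample_ids` (hypothetical layout).
    """
    counts = {sample: 0 for sample in sample_ids}
    n_genes = 0
    for values in rows.values():
        n_genes += 1
        for sample, value in zip(sample_ids, values):
            if is_missing(value):
                counts[sample] += 1
    if n_genes == 0:
        return {}
    return {sample: counts[sample] / n_genes for sample in sample_ids}


# Toy matrix: 4 genes x 3 samples, NaN standing in for NA (made-up data).
nan = float("nan")
sample_ids = ["sample_A", "sample_B", "sample_C"]
toy_rows = {
    "GENE1": [120.5, nan, 98.1],
    "GENE2": [0.0, 15.2, 3.3],
    "GENE3": [560.0, nan, 610.4],
    "GENE4": [42.0, 38.7, 40.1],
}
fractions = na_fraction_per_sample(toy_rows, sample_ids)
# Samples with a non-zero NA fraction are candidates for having been adjusted.
likely_adjusted = [s for s, frac in fractions.items() if frac > 0]
```

With the real file you would read the TSV row by row and treat the literal NA strings as missing; the exact NA encoding in the file is something to check rather than assume.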