Question: What batch correction was applied to pan-Cancer mRNA expression data?
16 months ago, user31888 (United States) wrote:

I need to find out which normalisation (and possibly which batch correction method) was used to produce the pan-Cancer Atlas mRNA expression matrix (file called 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv' found here).

Starting from the raw read counts obtained from the GDC and the same gene panel, I tried FPKM and FPKM-UQ normalisation as described here, but the resulting expression values are nowhere near the range of those in the pan-Cancer mRNA matrix. That might suggest that a cross-sample batch correction was applied.
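For reference, the two normalisations mentioned above can be sketched for a single sample as follows. This is only a minimal sketch: it assumes the commonly quoted forms FPKM = count × 10^9 / (gene length × total counts) and FPKM-UQ = count × 10^9 / (gene length × 75th-percentile count), and it assumes every gene passed in belongs in the denominator (the GDC pipeline restricts this to a specific gene set). The function names and the `percentile` helper are illustrative, not part of any GDC tool.

```python
def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to take a percentile of")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fpkm(counts, lengths_bp):
    """FPKM per gene for one sample, scaled by the sample's total count."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def fpkm_uq(counts, lengths_bp):
    """FPKM-UQ per gene: same as FPKM, but scaled by the 75th-percentile count."""
    uq = percentile(counts, 75)
    return [c * 1e9 / (length * uq) for c, length in zip(counts, lengths_bp)]


# Toy sample: four genes with made-up counts and lengths.
toy_counts = [10, 20, 30, 40]
toy_lengths = [1000, 2000, 1000, 2000]
toy_fpkm = fpkm(toy_counts, toy_lengths)
toy_fpkm_uq = fpkm_uq(toy_counts, toy_lengths)
```

Both are per-sample rescalings of length-normalised counts, so neither would be expected to reproduce values that come from a different quantification model or a cross-sample adjustment.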

My goal is to start from raw read counts and normalise expression data from new samples together with the pan-Cancer mRNA data, so as to get a unified expression matrix and be able to compare apples to apples.

Any information or alternative method would be greatly appreciated.

16 months ago, i.sudbery (Sheffield, UK) wrote:

My guess (and it is only a guess), given the name of the file, is that this is built from the RSEM quantification results available in the Broad Institute's Firehose portal, rather than from read counts.

RSEM uses an EM algorithm to estimate isoform expression values. A length-weighted sum of these values is then used to create gene expression values.

The firehose documentation states that these are normalised like so:

RSEM expression estimates are normalized to set the upper quartile count at 1000 for gene level and 300 for isoform level estimates.
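As a rough illustration of that rescaling, here is a minimal sketch for one sample. It assumes the upper quartile is taken over non-zero estimates only (an assumption on my part; set `ignore_zeros=False` otherwise), and both `upper_quartile_normalize` and the `percentile` helper are hypothetical names, not Firehose code.

```python
def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to take a percentile of")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def upper_quartile_normalize(values, target=1000.0, ignore_zeros=True):
    """Rescale one sample so that its upper quartile equals `target`.

    target=1000 matches the gene-level figure quoted above; the quote
    gives 300 for isoform-level estimates.
    """
    kept = [v for v in values if v > 0] if ignore_zeros else list(values)
    uq = percentile(kept, 75)
    if uq == 0:
        raise ValueError("upper quartile is zero; cannot rescale")
    return [v * target / uq for v in values]


# Toy sample of made-up gene-level estimates.
toy_estimates = [0.0, 10.0, 20.0, 30.0, 40.0]
toy_normalized = upper_quartile_normalize(toy_estimates)
```

Note that this is a single multiplicative factor per sample, so it changes the scale of a sample but not the relative ordering of genes within it.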

Please note that this is definitely NOT a batch correction, and that batch effects have been shown to be a serious problem in PanCancer analyses (although that was shown at the level of somatic variants).


Thanks @i.sudbery !

You are right, the expression data from the different TCGA cancer types have been obtained from Firehose pipelines and merged together to form the pan-Cancer Atlas expression matrix 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv'.

Looking at the pipelines used on Firehose ('MapspliceRSEM' here), it seems that RSEM was used for read quantification, and the estimates were then normalised by setting the upper quartile count to 1,000, as you mentioned.

However, when starting from read counts, I still cannot retrieve similar expression values using GetNormalizedMat with the MedianNorm or QuantileNorm functions from the EBSeq package (manual here).

written 16 months ago by user31888

You will not be able to retrieve similar quantifications starting from read counts, because RSEM uses a fundamentally different model to estimate expression than a read-counting model does.

written 16 months ago by i.sudbery

@i.sudbery: That's right.

written 16 months ago by user31888

10 months ago, igor (United States) wrote:

There is an explanation of EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv (should be the same as EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv) on TCGA PancanAtlas Synapse:

File: EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv

Contains batch normalized RNASeqV2 mRNA data.

20531 genes (rows) x 11069 samples (columns). ~1.6 GB file size.

Adjustment procedure:

  1. All Hi-Seq data from UNC were unchanged, with the exception of PRAD (prostate)

  2. All data from BCGSC, whether Hi-Seq or GA, were unchanged

  3. PRAD batch IDs 312 and 320 were adjusted to remove batch effects. Remaining PRAD data were unchanged.

  4. All GA samples from UNC were adjusted to remove platform effects between UNC Hi-Seq and GA samples. The tumor types containing UNC GA samples that were adjusted are UCEC, COAD, and READ.

  5. Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.

  6. Genes were adjusted using a novel algorithm called EB++; a variant of Empirical Bayes/ComBat algorithm with training/testing features added.
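One practical consequence of step 5 is that some genes carry NAs in the adjusted samples, so it helps to know which genes are complete across all samples before comparing against new data. A minimal sketch, assuming the matrix is tab-separated with gene IDs in the first column and missing values written literally as `NA` (both assumptions; the function name and the toy gene/sample IDs are made up):

```python
import csv
import io


def split_genes_by_completeness(tsv_handle, na_token="NA"):
    """Return (sample IDs, genes without NAs, genes with at least one NA)."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)
    complete, with_na = [], []
    for row in reader:
        gene, values = row[0], row[1:]
        if na_token in values:
            with_na.append(gene)
        else:
            complete.append(gene)
    return header[1:], complete, with_na


# Toy matrix standing in for the real ~1.6 GB file.
example = (
    "gene_id\tS1\tS2\tS3\n"
    "GENE_A\t10.5\t0.0\t3.2\n"
    "GENE_B\tNA\t7.1\t2.0\n"
)
samples, complete, with_na = split_genes_by_completeness(io.StringIO(example))
```

For the real file, the same function can be passed an open file handle; it reads one row at a time, so the whole matrix never has to sit in memory.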
