Question: What batch correction was applied to pan-Cancer mRNA expression data?
user31888 (United States) wrote, 3 months ago:

I need to find out which normalisation (and possibly which batch correction method) was used to produce the pan-Cancer Atlas mRNA expression matrix (file called 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv' found here).

Starting from the raw read counts obtained from the GDC and the same gene panel, I tried FPKM and FPKM-UQ normalisation as described here, but the resulting expression values are nowhere near the range of those in the pan-Cancer mRNA matrix. That might suggest a cross-sample batch correction was applied.
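For reference, a minimal sketch of the two normalisations mentioned above, assuming the usual definitions: FPKM divides each count by the gene length and by the total count of the sample, and FPKM-UQ replaces the total count with the 75th-percentile count. The gene names, counts and lengths are made-up toy values, the function names are hypothetical, and the sketch assumes the input holds only the genes that should enter the denominator (e.g. protein-coding genes).

```python
def upper_quartile(values):
    """75th percentile, with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fpkm(counts, lengths):
    """Count * 1e9 / (total count in the sample * gene length in bp)."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


def fpkm_uq(counts, lengths):
    """Same as FPKM, but scaled by the upper-quartile count instead of the total."""
    uq = upper_quartile(list(counts.values()))
    return {g: c * 1e9 / (uq * lengths[g]) for g, c in counts.items()}


# Toy data for one sample (hypothetical values, not from TCGA).
counts = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 0}
lengths = {g: 1000 for g in counts}

fpkm_values = fpkm(counts, lengths)
fpkm_uq_values = fpkm_uq(counts, lengths)
```

Both are purely within-sample scalings of read counts, so neither can remove batch effects between samples.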

My goal is to normalise expression data from new samples, starting from raw read counts, together with the pan-Cancer mRNA data, so as to get a unified expression matrix and be able to compare apples to apples.

Any information or alternative method would be greatly appreciated.

i.sudbery (Sheffield, UK) wrote, 3 months ago:

My guess (and it is only a guess), given the name of the file, is that this is built from the RSEM quantification results that are present in the Broad Institute's Firehose portal, rather than from read counts.

RSEM uses an EM algorithm to estimate isoform expression values. A length-weighted sum of these values is then used to create gene expression values.

The firehose documentation states that these are normalised like so:

RSEM expression estimates are normalized to set the upper quartile count at 1000 for gene level and 300 for isoform level estimates.
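A minimal sketch of what that scaling amounts to for one sample, assuming it is a simple per-sample rescaling so that the 75th percentile of the gene-level estimates equals 1000. Whether zero estimates are excluded before taking the quartile is an assumption here (exposed as the hypothetical `skip_zeros` parameter), and the gene names and values are made up.

```python
def upper_quartile(values):
    """75th percentile, with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def scale_upper_quartile(estimates, target=1000.0, skip_zeros=True):
    """Rescale one sample so that its upper quartile equals `target`.

    skip_zeros=True drops zero estimates before computing the quartile;
    this is an assumption, not something the quoted documentation states.
    """
    values = [v for v in estimates.values() if v > 0 or not skip_zeros]
    factor = target / upper_quartile(values)
    return {gene: v * factor for gene, v in estimates.items()}


# Toy gene-level estimates for one sample (hypothetical values).
sample = {"g1": 0.0, "g2": 10.0, "g3": 20.0, "g4": 30.0, "g5": 40.0, "g6": 50.0}
scaled = scale_upper_quartile(sample)
```

Each sample gets its own scaling factor, computed from that sample alone, so no information is shared across samples or batches.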

Please note that this is definitely NOT a batch correction, and that batch effects have been shown to be a serious problem in PanCancer analyses (although this was shown at the level of somatic variants).


Thanks @i.sudbery!

You are right, the expression data from the different TCGA cancer types have been obtained from Firehose pipelines and merged together to form the pan-Cancer Atlas expression matrix 'EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv'.

Looking at the pipelines used on Firehose ('MapspliceRSEM' here), it seems that RSEM was used for read quantification, and the estimates were then normalised by setting the upper quartile count to 1,000, as you mentioned.

However, when starting from read counts, I still cannot reproduce similar expression values using GetNormalizedMat along with the MedianNorm or QuantileNorm functions from the EBSeq package (manual here).

(reply written 3 months ago by user31888)

You will not be able to reproduce similar quantifications starting from read counts, because RSEM uses a fundamentally different model to estimate expression compared to a read-counting model.

(reply written 3 months ago by i.sudbery)

@i.sudbery: That's right.

(reply written 3 months ago by user31888)