How to use ESTIMATE to infer tumor purity and stromal score from RNA-seq data?
1
3
4.7 years ago
lhaiyan3 ▴ 70

Dear all:

Has anyone used ESTIMATE (http://bioinformatics.mdanderson.org/main/ESTIMATE:Overview) to infer tumor purity and stromal score from RNA-seq data before? I am not clear on how to use this tool or what input file format it expects. The documented example below has only a few steps, and I could not figure out how to load my own data to run the program. Thanks very much for your help.

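library(estimate)
# Path to the example expression file shipped with the package
OvarianCancerExpr <- system.file("extdata", "sample_input.txt", package="estimate")
# Keep the genes ESTIMATE uses and write them to a GCT file
filterCommonGenes(input.f=OvarianCancerExpr, output.f="OV_10412genes.gct", id="GeneSymbol")
estimateScore("OV_10412genes.gct", "OV_estimate_score.gct", platform="affymetrix")
# Plot the purity estimate for sample s516
plotPurity(scores="OV_estimate_score.gct", samples="s516", platform="affymetrix")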


best

Haiyan Lei

rna-seq • 8.3k views
0

Dear lhaiyan3, were you able to figure out how to load your own data to run the program? It is not clear what the input file for this tool should be. I appreciate your help and time!

0

@Haiyan Lei and @Raheleh, did you figure out how to run the program on your own data? Please post an update if you have managed to.

3
4.3 years ago
sina.nassiri ▴ 90

The ESTIMATE algorithm (Yoshihara et al. 2013 Nature Communications) consists of two steps. In the first step, enrichment scores are calculated using single-sample GSEA (Barbie et al. 2009 Nature). Note that although immune cells are essentially part of the stroma, Yoshihara et al. calculated two enrichment scores: one based on immune-related genes, which they referred to as the "immune" score, and one based on non-immune genes, which they referred to as the "stromal" score. The final ESTIMATE score is the sum of the immune and stromal enrichment scores. In the second step, the ESTIMATE score is converted to tumor purity using the following formula:

Tumor purity = cos(0.6049872018 + 0.0001467884 × ESTIMATE score)

where "Tumor purity" represents ABSOLUTE-based tumor purity (ABSOLUTE is another algorithm that computes tumor purity from somatic DNA copy number alterations), and "ESTIMATE score" represents the ESTIMATE enrichment score obtained from TCGA Affymetrix data, as explained above. The key point is that this calibration formula was derived using only Affymetrix data, and therefore cannot be used to convert RNAseq-based ESTIMATE score to tumor purity. That being said, you may still apply the single-sample GSEA algorithm to properly normalized RNAseq data to obtain ESTIMATE enrichment scores, and include them as a covariate in your downstream analysis to account for tumor purity.
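
For reference, the calibration formula above can be written as a small R function. This is only a sketch: the function name `estimate_to_purity` and the example scores are made up for illustration, and, as noted, the result is only meaningful for ESTIMATE scores derived from Affymetrix data.

```r
# Calibration quoted above: purity = cos(0.6049872018 + 0.0001467884 * ESTIMATE score).
# Only applicable to Affymetrix-derived ESTIMATE scores.
estimate_to_purity <- function(estimate_score) {
  cos(0.6049872018 + 0.0001467884 * estimate_score)
}

# Hypothetical ESTIMATE scores, for illustration only
example_scores <- c(low = -2000, mid = 0, high = 3000)
purity <- estimate_to_purity(example_scores)

# Higher ESTIMATE score (more stromal/immune signal) gives lower purity
print(round(purity, 3))
```

Applying this function to RNAseq-based scores would run without error, but the numbers would not be calibrated purity estimates.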

0

This does not answer the question.

0

"The key point is that this calibration formula was derived using only Affymetrix data, and therefore cannot be used to convert RNAseq-based ESTIMATE score to tumor purity" ... How does this not answer the question?

0

I think you can definitely use ESTIMATE with RNA-seq data as this was done by the authors themselves. See the tool's website.

2

First of all, "as this was done by X" is rarely the right way to verify the assumptions of a computational algorithm. Second, ESTIMATE is published and the R code is publicly available for anyone to review. By default, the ESTIMATE R package only accepts "affymetrix", "agilent", or "illumina" microarray data as input. Can you feed normalized RNAseq data to ESTIMATE as input? You surely can! ESTIMATE uses single-sample GSEA to compute immune and stromal scores; it then adds them up to get the ESTIMATE score, which one can use for downstream analyses. In fact, this is what is provided on their website for TCGA RNAseq data. However, you can't plug these scores into their formula to calculate tumor purity, as this formula was derived specifically for microarray data.
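
A rough sketch of what feeding your own normalized RNAseq matrix to the package could look like is given below. Several things here are assumptions rather than documented behavior: the helper names `write_estimate_input` and `run_estimate_scores` are hypothetical, the input layout (tab-delimited, genes in rows, samples in columns) is assumed to match the `sample_input.txt` file shipped with the package, and `platform = "illumina"` is picked only because the purity formula is Affymetrix-specific, so please check the package source for how each platform is handled. The toy matrix uses made-up gene names, so the ESTIMATE call itself is left as a commented usage line.

```r
# Hypothetical helper: write a genes-by-samples matrix as tab-delimited text
# (assumed to match the layout of the package's sample_input.txt)
write_estimate_input <- function(expr, path) {
  write.table(expr, file = path, sep = "\t", quote = FALSE, col.names = NA)
  invisible(path)
}

# Hypothetical wrapper around the two ESTIMATE calls from the question;
# requires the estimate package and real gene symbols as row names
run_estimate_scores <- function(input_file, out_prefix) {
  library(estimate)
  common_file <- paste0(out_prefix, "_common.gct")
  score_file <- paste0(out_prefix, "_scores.gct")
  filterCommonGenes(input.f = input_file, output.f = common_file, id = "GeneSymbol")
  # platform choice is an assumption, see the note above
  estimateScore(common_file, score_file, platform = "illumina")
  score_file
}

# Toy matrix standing in for normalized RNAseq data; names are made up
set.seed(1)
expr <- matrix(round(runif(12, 0, 10), 2), nrow = 4,
               dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                               c("sample1", "sample2", "sample3")))
input_file <- write_estimate_input(expr, file.path(tempdir(), "my_rnaseq_expr.txt"))

# With a real expression matrix one would then call, for example:
# score_file <- run_estimate_scores(input_file, "my_rnaseq")
```

The resulting stromal, immune and ESTIMATE scores could then be used as covariates, without converting them to purity.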

0

I vaguely remember the opinion that statistical methods developed on array data are not suited to RNA-Seq, and that this has something to do with RNA-Seq being a zero-sum game (the total number of reads sequenced is fixed). But I cannot remember the details. Could you explain this in a bit more detail? Thanks

0

Why does the ESTIMATE score differ for the same TCGA ID between certain datasets? For example, Yoshihara (2013) and Aran (2015) record ESTIMATE for TCGA-BL-A3JM as -1365.01 and 0.9193, respectively. Both use RNASeqV2.

Is this a reflection of differing calculation methods? Which is the true ESTIMATE score used to calculate purity?