
Question

Understanding how the enrichment score is calculated in the GSEA software from the Broad Institute


2.2 years ago

synat.keam ▴ 100

Hi all,

Apologies in advance for the long question; I would really appreciate it if you could read it and explain.

I have been reading the GSEA paper from the Broad Institute (https://www.pnas.org/content/102/43/15545) and trying to understand the statistics in it, particularly how the score is assigned, but I have not fully understood it. Assume I have RNA-Seq data from two groups, treated and untreated tumours. As I understand it, to perform GSEA we give the GSEA software the normalized counts of all genes; the algorithm then calculates the log2 fold change of each gene between the two groups and ranks the genes in decreasing order (i.e. the gene with the highest log2 fold change sits at the top and the gene with the lowest log2 fold change sits at the bottom).

Quote from the paper

Step 1: Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic
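To make the quoted walk concrete, here is a minimal Python sketch of the running sum. It is an illustration under stated assumptions, not the GSEA software's actual code: the function name `enrichment_score`, the toy gene names, and the toy ranking values are all made up, and the weighting exponent `p = 1` is assumed to match the paper's default.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, walk) for one gene set over a ranked gene list.

    ranked_genes: gene names, already sorted by the ranking metric (descending).
    scores: the ranking metric for each gene, in the same order.
    gene_set: the set S. p: weighting exponent (assumed 1 here).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    # Total weight of all genes in S; each hit contributes its share of this.
    n_r = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    if n_r == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, walk = 0.0, []
    for s, hit in zip(scores, in_set):
        if hit:
            running += abs(s) ** p / n_r   # step up, sized by the gene's metric
        else:
            running -= 1.0 / n_miss        # step down by a constant amount
        walk.append(running)

    # ES is the point of the walk furthest from zero, with its sign kept.
    es = max(walk, key=abs)
    return es, walk


# Hypothetical toy data: six genes ranked by a made-up metric.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top, _ = enrichment_score(genes, metric, {"g1", "g2"})
es_bottom, _ = enrichment_score(genes, metric, {"g5", "g6"})
print(round(es_top, 3))     # 1.0
print(round(es_bottom, 3))  # -1.0
```

In this toy case a set clustered at the top gives a positive ES and a set clustered at the bottom gives a negative ES. A set sitting in the middle (for example `{"g3", "g4"}`) still contributes steps to the walk, but the walk never gets far from zero.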

It was stated in the paper that "We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L".

Why do the extreme top and bottom matter? Is it because they correspond to the highest and lowest log fold changes among the genes?

What happens if a gene from, for example, a hypoxia gene set is found in the middle of the list? Does that gene still get a score? If so, what is the absolute value of the score based on? Put simply, what score is given when a hit (Phit) is found at the top of the list, in the middle of the list, etc.?

The paper mentions that "the magnitude of the increment depends on the correlation of the gene with the phenotype". Does this mean the correlation of a gene's log2 fold change or counts with the phenotype? If my assumption is right, is it only okay to compute the correlation with the phenotype when the phenotype is a continuous variable? In some cases, however, the phenotype is just treated versus untreated...
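For a two-class phenotype such as treated versus untreated, one commonly used ranking metric is the signal-to-noise ratio, which to my knowledge is the default in the GSEA desktop software (treat that as an assumption and check the GSEA documentation). The sketch below computes only the basic form of that ratio for a single gene; the function name and the expression values are hypothetical, and any extra adjustments the real software applies to the standard deviations are left out.

```python
from statistics import mean, stdev


def signal_to_noise(group_a, group_b):
    """Basic signal-to-noise ratio for one gene: (mean_a - mean_b) / (sd_a + sd_b).

    group_a, group_b: expression values of the gene in each phenotype class.
    Needs at least two samples per class so the standard deviation is defined.
    """
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))


# Hypothetical expression values for one gene.
treated = [10.0, 12.0, 14.0]    # mean 12, sd 2
untreated = [4.0, 5.0, 6.0]     # mean 5, sd 1
print(round(signal_to_noise(treated, untreated), 3))  # 2.333
```

A metric like this gives every gene a signed value even though the phenotype is categorical, so the genes can be ranked from "higher in treated" to "higher in untreated" and the size of each step in the running sum can be taken from that value.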

========================================

Sentence from the paper: "The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S."

I do not understand what they mean by "decreasing it when we encounter genes not in S". How do they calculate the score for Pmiss?

Another quote from the paper

The ES is the maximum deviation from zero of Phit – Pmiss.
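For reference, the two terms can be written out as below. This is my transcription of the definitions in the paper's appendix, so please check it against the original. The symbols are: $r_j$ is the ranking metric of gene $g_j$, $N$ is the number of genes in the ranked list L, $N_H$ is the number of genes in S, $i$ is the position reached in the list, and $p$ is the weighting exponent (1 in the paper's standard setting).

```latex
P_{\mathrm{hit}}(S,i)  = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R},
\qquad N_R = \sum_{g_j \in S} |r_j|^{p}

P_{\mathrm{miss}}(S,i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}

ES(S) = P_{\mathrm{hit}}(S,i^{*}) - P_{\mathrm{miss}}(S,i^{*}),
\qquad i^{*} = \arg\max_{i} \left| P_{\mathrm{hit}}(S,i) - P_{\mathrm{miss}}(S,i) \right|
```

Each term is divided by its own total ($N_R$ for hits, $N - N_H$ for misses), so both are fractions that grow from 0 to 1 along the list regardless of how many genes are in S. Their difference therefore starts and ends at 0 and can deviate in either direction in between.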

My question.

Given that GSEA takes all genes as input, I think Phit will always be lower than Pmiss, because the number of genes in a specific gene set is much smaller than the total number of genes in the ranked list. Am I right? If so, would we always get a negative score? That is just my thinking. However, it is normal to see both positive and negative enrichment scores in GSEA, and I do not understand this either.

I find the concept quite hard to understand and would really appreciate it if the GSEA experts here could elaborate. Sorry for the long question. I assume I am not the only one who does not fully understand this, and thanks in advance to everyone who takes the time to read and answer.

Thanks,

synat

GSEA • 686 views
