Question: Normal to have high variability in Broad Institute's GSEA Output?
0
gravatar for Kristin Muench
10 days ago by
United States
Kristin Muench310 wrote:

I am doing a Preranked GSEA Analysis on the output of a differential expression analysis (genes ranked by sign(shrunken lfc) * -log10(pvalue) ).

I notice that when I feed the exact same files into GSEA, it can give me fairly different FDRs.

For example, when I ran the analysis the first time, it said it detected "32 gene sets are significantly enriched at FDR < 25%".

Ten minutes later, I ran the same files again, and it detected "4 gene sets are significantly enriched at FDR < 25%".

I know that there is some element of randomness in the GSEA algorithm because it sets a seed based on timepoint, but I'm surprised to see such a large difference in the number of pathways passing threshold from run to run.

Is this to be expected?

Thanks for your help.

UPDATE I also notice that the four gene sets called enriched in Round 2 are completely non-overlapping with the thirty-two gene sets enriched in Round 1. This is surprising and a little concerning to me.

go gsea • 121 views
ADD COMMENTlink modified 10 days ago • written 10 days ago by Kristin Muench310

Did you run the third and fourth times? Are you sure you didn't change any settings between the first two times?

ADD REPLYlink written 10 days ago by h.mon19k

that is startling - are you sure you did not reset some other parameter to the analysis? What software are you using? MIT/Broad's Java application?

ADD REPLYlink written 10 days ago by Malcolm.Cook810

That is correct - I am running MIT/Broad's Java Application (64 bit 8 GB version for Mac). I imported the necessary files in once, clicked the "GSEAPreranked" tab, set up the appropriate files in the drop-down boxes, then pressed "Run" - then waited for that to run - then when it finished, pressed "Run" again - and repeated this a total of four times. Number of genes in that FDR < 25% list were:

Round 1: 32

Round 2: 4

Round 3: 15

Round 4: 55

Although the gene sets called significant were different each time, they were "thematically" similar - so, for example, I kept seeing metabolism, adherens junction, and fertilization-related gene sets.

ADD REPLYlink modified 10 days ago • written 10 days ago by Kristin Muench310

What happens if you lower the FDR to 5% and run several times?

ADD REPLYlink written 10 days ago by h.mon19k

The way the GUI is set up you do not have the option to set a FDR pre-run - rather, it generates a new leading edge analysis each time, and then you set the FDA when interpreting the list/deciding what to visualize (although, in my case, the list would still be different each time, even if you do change the FDA for interpretation).

Here are a couple of Prerank analysis variables I did play with:

  • scoring_scheme: weighted or classic. (Both of these lead to variable gene set lists, as long as rnd_seed is set to Timepoint) (Classic may be less variable overall - in my tests, DE genes w/FDR<25% were 103, 96, 101).

  • rnd_seed: Timepoint, or 149: I suspect this is the culprit, because when I set the seed to 149, I get the same results every time.

It would make sense to me if the random element introduced by Timepoint caused a little shifting, but this seems very dramatic.

ADD REPLYlink modified 9 days ago • written 9 days ago by Kristin Muench310

Getting the same results with the same seed is expected, it would be a bug otherwise. Get different results with different seed is also expected - the problem is, if the analysis is robust, the difference should be small.

Did you try correlating FDR or p-values from different runs?

ADD REPLYlink written 9 days ago by h.mon19k
1

I did! The data is puzzling because the effect seems to be of a different magnitude on FDR q-values vs. other types of metrics generated by GSEA.

Code:

# Set up dataset
varsToSave <- c('NAME', 'NOM p-val', 'FWER p-value', 'RANK AT MAX', 'FDR q-val')

plotData <- merge(round1[,names(round1) %in% varsToSave],
             round2[,names(round2) %in% varsToSave],
             by='NAME')

# Run Simple Correlation
testCor <- function(plotData, var1, var2, title){

  #plot
  plot( plotData[,var1], plotData[,var2], main=title)

  # statistical test of correlation
  myTest <- cor.test(plotData[,var1], plotData[,var2])
  print('....')
  print(title)
  print(myTest)
  print('....')
}

# Use function
testCor(plotData, "NOM p-val.x", "NOM p-val.y", 'Nominal p-value between GSEA Run 1 and 2')

testCor(plotData, "FDR q-val.x", "FDR q-val.y", 'FDR q-value between GSEA Run 1 and 2')

testCor(plotData, "RANK AT MAX.x", "RANK AT MAX.y", 'Rank at Max between GSEA Run 1 and 2')

Results:

Correlation1 t = 703.84, df = 2967, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.9967966 0.9972255 sample estimates: cor 0.9970187

Correlation 2 t = 3655400000, df = 2967, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 1 1 sample estimates: cor 1

Correlation 3 t = 339.5, df = 2967, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.9864366 0.9882444 sample estimates: cor 0.9873726

I furthermore explored which were the outlier points you see on the FDR q-value graph. For these, I observed that the difference between the q-value score in the first and the second analysis could be as high as 0.73. Some culprits:

  • NICOTINAMIDE NUCLEOTIDE BIOSYNTHETIC PROCESS (50 genes)
  • REGULATION OF CELL SHAPE%GOBP%GO:0008360 (92 genes)
  • PID_HIF1_TFPATHWAY%MSIGDB_C2%PID_HIF1 (61)
  • NEGATIVE REGULATION OF CELL CYCLE G2/M PHASE (48 genes)
  • INSULIN IGF PATHWAY-MITOGEN ACTIVATED PROTEIN (28 genes)

(Incidentally, these hits are a few of the 'themes' I see when I run and re-run analyses.

I'm not sure what makes these pathways different. They're not especially big. The max rank can be anywhere from 2 to 432.

I'm not sure how the rank would stay the same but the FDR q-value would change so dramatically. Any ideas?

ADD REPLYlink modified 9 days ago • written 9 days ago by Kristin Muench310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1390 users visited in the last hour