Question

fgsea result "size" parameter - can I pull out size of the original pathway?

2.7 years ago

garfield320 ▴ 20

I'm using the fgsea package in R to run GSEA. According to its documentation, the "size" column in the result refers to the "size of the pathway after removing genes not present in 'names(stats)'".

So if my pathway contained 100 genes, and the "size" in my result is 10, does that mean that 90 genes in my experiment overlapped with my pathway? If this understanding is correct, is there an easy way to pull out the size of the pathway itself, instead of the size after removing the genes that are not present?

GSEA fgsea pathway R • 2.1k views

ADD COMMENT • link updated 2.7 years ago by alserg ▴ 920 • written 2.7 years ago by garfield320 ▴ 20

if my pathway contained 100 genes, and the "size" in my result is 10,

it means that only 10 of the genes had gene-level statistic values in the input. The other 90 genes are not considered, because no rank can be assigned to them.

1) If these are the true numbers, this looks very suspicious: usually most of a pathway's genes should be ranked.
2) In the context of GSEA, it is incorrect to report the full gene set size, because we can't say anything about the genes that were not present.
3) If you really do want this, you can calculate the sizes yourself; it shouldn't be hard.
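A minimal sketch of that calculation in base R, assuming the pathways are a named list of gene ID vectors and the statistics are a named numeric vector (the two inputs fgsea takes). The toy `pathways` and `stats` objects and the `size_table` name are made up for illustration. `ranked_size` is expected to correspond to the "size" column for pathways that fgsea reports, but check that against your own result rather than taking it for granted.

```r
# Toy inputs (hypothetical): a named list of pathways and a named stats vector.
pathways <- list(
  pathwayA = c("G1", "G2", "G3", "G4", "G5"),
  pathwayB = c("G2", "G6", "G7")
)
stats <- c(G1 = 2.1, G2 = -0.5, G6 = 1.3, G9 = 0.2)

# Full size of each gene set as defined in the pathway list
# (unique() guards against duplicated gene IDs).
full_size <- vapply(pathways, function(g) length(unique(g)), integer(1))

# Number of pathway genes that actually have a statistic, i.e. can be ranked.
ranked_size <- vapply(
  pathways,
  function(g) length(intersect(g, names(stats))),
  integer(1)
)

# One row per pathway, with the fraction of the pathway that was ranked.
size_table <- data.frame(
  pathway     = names(pathways),
  full_size   = full_size,
  ranked_size = ranked_size,
  coverage    = ranked_size / full_size,
  row.names   = NULL
)
print(size_table)
```

Since the fgsea result has a "pathway" column, something like `merge(fgseaRes, size_table, by = "pathway")` should attach these columns to it (`fgseaRes` here stands for your own result object).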

ADD REPLY • link 2.7 years ago by alserg ▴ 920

alserg, these are the true numbers. I was checking some pathways manually, and I seem to have many pathways that were identified as significant but have a "size" that is <10% of the total number of genes in the pathway. Is there a general threshold for how many pathway genes should be ranked? Any suggestions on what I should inspect to figure out what's going on with my size values?

ADD REPLY • link 2.7 years ago by garfield320 ▴ 20

What is the length of your input stats vector? It should be at least ~10K.

ADD REPLY • link 2.7 years ago by alserg ▴ 920

alserg, I only have 3K. I should have been clearer in my initial question: I'm using pre-ranked GSEA (ranking by the log of the p value multiplied by the sign of the fold change) to analyze proteomic data acquired by mass spectrometry, and getting 10K proteins from the mass spec would be impossible.
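For reference, a sketch of how such a ranking vector can be built. The table `de` and its column names are hypothetical, and using `-log10` (so that significant up-regulated proteins get large positive scores) is an assumption about the exact form of the metric.

```r
# Hypothetical differential-abundance table; column names are assumptions.
de <- data.frame(
  protein = c("P1", "P2", "P3", "P4"),
  log2FC  = c(1.5, -2.0, 0.3, -0.1),
  pvalue  = c(0.001, 0.01, 0.5, 0.9)
)

# Signed score: magnitude from the p value, direction from the fold change.
score <- -log10(de$pvalue) * sign(de$log2FC)

# fgsea expects a numeric vector named by gene/protein ID.
stats <- setNames(score, de$protein)
stats <- sort(stats, decreasing = TRUE)

# Number of ranked proteins, i.e. the input length asked about above.
length(stats)

# The vector would then be passed on as, e.g.:
# res <- fgsea::fgsea(pathways = pathways, stats = stats)
```

Only IDs present in `names(stats)` can contribute to a pathway's "size", so the protein IDs need to use the same identifier type as the pathway definitions.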

ADD REPLY • link 2.7 years ago by garfield320 ▴ 20

Got it. Then having 10% could be OK. Still, reporting the initial pathway size can be misleading, as these other genes are not considered at all.

ADD REPLY • link 2.7 years ago by alserg ▴ 920

I'm also noticing that some of my pathways have much higher coverage (let's say 90% of the pathway's genes are ranked), but the p values for those pathways are really high, so I filtered them out initially. Is it possible that the length of my input is somehow affecting the statistics?

ADD REPLY • link 2.7 years ago by garfield320 ▴ 20