Question

Enrichment Analysis with Fisher's exact test

0

10 weeks ago

beepboop • 0

I have a set of SNPs that are significantly associated with metabolite levels. To understand the enrichment pattern in ncRNA, I used Fisher's exact test in R and obtained the following results:

p-value = 0.0149
OR = 0.2248
CI = [0.0271, 0.8180]

Looking at the OR, ncRNA is depleted in my case. I would like to understand how the p-value and CI influence the interpretation. Since the CI does not include 1 and the p-value is below the significance level, does this mean that ncRNA is SIGNIFICANTLY depleted? Or something entirely different?

I have other examples, e.g. miRNA:

p-value = 0.7808
OR = 0.5851
CI = [0.0706, 2.1277]

Based on the OR, miRNA is also depleted. In contrast to ncRNA, its p-value is above the significance level and the CI includes 1. Does this in turn mean that miRNA is NOT significantly depleted, but is still depleted? Could it be that the p-value and CI should not be the focus of the interpretation, and only the OR should?

Thanks a lot for your time!

enrichment fisher • 402 views

ADD COMMENT • link updated 10 weeks ago by i.sudbery 19k • written 10 weeks ago by beepboop • 0

score 2 · Accepted Answer · 2024-02-13

2

10 weeks ago

i.sudbery 19k

If you are counting which transcripts of a particular category of genomic feature (e.g. ncRNA) overlap the associated SNPs, and which don't, then you cannot just do a Fisher's test. There are a range of problems with this, but two big ones are:

Long RNAs are more likely to overlap any given set of SNPs than shorter ones, and ncRNAs tend to be shorter than, e.g., coding RNAs, so you are likely to find a depletion of ncRNAs for this reason alone.
Different base compositions are more or less likely to be mutated than others, so, e.g., C-rich features will carry more variation than A-rich ones.
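
The length effect in the first point can be illustrated with a small Python sketch. It assumes SNPs land uniformly at random along the genome, and all the numbers (genome size, SNP count, feature lengths) are hypothetical values picked for illustration, not taken from the question:

```python
def overlap_probability(feature_length, genome_length, n_snps):
    """Probability that at least one of n_snps SNPs, each placed uniformly
    at random on the genome, falls inside a feature of the given length."""
    p_single = feature_length / genome_length
    return 1.0 - (1.0 - p_single) ** n_snps


# Hypothetical values, for illustration only.
GENOME_LENGTH = 3_000_000_000  # bp
N_SNPS = 10_000
SHORT_NCRNA = 500              # bp, a short feature
LONG_CODING = 50_000           # bp, a long feature

p_short = overlap_probability(SHORT_NCRNA, GENOME_LENGTH, N_SNPS)
p_long = overlap_probability(LONG_CODING, GENOME_LENGTH, N_SNPS)

print(f"short feature overlapped with probability {p_short:.4f}")
print(f"long feature overlapped with probability  {p_long:.4f}")
```

Under these assumptions the long feature is overlapped far more often than the short one, even though nothing biological distinguishes them.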

I'm pretty sure there are tools designed to do exactly this (although I don't know them off the top of my head).

That aside, your interpretation of the p-values is correct: the "true" OR for ncRNA is likely somewhere between 0.0271 and 0.8180 (likely because ncRNAs are short). Thus it is not likely that a true OR of 1 or more would have generated this data.

The "true" OR for miRNAs is likely somewhere between 0.0706 and 2.1277 (the interval is probably this wide because there are fewer miRNAs). Thus it is entirely plausible that a true OR of 1 or more could have generated this data, just as it is plausible that a true OR of less than 1 could have generated it.
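
The reading of the two results above can be written down as a small Python sketch. The function name `describe` and the default threshold of 0.05 are assumptions (the question does not state which significance level was used); the numbers passed in are the ones reported in the question:

```python
def describe(odds_ratio, ci_low, ci_high, p_value, alpha=0.05):
    """Summarise a Fisher's exact test result.

    alpha=0.05 is an assumed threshold, not one given in the question.
    """
    direction = "depleted" if odds_ratio < 1 else "enriched"
    ci_excludes_one = ci_high < 1 or ci_low > 1
    if ci_excludes_one and p_value < alpha:
        return f"significantly {direction}"
    return (f"point estimate suggests {direction}, "
            "but the data are compatible with no effect")


# Values reported in the question.
print("ncRNA:", describe(0.2248, 0.0271, 0.8180, 0.0149))
print("miRNA:", describe(0.5851, 0.0706, 2.1277, 0.7808))
```

The ncRNA result comes out as significantly depleted. For miRNA the point estimate is below 1, but the interval spans 1, so the data do not distinguish depletion from no effect or mild enrichment.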

ADD COMMENT • link 10 weeks ago by i.sudbery 19k

0

Thank you very much for this!

ADD REPLY • link 10 weeks ago by beepboop • 0

0

I'd like to know what the connection is between ncRNA being short and the "true" OR being somewhere in [0.0271, 0.8180]. Could you please explain that?

ADD REPLY • link 10 weeks ago by beepboop • 0

1

Under the null hypothesis there is no connection between a SNP being in an ncRNA and it being connected to metabolite levels - SNPs are randomly distributed across the ncRNA and the "not-ncRNA" classes and any apparent pattern is just chance. However, because ncRNAs are short, if you scatter SNPs randomly they are very unlikely to fall in ncRNAs.
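
A rough simulation of that null, as a Python sketch: SNPs are scattered uniformly over a toy genome in which ncRNAs occupy only a small share of the bases. The genome size, the 1% base share, the 30% share of features by count, and the layout (all ncRNA bases lumped into one block) are hypothetical simplifications, not properties of any real annotation:

```python
import random


def fraction_in_ncrna(genome_length, ncrna_bases, n_snps, seed=0):
    """Scatter n_snps uniformly over the genome and return the fraction
    that land in ncRNA. For simplicity all ncRNA bases are modelled as
    one block at the start of the genome."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_snps)
        if rng.random() * genome_length < ncrna_bases
    )
    return hits / n_snps


# Hypothetical toy annotation: ncRNAs are 30% of features by count,
# but cover only 1% of the bases because they are short.
NCRNA_SHARE_OF_FEATURES = 0.30
GENOME_LENGTH = 1_000_000
NCRNA_BASES = 10_000

frac = fraction_in_ncrna(GENOME_LENGTH, NCRNA_BASES, n_snps=100_000)
print(f"share of features that are ncRNA: {NCRNA_SHARE_OF_FEATURES:.2f}")
print(f"share of random SNPs in ncRNA:    {frac:.4f}")
```

With these made-up numbers, randomly placed SNPs land in ncRNA about as often as the share of bases would predict, which is far less often than the share of features. A test that counts features would read that gap as depletion even though the SNPs were placed at random.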

ADD REPLY • link 10 weeks ago by i.sudbery 19k