Question: How To Determine The Statistical Significance Of Overlap (Intersect) Between Three Sets
5
6.0 years ago by
bsmith030465140
United States
bsmith030465140 wrote:

I have three overlapping sets and I want to find the probability of observing a larger intersection for 'A intersect B intersect C' than the one I see (in the example below, the probability of finding more than 135 elements that are common to sets A, B and C). For a two-set problem, I guess I would use a Fisher or chi-square test. Here is what I have attempted so far:

```
### Prepare a 3-way contingency table:
mytable <- array(c(135, 116, 385, 6256,
                   48, 97, 274, 9555),
                 dim = c(2, 2, 2),
                 dimnames = list(
                   Is_C = c('Yes', 'No'),
                   Is_B = c('Yes', 'No'),
                   Is_A = c('Yes', 'No')))

## test (the last dimension, Is_A, is treated as the stratum)
mantelhaen.test(mytable, exact = TRUE, alternative = "greater")
```

Is this the right test (along with the current parameters) to determine what I want, or is there a more appropriate test for this?
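For reference, a minimal sketch of what the table above implies. It recovers the three set sizes from the margins and compares the observed triple overlap with the count expected if membership in A, B and C were independent. The variable names (`nA`, `nB`, `nC`, `expected_abc`) are only illustrative. Note that `mantelhaen.test` tests conditional independence of the first two dimensions (here `Is_C` and `Is_B`) within the strata of the last one (`Is_A`), which is a different question from how large the three-way intersection is.

```r
# Same 2 x 2 x 2 table as in the question
mytable <- array(c(135, 116, 385, 6256,
                   48, 97, 274, 9555),
                 dim = c(2, 2, 2),
                 dimnames = list(
                   Is_C = c("Yes", "No"),
                   Is_B = c("Yes", "No"),
                   Is_A = c("Yes", "No")))

N  <- sum(mytable)               # total number of genes in the table
nC <- sum(mytable["Yes", , ])    # size of set C
nB <- sum(mytable[, "Yes", ])    # size of set B
nA <- sum(mytable[, , "Yes"])    # size of set A

observed_abc <- mytable["Yes", "Yes", "Yes"]   # genes in all three sets

# Expected triple overlap if the three memberships were independent
expected_abc <- N * (nA / N) * (nB / N) * (nC / N)

c(N = N, nA = nA, nB = nB, nC = nC,
  observed = observed_abc, expected = expected_abc)
```

The ratio of observed to expected gives a quick sense of scale before any formal test is chosen.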

R statistics • 7.3k views
modified 2.7 years ago by Biostar ♦♦ 20 • written 6.0 years ago by bsmith030465140
1

I was going to suggest you also post this at Cross Validated, but then I saw this! Glad biostars are more responsive...

I'm interested to hear what others say as to whether mantelhaen is the right test here. Don't forget that if your sets are genomic intervals, the standard methods are less likely to apply because of the non-randomness of the genome. E.g., if all 3 of your datasets are likely to occur in gene bodies, then that is the real relationship, but it will make them appear to co-occur if you consider the entire genome as the background.

Each set consists of a group of genes, and I'm trying to see if the overlap is significant. All the sets are drawn from the full complement of genes across the genome (~17k). Does that answer your question?

Can you tell us if you are looking for genomic overlap?

3
6.0 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

I think you probably want the multivariate version of the hypergeometric. You can find an implementation and documentation on that for R here:

For a strawman case, if we assume that there is no bias, I'm not sure if the above models will apply.

1

That may well be. Can you elaborate? Bias is a loaded term.

1
6.0 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

The approaches described in this report - Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets - may be useful to you, depending on exactly what you're comparing and where you want to take the results. The report is available here. Although the authors discuss an approach for dealing with gene set enrichment using GO terms when not all genes are equally annotated, the approach could be applied to other labels of the entities for which you are looking for overlap/enrichment.

That looks like an interesting paper. In the abstract, they say that their method "is able to predict biologically meaningful results that are obscured by the many false-positive enrichment scores that occur in FET (Fisher's Exact Test)...." I wonder if simply using an FDR correction with FET would correct for some of this. I've done this in the past, but a quick search to find some support for this idea turns up this related paper with a potentially useful Perl package (from the same paper) for doing these calculations.

FDR (which we often employ) and FET may be adequate. We have not yet done what is described in the paper to which I linked, but intend to. It is an interesting approach indeed.
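As a rough illustration of the FET-plus-FDR idea discussed above (not the method from the linked paper), here is a sketch that runs a one-sided Fisher's exact test for several gene sets against one reference set and then applies a Benjamini-Hochberg adjustment with `p.adjust`. All counts, the background size `N`, and the names `ref_size` and `sets` are made up for the example.

```r
# Hypothetical inputs: background size and a reference set of interest
N        <- 17000   # assumed background (~17k genes, as in the question)
ref_size <- 800     # hypothetical size of the reference set

# Hypothetical gene sets: their sizes and their overlaps with the reference set
sets <- data.frame(set     = c("s1", "s2", "s3", "s4"),
                   size    = c(400, 250, 600, 150),
                   overlap = c(45, 12, 30, 20))

# One-sided Fisher's exact test (enrichment) for each gene set
sets$p <- mapply(function(size, overlap) {
  tab <- matrix(c(overlap,                          # in set, in reference
                  size - overlap,                   # in set, not in reference
                  ref_size - overlap,               # not in set, in reference
                  N - size - ref_size + overlap),   # in neither
                nrow = 2)
  fisher.test(tab, alternative = "greater")$p.value
}, sets$size, sets$overlap)

# Benjamini-Hochberg FDR adjustment across all tests
sets$fdr <- p.adjust(sets$p, method = "BH")
sets
```

The adjustment only controls for the number of tests performed; it does not address the annotation bias that the paper is concerned with.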

Interesting paper - will go into it a little later.

At the moment, I'm just trying to get a 'strawman' probability. If we assume independence and no bias (i.e. assume that there are ~17k numbered balls in an urn), what is the probability of finding more than 135 balls that are common to all three draws?
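One way to sketch this urn calculation, assuming the three sets are independent draws without replacement of fixed sizes from the same N genes: the size of A∩B is then hypergeometric, and given |A∩B| = k, the number of those k genes that also land in C is again hypergeometric. Summing over k gives the tail probability. The set sizes below are the ones implied by the contingency table in the question (N = 16866, |A| = 6892, |B| = 396, |C| = 842); the variable names are illustrative only.

```r
# Set sizes implied by the margins of the question's 2 x 2 x 2 table
N   <- 16866   # total genes
nA  <- 6892    # |A|
nB  <- 396     # |B|
nC  <- 842     # |C|
obs <- 135     # observed |A n B n C|

# Possible sizes k of the pairwise intersection A n B
k <- 0:min(nA, nB)

# P(|A n B| = k): B is a draw of nB genes, nA of the N genes are "marked" as A
p_ab <- dhyper(k, m = nA, n = N - nA, k = nB)

# P(|A n B n C| > obs given |A n B| = k): C is a draw of nC genes,
# k of the N genes are "marked" as belonging to both A and B
p_tail <- phyper(obs, m = k, n = N - k, k = nC, lower.tail = FALSE)

# Marginal probability of a triple overlap strictly greater than obs
p_value <- sum(p_ab * p_tail)
p_value
```

Use `obs - 1` in the `phyper` call if "at least 135" is wanted instead of "more than 135".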

Although blatantly incorrect from a biological/genetic point of view, this is just one number that I'll be presenting...

Thanks for the replies!