Computing significance of overlap between two gene lists in Python
2.3 years ago
fr ▴ 170

I have two gene lists derived from two independent datasets, and I want to compute the significance of the overlap between two subgroups: the lists of differentially expressed genes in each dataset. I want to know whether the overlap between the two subgroups could have arisen by chance. For instance:

Dataset1 total: 500
Dataset1 subgroup: 100

Dataset2 total: 300
Dataset2 subgroup: 50

Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600

I want to compute how significant the overlap between the subgroups is compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test or the hypergeometric test, but I have trouble putting my data into these analyses.

From what I understand, the contingency table would be:

                Dataset1    Dataset2
In_subgroup     100         50
Not_subgroup    400         250
Total           500         300


Note that the universe comprises 600 unique elements, not 500+300, as some genes occur in both dataset1 and dataset2. Given this, and based on another post, I would do this in R:

phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19


Translating this into Python, I would use scipy:

>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)
0.0


Can I assume that the difference between both results is numeric error?
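
A minimal sketch of how the R call above might map onto SciPy, assuming the 600-gene universe from the question. `scipy.stats.hypergeom` takes its arguments as (k, M, n, N): population size M, number of marked items n, number of draws N. R's `phyper` takes (q, m, n, k): marked, unmarked, draws. `sf(k)` gives P(X > k), the counterpart of `lower.tail = FALSE`. Under this mapping, `cdf(24, 600, 500, 300)` describes a different experiment (500 marked genes, 300 draws) in which at least 200 marked genes must be drawn, so a lower-tail probability of exactly 0 is expected and is not a rounding artefact. Variable names are illustrative.

```python
from scipy.stats import hypergeom

# Counts taken from the question; the names are illustrative only.
universe = 600    # M: unique genes across both datasets
subgroup1 = 100   # n: "marked" genes (Dataset1 subgroup)
subgroup2 = 50    # N: genes "drawn" (Dataset2 subgroup)
overlap = 25      # observed intersection

# P(X >= 25) = P(X > 24); same parameterization as
# R's phyper(24, 100, 500, 50, lower.tail = FALSE).
p_value = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)
print(p_value)

# The call as posted: 500 marked genes, 300 draws, lower tail.
# With only 100 unmarked genes, at least 200 marked ones are drawn,
# so P(X <= 24) is zero by construction.
lower_tail_as_posted = hypergeom.cdf(24, 600, 500, 300)
print(lower_tail_as_posted)
```

Whether 600 is the right universe is a separate modelling choice; the sketch only aligns the two parameterizations.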

python hypergeometric fishers • 1.6k views

Similarity of equally-sized distributions can be assessed using a Kolmogorov-Smirnov test. Since your distributions are unequal in size, you may want to look at the Mann–Whitney U test. Scipy stats has functions to calculate both quantities.


@Mensur, thanks for your comment. Perhaps I'm missing something, but I do not want to compare 2 distributions; I want to assess the likelihood of getting the overlap between subgroups by chance.


Your two subgroups can be thought of as distributions of numbers. Start with zeros for each gene in the genome, and put 1 when they occur in your list. Do the same for your other list, and you have two distributions of numbers. Not sure whether that's a better test than what you are doing already, but it can be done.
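
The 0/1 encoding described above can be sketched as follows. Cross-tabulating the two indicator vectors gives a 2x2 table, and a one-sided Fisher's exact test on that table is one possible way to test the overlap. The gene positions are made up so that the counts match the question (600 genes, subgroups of 100 and 50, 25 shared); they are not real data.

```python
import numpy as np
from scipy.stats import fisher_exact

n_genes = 600  # assumed universe size, taken from the question

# Start with zeros for every gene, then put 1 where the gene is in a list.
# Index ranges are hypothetical, chosen only to reproduce the counts.
in_list1 = np.zeros(n_genes, dtype=int)
in_list2 = np.zeros(n_genes, dtype=int)
in_list1[0:100] = 1    # 100 genes in subgroup 1
in_list2[75:125] = 1   # 50 genes in subgroup 2, 25 of them shared

# Cross-tabulate the two indicator vectors into a 2x2 table.
both = int(np.sum((in_list1 == 1) & (in_list2 == 1)))
only1 = int(np.sum((in_list1 == 1) & (in_list2 == 0)))
only2 = int(np.sum((in_list1 == 0) & (in_list2 == 1)))
neither = int(np.sum((in_list1 == 0) & (in_list2 == 0)))
table = [[both, only1], [only2, neither]]
print(table)

# One-sided test: is the overlap larger than expected by chance?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(p_value)
```

With `alternative="greater"`, the Fisher p-value is the upper tail of the hypergeometric distribution for the same counts, so both routes should agree.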