Computing significance of overlap between two gene lists in Python
4.1 years ago
fr ▴ 210

I have two gene lists, derived from two independent datasets, and I want to compute the significance of the overlap between two subgroups: the lists of differentially expressed genes in each dataset. I want to know whether the overlap between the two subgroups could be explained by chance. For instance:

Dataset1 total: 500
Dataset1 subgroup: 100

Dataset2 total: 300
Dataset2 subgroup: 50

Intersection between subgroups: 25
Union 1 and 2 (no duplicates): 600

I want to compute how significant the overlap between the subgroups is, compared with what would be expected by chance. How would you do this in Python? I was looking at Fisher's exact test or the hypergeometric test, but I have some problems putting my data into the analyses.

From what I understand, the contingency table would be:

                Dataset1    Dataset2
In_subgroup     100         50
Not_subgroup    400         250
Total           500         300

Note that the universe comprises 600 unique elements, not 500+300, because some genes appear in both dataset1 and dataset2. Given this, and based on another post, I would do this in R:

phyper(24, 100, 500, 50, lower.tail = FALSE)
[1] 9.15e-19

Translating this into Python, I would use scipy:

>>> scipy.stats.hypergeom.cdf(24, 600, 500, 300)

Can I assume that the difference between the two results is numerical error?
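For comparison, here is a minimal sketch of how the R call could be mirrored in scipy. It assumes the 600 unique genes are the universe, as stated in the question, and it relies on the documented argument order of `scipy.stats.hypergeom` (`k, M, n, N`: population size `M`, number of "success" items `n`, number of draws `N`), which differs from R's `phyper(q, m, n, k)`. The two calls in the question do not appear to ask for the same quantity: the R call takes the upper tail, while `cdf` is the lower tail, and the scipy arguments describe a different population. The variable names below are illustrative only.

```python
from scipy.stats import hypergeom

# Numbers taken from the question. Treating the 600 unique genes
# as the universe is an assumption carried over from the post.
universe = 600    # M: population size (R: m + n = 100 + 500)
subgroup1 = 100   # n: "success" items in the population (R: m)
subgroup2 = 50    # N: number of draws (R: k)
overlap = 25      # observed intersection of the two subgroups

# sf(k) = P(X > k), the same tail as lower.tail = FALSE in R,
# so sf(overlap - 1) gives P(X >= overlap).
p_overlap = hypergeom.sf(overlap - 1, universe, subgroup1, subgroup2)
print(p_overlap)
```

Using `sf` rather than `1 - cdf` avoids losing precision when the tail probability is very small.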

python hypergeometric fishers • 2.7k views

Similarity of equally-sized distributions can be assessed using a Kolmogorov-Smirnov test. Since your distributions are unequal in size, you may want to look at the Mann–Whitney U test. scipy.stats has functions for both tests.


@Mensur, thanks for your comment. Perhaps I'm missing something, but I do not want to compare two distributions; I want to assess how likely it is to get the observed overlap between the subgroups by chance.


Your two subgroups can be thought of as distributions of numbers. Start with zeros for each gene in the genome, and put 1 when they occur in your list. Do the same for your other list, and you have two distributions of numbers. Not sure whether that's a better test than what you are doing already, but it can be done.
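The 0/1 encoding described here can be sketched as follows. Cross-tabulating the two indicator vectors gives a 2x2 table (in both lists, only in list 1, only in list 2, in neither), which can then be passed to Fisher's exact test. The gene names and list contents are hypothetical placeholders chosen only to reproduce the counts from the question (100 and 50 genes, 25 shared, 600 in the universe); this is one possible setup, not necessarily the one the commenter had in mind.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical universe of 600 placeholder gene names (not real data).
universe = [f"gene{i}" for i in range(600)]
list1 = set(universe[:100])     # stands in for subgroup 1 (100 genes)
list2 = set(universe[75:125])   # stands in for subgroup 2 (50 genes, 25 shared)

# 0/1 indicator vectors: True where a gene of the universe is in the list.
in1 = np.array([g in list1 for g in universe])
in2 = np.array([g in list2 for g in universe])

# 2x2 table: rows = in list1 / not in list1, columns = in list2 / not in list2.
table = [
    [int(np.sum(in1 & in2)), int(np.sum(in1 & ~in2))],
    [int(np.sum(~in1 & in2)), int(np.sum(~in1 & ~in2))],
]

# One-sided test: is the overlap larger than expected by chance?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(table, p_value)
```

With `alternative="greater"`, the p-value is the upper tail of the hypergeometric distribution with the same margins, so it answers the same question as a one-sided hypergeometric test.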
