Suppose I have a list of genes:

```
mygenes: gene13,gene2,gene111.
```

And I am given other lists of genes, grouped by category:

```
gene_categoryA: gene1, gene2, gene44, gene111.
gene_categoryB: gene13,gene34.
```

After comparing `mygenes` against `gene_categoryA`, we see that there are 2 genes of `categoryA` in `mygenes`.

What I want to know is whether the occurrence of these 2 genes (gene2 and gene111) is significantly more than expected.

What is the best way to go about it?

Thanks, Damian. What if I have more than 2 gene categories to compare, e.g. `gene_categoryA,...gene_categoryK`? Another way to look at it is that the balls in the urn are now not only red and black, but come in more colours. How can I modify your code for that? The task is still the same, namely to check whether genes from gene_categoryA are significantly over-represented in my set of genes.

You would have to use a multivariate hypergeometric distribution. I am not sure if SciPy has that function.
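For what it is worth, the multivariate hypergeometric probability is short enough to write with the standard library alone. The sketch below is only an illustration: the function name `multivariate_hypergeom_pmf` and the third category of size 3 are made up for the example and do not come from the lists in the question. Recent SciPy releases also appear to provide `scipy.stats.multivariate_hypergeom`, which may be worth checking.

```python
from math import comb, prod

def multivariate_hypergeom_pmf(draws, sizes):
    """Probability of drawing exactly draws[i] items of each colour i
    when sampling sum(draws) items without replacement from an urn
    that holds sizes[i] items of colour i."""
    if any(d < 0 or d > s for d, s in zip(draws, sizes)):
        return 0.0
    ways = prod(comb(s, d) for s, d in zip(sizes, draws))
    return ways / comb(sum(sizes), sum(draws))

# Hypothetical urn: 4 category-A genes, 2 category-B genes and
# 3 category-C genes (category C is invented for this example).
sizes = (4, 2, 3)

# Probability that a 3-gene list holds exactly 2 A, 1 B and 0 C genes.
p = multivariate_hypergeom_pmf((2, 1, 0), sizes)
print(p)
```

This gives the probability of one exact colour composition, which is a different question from "is category A over-represented".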

If the question is the same (i.e. check whether genes from gene_categoryA are significantly over-represented in the set of genes), then I don't see why it should matter how many categories there are. After all, we can abstract all the others as "non-A categories" and proceed the same way to calculate the probability of category A being over-represented in our gene list. Am I missing something here?
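A minimal sketch of that "A versus non-A" idea, using the gene lists from the question. It assumes, purely for illustration, that the whole gene universe is the union of the listed categories (6 genes); in practice the background would be the full set of annotated genes. It uses only `math.comb`, not the script discussed in this thread.

```python
from math import comb

categories = {
    "gene_categoryA": {"gene1", "gene2", "gene44", "gene111"},
    "gene_categoryB": {"gene13", "gene34"},
}
mygenes = {"gene13", "gene2", "gene111"}

# ASSUMPTION: the background universe is just the union of all listed
# categories. Everything outside category A is treated as "non-A".
universe = set().union(*categories.values())
target = categories["gene_categoryA"]

M = len(universe)             # total genes in the urn
n = len(target)               # "red" balls: category-A genes
N = len(mygenes & universe)   # genes drawn (size of my list)
k = len(mygenes & target)     # category-A genes observed in my list

def pmf(i):
    """P(X = i) for the hypergeometric distribution with M, n, N above."""
    return comb(n, i) * comb(M - n, N - i) / comb(M, N)

# Over-representation p-value: P(X >= k), summed up to the largest
# possible number of category-A genes in a list of size N.
p_value = sum(pmf(i) for i in range(k, min(n, N) + 1))
print(k, p_value)
```

With this tiny six-gene universe the p-value comes out at about 0.8, so seeing 2 category-A genes in a 3-gene list would not be surprising. Adding more categories only changes `M`; the calculation for category A stays the same.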

@Damian: I was wondering why you are subtracting `1` from `arg[4]` when you are calculating the `survival function`. The same type of question can be asked about adding `1` to `arg[4]` when calculating the `CDF`. Is it because we are working with discrete values, and to include the instance `X=x` in the calculation we have to either add or subtract `1`?

I always forget how these two functions (cdf, sf) go, in terms of whether they are off by one or not when you want to do > or >=.

I think I got an e-mail from someone asking this same question earlier this year. It turns out the p-value <= portion of the above script already calculates <=, so it is unnecessary to add 1. The p-value >= portion is still correct, since the sf (survival function) calculates >, which is why 1 has to be subtracted there.
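The off-by-one can be checked by hand with a small sketch. The `pmf`, `cdf` and `sf` helpers below are written from scratch with `math.comb` for illustration, following the same conventions as `scipy.stats.hypergeom` (cdf is P(X <= k), sf is P(X > k)). The numbers M=6, n=4, N=3, k=2 are an assumed toy example, not values from the original script.

```python
from math import comb

def pmf(k, M, n, N):
    """P(X = k): k successes in N draws from M items, n of them successes."""
    if k < 0 or k > n or N - k < 0 or N - k > M - n:
        return 0.0
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)

def cdf(k, M, n, N):
    """P(X <= k): already inclusive, so no +1 is needed."""
    return sum(pmf(i, M, n, N) for i in range(0, k + 1))

def sf(k, M, n, N):
    """P(X > k): strictly greater, so X = k is excluded."""
    return 1.0 - cdf(k, M, n, N)

M, n, N, k = 6, 4, 3, 2   # assumed toy values

p_less_equal = cdf(k, M, n, N)         # P(X <= k)
p_greater = sf(k, M, n, N)             # P(X > k), misses X = k
p_greater_equal = sf(k - 1, M, n, N)   # P(X >= k), hence the "- 1"
print(p_less_equal, p_greater, p_greater_equal)
```

Because the variable is discrete, P(X >= k) equals P(X > k - 1), so only the sf call needs the shifted argument.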

I've edited the post to reflect this. I feel bad now for propagating wrong information.