Question: The Problem With Converting Long Int To Float In Python
6
gravatar for Gahoo
7.7 years ago by
Gahoo260
United States
Gahoo260 wrote:

I want to implement the equation(cumulative hypergeometric distribution) in this article with python.

alt text

Here are the codes:

from math import factorial

def Binomial(n,k):
    if n > k:
        return factorial(n) / (factorial(k)*factorial(n-k))
    elif n == k:
        return 1

def Hypergeometric(x,m,n,N):
    return 1.0*Binomial(m,x)*Binomial(N-m,n-x)/Binomial(N,n)

def HypergeometricCDF(x,m,n,N):
    cdf=0
    for i in range(x,min(m,n)+1):
        cdf+=Hypergeometric(i,m,n,N)
    return cdf

print HypergeometricCDF(5,40,50,500)
print HypergeometricCDF(50,400,500,5000)

It will raise "OverflowError: long int too large to convert to float". But without converting to float,you would get 0.How to deal with large numbers like this in python? Are there any better ways to calculate cumulative hypergeometric distribution in python?

python • 6.0k views
ADD COMMENTlink written 7.7 years ago by Gahoo260
9
gravatar for Will
7.7 years ago by
Will4.5k
United States
Will4.5k wrote:

The problem is that with genomic scales the NchooseK operation produces giant numbers that often cause overflows in python. The trick is to use the log-nchoosek operations since this will keep even th elargest numbers within a reasonable range.

You can check out my gist here: git://gist.github.com/1051335.git

or this:

from math import log, exp
from mpmath import loggamma

def logchoose(ni, ki):
    try:
        lgn1 = loggamma(ni+1)
        lgk1 = loggamma(ki+1)
        lgnk1 = loggamma(ni-ki+1)
    except ValueError:
        #print ni,ki
        raise ValueError

    return lgn1 - (lgnk1 + lgk1)

def gauss_hypergeom(X, n, m, N):
    """Returns the probability of drawing X successes of m marked items
     in n draws from a bin of N total items."""

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of sucesses %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of sucesses %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)

    r1 = logchoose(m, X)
    try:
        r2 = logchoose(N-m, n-X)
    except ValueError:
        return 0
    r3 = logchoose(N,n)

    return exp(r1 + r2 - r3)

def hypergeo_cdf(X, n, m, N):

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of sucesses %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of sucesses %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)
    assert N-m >= n-X, 'There are more failures %i than unmarked items %i' % (N-m, n-X)

    s = 0
    for i in range(1, X+1):
        s += max(gauss_hypergeom(i, n, m, N), 0.0)
    return min(max(s,0.0), 1)

Everything should be self-explanitory.

ADD COMMENTlink written 7.7 years ago by Will4.5k

Thanks @Will, that's very helpful! and good to know.

ADD REPLYlink written 7.7 years ago by brentp22k

Thanks a lot.I have thought log function, but don't know how.

ADD REPLYlink written 7.7 years ago by Gahoo260
8
gravatar for brentp
7.7 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

The short answer is use scipy.stats.

I had the same problem and ended up with this https://github.com/brentp/bio-playground/blob/master/utils/list_overlap_p.py#L35

ADD COMMENTlink written 7.7 years ago by brentp22k

even that won't work when N is very large ... You need to use a log-hypercdf

ADD REPLYlink written 7.7 years ago by Will4.5k

nice, last time I checked scipy.stats would return overflow errors on large-N hypergeocdfs ... they must've updated things since then.

ADD REPLYlink written 7.7 years ago by Will4.5k

This looks more decent.But it required Scipy and Numpy.

ADD REPLYlink written 7.7 years ago by Gahoo260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 683 users visited in the last hour