Question: Which Way To Calculate Cumulative Hypergeometric Distribution Is More Accurate?
0
gravatar for Gahoo
2.8 years ago by
Gahoo210
United States
Gahoo210 wrote:

Thanks for the anwsers on this question by @Will and @brentp.But here comes another question,which one is more accurate.As the N grows, the error grows.

Here are the codes:

from math import log, exp
from mpmath import loggamma
from scipy.stats import hypergeom
import time

def logchoose(ni, ki):
    try:
        lgn1 = loggamma(ni+1)
        lgk1 = loggamma(ki+1)
        lgnk1 = loggamma(ni-ki+1)
    except ValueError:
        #print ni,ki
        raise ValueError
    return lgn1 - (lgnk1 + lgk1)

def gauss_hypergeom(X, n, m, N):
    """Returns the probability of drawing X successes of m marked items
     in n draws from a bin of N total items."""

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of sucesses %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of sucesses %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)


    r1 = logchoose(m, X)
    try:
        r2 = logchoose(N-m, n-X)
    except ValueError:
        return 0
    r3 = logchoose(N,n)

    return exp(r1 + r2 - r3)

def hypergeo_cdf(X, n, m, N):

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of sucesses %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of sucesses %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)
    assert N-m >= n-X, 'There are more failures %i than unmarked items %i' % (N-m, n-X)

    s = 0
    for i in range(0, X+1):
        s += max(gauss_hypergeom(i, n, m, N), 0.0)
    return min(max(s,0.0), 1)

def hypergeo_sf(X, n, m, N):

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of sucesses %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of sucesses %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)
    assert N-m >= n-X, 'There are more failures %i than unmarked items %i' % (N-m, n-X)

    s = 0
    for i in range(X, min(m,n)+1):
        s += max(gauss_hypergeom(i, n, m, N), 0.0)
    return min(max(s,0.0), 1)

print "\n800"
start=time.time()
print "log\t",hypergeo_cdf(20,60,80,800),hypergeo_sf(20,60,80,800),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(20,800,80,60),hypergeom.sf(19,800,80,60),time.time()-start
print "\n8000"
start=time.time()
print "log\t",hypergeo_cdf(200,600,800,8000),hypergeo_sf(200,600,800,8000),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(200,8000,800,600),hypergeom.sf(199,8000,800,600),time.time()-start
print "\n80000"
start=time.time()
print "log\t",hypergeo_cdf(200,600,800,80000),hypergeo_sf(200,600,800,80000),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(200,80000,800,600),hypergeom.sf(199,80000,800,600),time.time()-start

And the result is below:

800
log    0.999999970682 1.77621863705e-07 0.0460000038147
scipy    0.999999970682 1.77621235498e-07 0.00399994850159

8000
log    0.999999999993 2.81852829734e-61 0.466000080109
scipy    1.0 -9.94981874669e-13 0.0139999389648

80000
log    1 2.32353743034e-249 0.475000143051
scipy    0.999999999924 7.60518314991e-11 0.184000015259

As we can see, scipy is faster.But I wonder which one is more accurate.The result is nearly the same when N is small.

ADD COMMENTlink written 2.8 years ago by Gahoo210

and btw ... the timeit module in the python standard library makes things like this much easier.

ADD REPLYlink written 2.8 years ago by Will4.0k

I'm newbie...Thanks for your advice.

ADD REPLYlink written 2.8 years ago by Gahoo210
1
gravatar for Will
2.8 years ago by
Will4.0k
Will4.0k wrote:

I'm not surprised that scipy is faster. Its able to do a more vectorized version ... and possibly more of the calculation in C. I usually wrap the log_nchoosek in a @memorize decorator which speeds things significantly.

I think the only way to check "accuracy" would be to calculate the values using python's Decimal library which lets you have arbitrary digits of precision at each calculation (so there are no round-off/overflow errors).

However, I think the accuracy at low probabilities is not very important. Assuming you're using this to calculate p-values of overlaps all you really care about is its accuracy around your cutoff ... the difference between 2.32374912E-240 and 2.32374929E-240 is arbitrary because they're both below any reasonable cutoff.

The reason I usually choose my version over the scipy is because mpmath is pure python while scipy requires compiled extensions. In some IT lock-down servers this is an important consideration ... because it may take a week or two to get the IT guy to properly install scipy where with a pure-python implementation I can easily just copy the source-code into my python path.

ADD COMMENTlink written 2.8 years ago by Will4.0k

I believe it's @memoize not @memorize see here

ADD REPLYlink written 2.8 years ago by Gww2.4k

I believe it's @memoize not @memorize see (here)[http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize]

ADD REPLYlink written 2.8 years ago by Gww2.4k

I believe it's @memoize not @memorize see here http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize

ADD REPLYlink written 2.8 years ago by Gww2.4k

Thanks for your advice.What I want to get is the second number.The problem is, When N=8000, log_nchoosek=2.81852829734e-61 and scipy=-9.94981874669e-13. I think there shouldn't be negative number.

ADD REPLYlink written 2.8 years ago by Gahoo210

yeah ... looks like a round-off error on scipy's part. If I were you I would put an assert in there so if it ever returns a negative number you know it. Because if any of the numbers are negative then you can't believe anything further down the pipeline. Perhaps you could write a "top level" function which looks at N,m,k,x and decides which method to use ... ie. if they're small then use scipy for speed, if they're large use log-method for correctness.

ADD REPLYlink written 2.8 years ago by Will4.0k

Scipy does require compiled extensions to work well, but I keep a ~/local install.

See: http://advice.mechanicalkern.com/question/16/how-do-you-install-numpyscipy-with-atlas-support-on-a-machine-with-non-root-access. After that just modify bashrc to include your LDPATH, PATH and PYTHONPATH.

Compiling ATLAS gives me at least 3x to 10x speed up for most work, sometimes more.

When I need more speed in python I use f2py and wrap fortran. For single precision if you're willing to invest in a $200 videocard you can use PyCuda to wrap nvcc and get ~60x bump for some code over pure c on Xeons.

ADD REPLYlink written 2.6 years ago by Jake Mick50
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 635 users visited in the last hour