Which Way To Calculate Cumulative Hypergeometric Distribution Is More Accurate?
12.4 years ago
Gahoo ▴ 270

Thanks for the answers on this question by @Will and @brentp. But here comes another question: which one is more accurate? As N grows, the error grows.

Here is the code:

from math import log, exp
from mpmath import loggamma
from scipy.stats import hypergeom
import time

def logchoose(ni, ki):
    try:
        lgn1 = loggamma(ni+1)
        lgk1 = loggamma(ki+1)
        lgnk1 = loggamma(ni-ki+1)
    except ValueError:
        #print ni,ki
        raise ValueError
    return lgn1 - (lgnk1 + lgk1)

def gauss_hypergeom(X, n, m, N):
    """Returns the probability of drawing X successes of m marked items
    in n draws from a bin of N total items."""

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of successes %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of successes %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)

    r1 = logchoose(m, X)
    try:
        r2 = logchoose(N-m, n-X)
    except ValueError:
        return 0
    r3 = logchoose(N, n)

    return exp(r1 + r2 - r3)

def hypergeo_cdf(X, n, m, N):

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of successes %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of successes %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)
    assert N-m >= n-X, 'There are more failures %i than unmarked items %i' % (N-m, n-X)

    s = 0
    for i in range(0, X+1):
        s += max(gauss_hypergeom(i, n, m, N), 0.0)
    return min(max(s, 0.0), 1)

def hypergeo_sf(X, n, m, N):

    assert N >= m, 'Number of items %i must be larger than the number of marked items %i' % (N, m)
    assert m >= X, 'Number of marked items %i must be larger than the number of successes %i' % (m, X)
    assert n >= X, 'Number of draws %i must be larger than the number of successes %i' % (n, X)
    assert N >= n, 'Number of draws %i must be smaller than the total number of items %i' % (n, N)
    assert N-m >= n-X, 'There are more failures %i than unmarked items %i' % (N-m, n-X)

    s = 0
    for i in range(X, min(m, n)+1):
        s += max(gauss_hypergeom(i, n, m, N), 0.0)
    return min(max(s, 0.0), 1)

print "\n800"
start=time.time()
print "log\t",hypergeo_cdf(20,60,80,800),hypergeo_sf(20,60,80,800),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(20,800,80,60),hypergeom.sf(19,800,80,60),time.time()-start
print "\n8000"
start=time.time()
print "log\t",hypergeo_cdf(200,600,800,8000),hypergeo_sf(200,600,800,8000),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(200,8000,800,600),hypergeom.sf(199,8000,800,600),time.time()-start
print "\n80000"
start=time.time()
print "log\t",hypergeo_cdf(200,600,800,80000),hypergeo_sf(200,600,800,80000),time.time()-start
start=time.time()
print "scipy\t",hypergeom.cdf(200,80000,800,600),hypergeom.sf(199,80000,800,600),time.time()-start


And the results are below:

800
log    0.999999970682 1.77621863705e-07 0.0460000038147
scipy    0.999999970682 1.77621235498e-07 0.00399994850159

8000
log    0.999999999993 2.81852829734e-61 0.466000080109
scipy    1.0 -9.94981874669e-13 0.0139999389648

80000
log    1 2.32353743034e-249 0.475000143051
scipy    0.999999999924 7.60518314991e-11 0.184000015259


As we can see, scipy is faster. But I wonder which one is more accurate. The results are nearly the same when N is small.


And by the way ... the timeit module in the Python standard library makes things like this much easier.
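As a sketch of what that looks like for the functions above (using math.lgamma instead of mpmath to stay dependency-free; the repeat count is an arbitrary choice):

```python
import timeit
from math import lgamma

def logchoose(n, k):
    # log of n-choose-k via log-gamma, same idea as the question's version
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# timeit handles the start/stop bookkeeping; number=1000 is illustrative
per_call = timeit.timeit(lambda: logchoose(8000, 600), number=1000) / 1000
print("logchoose(8000, 600): %.2e s per call" % per_call)
```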

12.4 years ago
Will 4.5k

I'm not surprised that scipy is faster. It's able to do a more vectorized version ... and possibly does more of the calculation in C. I usually wrap the log_nchoosek in a @memorize decorator, which speeds things up significantly.
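A minimal sketch of such a caching decorator (the decorator body and the math.lgamma-based logchoose are illustrative, not the exact code from the answer). It pays off here because hypergeo_cdf and hypergeo_sf recompute terms like logchoose(N, n) on every loop iteration:

```python
from functools import wraps
from math import lgamma

def memoize(func):
    # Cache results keyed on the positional arguments.
    cache = {}
    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def logchoose(n, k):
    # log of n-choose-k via log-gamma; repeat calls hit the cache
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
```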

I think the only way to check "accuracy" would be to calculate the values using Python's Decimal library, which lets you carry an arbitrary number of digits of precision through each calculation (so there are no round-off/overflow errors).
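A sketch of such a reference check. Here exact integer binomials (math.comb, Python 3.8+) feed a single high-precision Decimal division at the end, which serves the same purpose as carrying Decimal through every step; the precision setting is an arbitrary choice:

```python
from decimal import Decimal, getcontext
from math import comb  # exact integer binomial coefficients

getcontext().prec = 80  # 80 significant digits; arbitrary but generous

def hyper_sf_exact(X, n, m, N):
    # P(K >= X): sum of exact integer pmf numerators, one Decimal division.
    total = comb(N, n)
    num = sum(comb(m, i) * comb(N - m, n - i) for i in range(X, min(m, n) + 1))
    return Decimal(num) / Decimal(total)

# comparable with the ~1.78e-07 tail values printed in the question
print(hyper_sf_exact(20, 60, 80, 800))
```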

However, I think accuracy at low probabilities is not very important. Assuming you're using this to calculate p-values of overlaps, all you really care about is accuracy around your cutoff ... the difference between 2.32374912E-240 and 2.32374929E-240 is immaterial because they're both far below any reasonable cutoff.

The reason I usually choose my version over scipy is that mpmath is pure Python while scipy requires compiled extensions. On some locked-down servers this is an important consideration ... it may take a week or two to get the IT guy to properly install scipy, whereas with a pure-Python implementation I can simply copy the source code into my Python path.


I believe it's @memoize, not @memorize; see here.


Thanks for your advice. What I want to get is the second number. The problem is, when N=8000, the log method gives 2.81852829734e-61 while scipy gives -9.94981874669e-13. I think there shouldn't be a negative number.


Yeah ... looks like a round-off error on scipy's part. If I were you I would put an assert in there so that if it ever returns a negative number, you know it, because if any of the numbers are negative then you can't trust anything further down the pipeline. Perhaps you could write a "top-level" function which looks at N, m, k, x and decides which method to use ... i.e., if they're small then use scipy for speed; if they're large, use the log method for correctness.
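A hypothetical version of such a top-level function. The cutoff value is arbitrary and would need tuning, and the exact-binomial small-N branch stands in here for scipy so the sketch is self-contained:

```python
from math import comb, exp, lgamma

def log_pmf(i, n, m, N):
    # log-space hypergeometric pmf term: log[ C(m,i) C(N-m,n-i) / C(N,n) ]
    return (lgamma(m + 1) - lgamma(i + 1) - lgamma(m - i + 1)
            + lgamma(N - m + 1) - lgamma(n - i + 1) - lgamma(N - m - n + i + 1)
            - (lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)))

def hyper_sf(X, n, m, N, big=5000):
    # Hypothetical dispatcher: `big` is an arbitrary cutoff. Below it, use
    # exact integer binomials (standing in for scipy); above it, the
    # log-space method, whose terms cannot go negative.
    hi = min(m, n)
    if N < big:
        p = sum(comb(m, i) * comb(N - m, n - i) for i in range(X, hi + 1)) / comb(N, n)
    else:
        p = sum(exp(log_pmf(i, n, m, N)) for i in range(X, hi + 1))
    # guard against round-off producing an untrustworthy negative tail
    assert p >= 0, "negative tail probability; do not trust downstream results"
    return min(p, 1.0)
```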


Scipy does require compiled extensions to work well, but I keep a ~/local install.

See: http://advice.mechanicalkern.com/question/16/how-do-you-install-numpyscipy-with-atlas-support-on-a-machine-with-non-root-access. After that, just modify your .bashrc to include your LD_LIBRARY_PATH, PATH and PYTHONPATH.

Compiling ATLAS gives me at least a 3x to 10x speedup for most work, sometimes more.

When I need more speed in Python I use f2py and wrap Fortran. For single precision, if you're willing to invest in a $200 video card, you can use PyCUDA to wrap nvcc and get a ~60x bump for some code over pure C on Xeons.