Question

Statistical test to compare two discrete distributions

0

Entering edit mode

8.5 years ago

javier.arpon • 0

Hi all,

I am trying to see whether there is any difference between two distributions of discrete values. These values are between 0,00 and 1,00 (steps of 0,01). They are indexes which represent how a spatial observed patterns differ from specific spatial patterns (i.e. complete randomness), thus these indexes depend on the number of objects of the pattern.

I have already used two Kolmogorov Smirnov test alternative for discrete distributions:

- to test the CSR of each distribution using dgof::ks.test (in R)

- to test distribution vs distribution using Matching::ks.boot (in R)

Does anyone have in mind a better test for this kind of discrete data?

Thanks a lot!

discrete distributions statistical test ks test • 12k views

ADD COMMENT • link updated 8.5 years ago by Giovanni M Dall'Olio 28k • written 8.5 years ago by javier.arpon • 0

0

Entering edit mode

maybe this helps:

http://stats.stackexchange.com/questions/88764/test-for-difference-between-2-empirical-discrete-distributions

http://www.mathworks.com/matlabcentral/answers/63476-quantify-difference-between-discrete-distributions

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by TriS ★ 4.7k

Ram · Answer 1 · 2015-10-28

I would do a linear model modelling on a poisson distribution.

First, merge your datasets into a single dataframe with two columns, like the following:

> d1 = data.frame(x=sample(seq(0,1,0.01),100,replace=T), dataset='d1')
> d2 = data.frame(x=sample(seq(0,1,0.01),100,replace=T), dataset='d2')
> mydf = rbind(d1, d2)

Then you can use a glm model as the following. The poisson distribution is a good choice in this case because you have discrete value with non-negative numbers.

> summary(glm(x~dataset, data=mydf, family='poisson'))
Call:
glm(formula = x ~ dataset, family = "poisson", data = mydf)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-1.01617  -0.42070   0.01332   0.33097   0.63025  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.70401    0.14219  -4.951 7.37e-07 ***
datasetd2    0.04294    0.19896   0.216    0.829    
---
....

This summary is telling that an element from dataset d1 has a negative odds ratio (-0.7) of being equal to 1, which is quite in line with the data given that it ranges from 0 to 1.

If the two datasets are different, then you would observe a significant p-value at dataset d2. An element from d2 is slightly more likely (odds ratio of +0.04294 higher than d1) to be equal to 1, but this is not significant (p-value 0.829).

It may be interesting to multiply your scores by 100. In this case the regression will give you the odds of having a difference of 0.01 in the original data:

> glm(I(x*100)~dataset, data=mydf, family='poisson') %>% summary

Call:
glm(formula = I(x * 100) ~ dataset, family = "poisson", data = mydf)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-10.1617   -4.2070    0.1332    3.3097    6.3025  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.90116    0.01422 274.360   <2e-16 ***
datasetd2    0.04294    0.01990   2.158   0.0309 *

....

This is telling you that an element in d1 has an high odds ratio of being higher than 0.01. An element in d2 is also slightly more likely to be 0.01 higher than an element in d1 (p-value 0.0309)

------EDIT -------

Sorry for the many edits. Maybe it is easier to interpret the regression using a identity rather than a log link. This time we can do it with a dataset derived from randu, which should be reproducible:

> d1 = data.frame(x=round(randu$x[0:200],2), dataset='d1')
> d2 = data.frame(x=round(randu$x[201:400],2), dataset='d2')
> mydf = rbind(d1, d2)

> summary(glm(x~dataset, data=mydf, family=poisson(link='identity')) )

Call:
glm(formula = x ~ dataset, family = poisson(link = "identity"),
    data = mydf)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-1.02699  -0.34096   0.02005   0.32455   0.58137  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.52735    0.05135  10.270   <2e-16 ***
datasetd2   -0.00195    0.07255  -0.027    0.979    
---
....

The regression now tells you that dataset d1 has a mean of 0.52745 and that this is significantly different from 0. Dataset d2 has a mean 0.00195 lower than d1 (it should be 0.52540), and there is no significant difference.

In alternative, since your values range from 0 to 1, you could also model on a binomial distribution:

> summary(glm(x~dataset, data=mydf, family='binomial'))
Call:
glm(formula = x ~ dataset, family = "binomial", data = mydf)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-1.20523  -0.54424   0.01911   0.51139   1.18660  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02160    0.20001  -0.108    0.914
datasetd2    0.08682    0.28293   0.307    0.759

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.948  on 199  degrees of freedom
Residual deviance: 83.854  on 198  degrees of freedom
AIC: 280.86