Statistical test to compare two discrete distributions
6.2 years ago

Hi all,

I am trying to see whether there is any difference between two distributions of discrete values. The values lie between 0.00 and 1.00, in steps of 0.01. They are indexes that quantify how an observed spatial pattern differs from specific reference patterns (e.g. complete spatial randomness), so the indexes depend on the number of objects in the pattern.

I have already used two Kolmogorov-Smirnov test alternatives for discrete distributions:

- to test each distribution against complete spatial randomness (CSR) using dgof::ks.test (in R)

- to test distribution vs distribution using Matching::ks.boot (in R)

Does anyone know a better test for this kind of discrete data?

Thanks a lot!
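As background on why the discrete-specific variants above are needed: the classical stats::ks.test assumes continuous distributions, and with only 101 possible values the samples contain many ties. A base-R illustration with simulated data (not the real indexes):

```r
# Illustration (simulated data, not the real indexes): with only 101
# possible values the samples contain many ties, which violates the
# continuity assumption of the classical two-sample KS test.
set.seed(42)
x1 <- sample(seq(0, 1, 0.01), 100, replace = TRUE)
x2 <- sample(seq(0, 1, 0.01), 100, replace = TRUE)
# stats::ks.test warns about ties here; discrete-aware variants such as
# dgof::ks.test and Matching::ks.boot are designed to handle them
res <- suppressWarnings(ks.test(x1, x2))
res$p.value
```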

Tags: discrete distributions, statistical test, ks test

I would fit a generalized linear model (GLM) assuming a Poisson distribution.

First, merge your datasets into a single dataframe with two columns, like the following:

> d1 = data.frame(x=sample(seq(0,1,0.01),100,replace=T), dataset='d1')
> d2 = data.frame(x=sample(seq(0,1,0.01),100,replace=T), dataset='d2')
> mydf = rbind(d1, d2)


Then you can fit a GLM as follows. The Poisson distribution is a reasonable first choice here because the values are discrete and non-negative.

> summary(glm(x~dataset, data=mydf, family='poisson'))
Call:
glm(formula = x ~ dataset, family = "poisson", data = mydf)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.01617  -0.42070   0.01332   0.33097   0.63025

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.70401    0.14219  -4.951 7.37e-07 ***
datasetd2    0.04294    0.19896   0.216    0.829
---
....


This summary tells you that the intercept (-0.704) is the log of the mean of d1: exp(-0.704) ≈ 0.49, which is in line with data sampled uniformly between 0 and 1.

If the two datasets are different, you will see a significant p-value on the datasetd2 coefficient. Here the coefficient (0.043) is the log of the ratio of the two group means: exp(0.043) ≈ 1.04, i.e. d2's mean is about 4% higher than d1's, but the difference is not significant (p = 0.829).
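As a sanity check on this interpretation, here is a minimal sketch with simulated counts (hypothetical data, not the real indexes): exp(intercept) recovers the mean of the reference group d1, and exp(intercept + slope) the mean of d2.

```r
set.seed(1)
d1 <- data.frame(x = rpois(100, lambda = 5), dataset = "d1")
d2 <- data.frame(x = rpois(100, lambda = 5), dataset = "d2")
mydf <- rbind(d1, d2)
fit <- glm(x ~ dataset, data = mydf, family = "poisson")
# in a two-group Poisson GLM with log link the MLEs are the log group
# means, so these two quantities match the sample means
exp(coef(fit)[["(Intercept)"]])  # mean of d1$x
exp(sum(coef(fit)))              # mean of d2$x
```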

It may be worth multiplying your scores by 100 so that they become integer counts between 0 and 100, which better matches the Poisson assumption. The coefficients then refer to steps of 0.01 in the original data:

> summary(glm(I(x*100)~dataset, data=mydf, family='poisson'))

Call:
glm(formula = I(x * 100) ~ dataset, family = "poisson", data = mydf)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-10.1617   -4.2070    0.1332    3.3097    6.3025

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.90116    0.01422 274.360   <2e-16 ***
datasetd2    0.04294    0.01990   2.158   0.0309 *


....

On this scale the intercept (3.901) is simply shifted by log(100), so exp(3.901) ≈ 49.5 is the mean count in d1. The datasetd2 coefficient is unchanged (0.043), but its standard error is ten times smaller, so the same ~4% difference is now nominally significant (p = 0.031).
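The slope really is identical on both scales; only the intercept shifts by log(100), and the standard errors shrink by a factor of 10 under the Poisson variance assumption. A quick check with simulated data (hypothetical, not the real indexes):

```r
set.seed(1)
d1 <- data.frame(x = sample(seq(0, 1, 0.01), 100, replace = TRUE), dataset = "d1")
d2 <- data.frame(x = sample(seq(0, 1, 0.01), 100, replace = TRUE), dataset = "d2")
mydf <- rbind(d1, d2)
# suppressWarnings(): Poisson with non-integer responses emits warnings
f1 <- suppressWarnings(glm(x ~ dataset, data = mydf, family = "poisson"))
f2 <- suppressWarnings(glm(I(x * 100) ~ dataset, data = mydf, family = "poisson"))
coef(f1)[["datasetd2"]] - coef(f2)[["datasetd2"]]  # ~0: same slope
se <- function(f) summary(f)$coefficients["datasetd2", "Std. Error"]
se(f1) / se(f2)                                    # ~10
```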

------EDIT -------

Sorry for the many edits. It may be easier to interpret the regression using an identity link rather than the default log link. This time we can use a dataset derived from randu, which should make it reproducible:

> d1 = data.frame(x=round(randu$x[1:200],2), dataset='d1')
> d2 = data.frame(x=round(randu$x[201:400],2), dataset='d2')
> mydf = rbind(d1, d2)
> summary(glm(x~dataset, data=mydf, family=poisson(link='identity')))

Call:
glm(formula = x ~ dataset, family = poisson(link = "identity"),
data = mydf)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.02699  -0.34096   0.02005   0.32455   0.58137

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.52735    0.05135  10.270   <2e-16 ***
datasetd2   -0.00195    0.07255  -0.027    0.979
---
....


The regression now tells you that dataset d1 has a mean of 0.52735, significantly different from 0. Dataset d2 has a mean 0.00195 lower than d1 (i.e. 0.52540), and the difference is not significant.
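With the identity link this is easy to verify: the fitted coefficients are exactly the group means and their difference. A sketch, assuming the same randu-derived construction as above:

```r
d1 <- data.frame(x = round(randu$x[1:200], 2),   dataset = "d1")
d2 <- data.frame(x = round(randu$x[201:400], 2), dataset = "d2")
mydf <- rbind(d1, d2)
fit <- suppressWarnings(  # warnings: non-integer responses under Poisson
  glm(x ~ dataset, data = mydf, family = poisson(link = "identity")))
coef(fit)[["(Intercept)"]]  # equals mean(d1$x)
coef(fit)[["datasetd2"]]    # equals mean(d2$x) - mean(d1$x)
```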

Alternatively, since your values range from 0 to 1, you could also model them with a binomial distribution:

> summary(glm(x~dataset, data=mydf, family='binomial'))
Call:
glm(formula = x ~ dataset, family = "binomial", data = mydf)

Deviance Residuals:
Min        1Q    Median        3Q       Max
-1.20523  -0.54424   0.01911   0.51139   1.18660

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02160    0.20001  -0.108    0.914
datasetd2    0.08682    0.28293   0.307    0.759

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 83.948  on 199  degrees of freedom
Residual deviance: 83.854  on 198  degrees of freedom
AIC: 280.86
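One caveat with this last fit: x is a proportion rather than a 0/1 outcome, so family='binomial' emits "non-integer #successes" warnings. A quasibinomial family fits the same mean model without the warnings and estimates the dispersion from the data instead of fixing it at 1. A sketch, assuming the randu-derived data from above:

```r
d1 <- data.frame(x = round(randu$x[1:200], 2),   dataset = "d1")
d2 <- data.frame(x = round(randu$x[201:400], 2), dataset = "d2")
mydf <- rbind(d1, d2)
# quasibinomial accepts proportions in [0,1] without warnings and
# reports t-based rather than z-based p-values
fit <- glm(x ~ dataset, data = mydf, family = "quasibinomial")
summary(fit)$coefficients["datasetd2", ]  # estimate, SE, t value, p-value
```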


I'm not sure a linear model is the right choice, at least depending on what the OP wants to test, which isn't entirely clear to me. In this case a (g)lm tests whether a regression coefficient differs from 0. However, two distributions might have the same mean and still be different distributions. For example, here d1 and d2 come from two different normal distributions:

set.seed(1)
d1 = data.frame(x= rnorm(n= 100, mean= 0, sd= 1), dataset='d1')
set.seed(2)
d2 = data.frame(x= rnorm(n= 100, mean= 0, sd= 3), dataset='d2')
mydf = rbind(d1, d2)
boxplot(mydf$x ~ mydf$dataset)
summary(lm(x ~ dataset, data= mydf))$coefficients  # Pr(>|t|) = 0.5767102

There is no difference according to the linear model, since the means are the same, but the Kolmogorov-Smirnov test picks up the difference:

ks.test(d1$x, d2$x)  # p-value = 9.57e-06

That's what I used, ks.test.

First I want to see whether each distribution is not significantly different from a random distribution.

And then compare the distributions to identify which ones differ significantly and which do not. In the end this will tell us whether two distributions have the same spatial organization (because that is what the indexes represent).


"different from a random distribution": note that you have to define which random distribution to test against. It might be uniform, binomial, beta, etc.


Yes, I didn't say: we want to test each distribution against a uniform one.
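For a discrete uniform null over the 101 possible values, one base-R option besides dgof::ks.test is a chi-squared goodness-of-fit test on the value counts. A sketch with simulated data (with few observations per value, the cells would need pooling):

```r
set.seed(1)
x <- sample(seq(0, 1, 0.01), 1010, replace = TRUE)  # ~10 obs per value
# tabulate every possible value, including those never observed
obs <- table(factor(x, levels = seq(0, 1, 0.01)))
res <- chisq.test(obs)  # null: all 101 values equally likely
res
```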


You are right. Moreover, a Poisson model assumes the variance equals the mean, so two datasets with the same mean are assumed to have the same standard deviation. I don't know whether this assumption holds for the OP's data; I understood these are randomly generated discrete values, so a Poisson distribution would not be a bad guess.

The linear model can be useful if, instead of asking whether the two datasets have different distributions, we want to know whether there is an effect of making the observations under the conditions of the first dataset rather than the second.