Question

Setting histograms apart

0


7.8 years ago

baxy ▴ 170

Hi,

Well, this is probably a trivial question, but what I need to do is quantify how far apart two histograms are. Example: let's say I have a histogram with 10 categories, and my two bordering cases are: a) a histogram with 20 occurrences in the first category and none in the rest, b) a histogram with 2 occurrences in each category. Now imagine the whole set of histograms one can get by following a simple rule: going from left to right (first to 10th category, i in [1..10]), the occurrence count in category i + 1 can never be higher than in category i.

Is this clear so far? Now, what I am looking for is a way to say: my histogram x (which lies between the two bordering cases) is Z far from one bordering scenario and K far from the other.

So what I need is a measure that tells me how randomized a given histogram is (2 occurrences in each category being totally randomized, and the histogram with 20 occurrences in one category being completely ordered).

Has anyone come across this problem?

thnx

Robert

R sequence alignment sequencing blast • 1.5k views

updated 7.8 years ago by Jean-Karim Heriche 27k • written 7.8 years ago by baxy ▴ 170

0


OK, so for anyone looking for a solution to a similar problem: there is this thing called a Young tableau. The math on it is well worked out, so basically you can avoid "wanky" stats and get exact solutions for discrete variables.

ps

everything has a solution these days

7.7 years ago by baxy ▴ 170
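One way to read the Young tableau comment: a histogram whose counts never increase from left to right is an integer partition of the total count, and its shape is a Young diagram. The set of allowed histograms is therefore finite and can be listed exactly. The sketch below is in plain Python, and the function name `monotone_histograms` is made up for illustration; it is not taken from any package mentioned in this thread.

```python
def monotone_histograms(total, categories, cap=None):
    """Yield every tuple of `categories` non-increasing, non-negative
    counts that sums to `total` (largest first count first)."""
    if cap is None:
        cap = total
    if categories == 0:
        # Nothing left to fill: valid only if all occurrences were placed.
        if total == 0:
            yield ()
        return
    # The first count must be at least ceil(total / categories),
    # otherwise the remaining (smaller or equal) counts cannot reach `total`.
    lowest = -(-total // categories)
    for first in range(min(cap, total), lowest - 1, -1):
        for rest in monotone_histograms(total - first, categories - 1, first):
            yield (first,) + rest


# The question's setting: 20 occurrences spread over 10 categories.
all_histograms = list(monotone_histograms(20, 10))
print(len(all_histograms))   # size of the whole set
print(all_histograms[0])     # the completely ordered extreme
print(all_histograms[-1])    # the evenly spread extreme
```

With the full set in hand, any candidate measure can be evaluated on every allowed histogram, so a given histogram x can be ranked exactly against all the others instead of relying on an approximation.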

score 2 · Answer 1 · 2016-07-08

Histograms, once normalized, represent probability distributions, so you're looking at comparing distributions. A commonly used measure of distance between probability distributions is the Kullback-Leibler divergence (or its derivative, the Jensen-Shannon divergence). It's available in the R entropy package and an example of use is available here. Another one is the earth mover's distance (R package emdist).
However, there may be simpler ways to deal with your problem, depending on your goal. For example, you could count overlaps or, as already mentioned, compare measures of spread if you don't care about central tendencies.
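As a rough illustration of the divergence idea, here is a minimal plain-Python sketch of the Jensen-Shannon divergence applied to the two bordering histograms from the question. It does not use the R packages named above; the function names and the example histogram `x` are made up for illustration. Jensen-Shannon is used rather than raw Kullback-Leibler because the question's histograms contain empty categories, where Kullback-Leibler can be infinite.

```python
import math


def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    p, q = normalize(counts_a), normalize(counts_b)
    # The mixture m is positive wherever p or q is, so no division by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


ordered = [20] + [0] * 9                # bordering case a)
uniform = [2] * 10                      # bordering case b)
x = [8, 5, 3, 2, 1, 1, 0, 0, 0, 0]      # hypothetical histogram in between

print("x vs ordered:", js_divergence(x, ordered))
print("x vs uniform:", js_divergence(x, uniform))
```

The two printed values play the role of Z and K in the question: how far x is from each bordering scenario.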

score 0 · Answer 2 · 2016-07-08

I've never had to deal with such a problem, but I suppose one possible approach is to measure the variability of the counts across the categories. A classic measure of variability is the standard deviation: in the random scenario (2 occurrences in each category), sd = 0, and it gets higher as the histogram becomes more unbalanced.

However, other metrics might be more appropriate here, like skewness.
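A minimal sketch of the standard-deviation idea, in plain Python. Dividing by the standard deviation of the completely ordered histogram is an extra step assumed here (it is not part of the answer above) so that the result runs from 0 for the evenly spread case to 1 for the completely ordered case. The name `order_score` is hypothetical.

```python
from statistics import pstdev


def order_score(counts):
    """Population sd of the counts, scaled by the sd of the histogram
    that puts every occurrence in a single category."""
    total, k = sum(counts), len(counts)
    most_ordered = [total] + [0] * (k - 1)
    return pstdev(counts) / pstdev(most_ordered)


uniform = [2] * 10                      # 2 occurrences in each category
ordered = [20] + [0] * 9                # everything in the first category
x = [8, 5, 3, 2, 1, 1, 0, 0, 0, 0]      # hypothetical histogram in between

for name, hist in [("uniform", uniform), ("x", x), ("ordered", ordered)]:
    print(name, pstdev(hist), order_score(hist))
```

Note that the standard deviation ignores the order of the categories, which is harmless here only because the question's rule already forces the counts to be non-increasing.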