Setting histograms apart
7.8 years ago
baxy ▴ 170

Hi,

This is probably a trivial question, but what I need to do is quantify how far apart two histograms are. Example: let's say I have a histogram with 10 categories, and my two bordering cases are: a) a histogram with 20 occurrences in the first category and none in the rest, and b) a histogram with 2 occurrences in each category. Now imagine the whole set of histograms that follow one simple rule: going from left to right (first to 10th category, i in [1..10]), the occurrence count in category i + 1 can never be higher than in category i.

Is this clear so far? What I am looking for is a way to say: my histogram x (which lies between the two bordering cases) is Z far from one bordering scenario and K far from the other.

So what I need is a measure that tells me how randomized a given histogram is, where 2 occurrences in every category counts as totally randomized and 20 occurrences in a single category counts as completely ordered.

Has anyone come across this problem?

thnx

Robert

R sequence alignment sequencing blast • 1.5k views

OK, so for anyone looking for a solution to a similar problem: there is this thing called a Young tableau. The math on it is well worked out, so basically you can avoid "wanky" stats and get exact solutions for discrete variables.

ps

everything has a solution these days
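
One way to read the Young tableau remark: a histogram whose counts never increase from left to right has the shape of a Young diagram, that is, an integer partition of the total count. The Python sketch below is only an illustration of that reading, not the poster's actual method. It enumerates every allowed histogram for 20 occurrences in 10 categories; the function names `partitions` and `as_histogram` are made up for this example.

```python
def partitions(total, max_parts, max_value=None):
    """Yield non-increasing tuples of positive integers that sum to `total`
    and use at most `max_parts` parts (each part <= `max_value`)."""
    if max_value is None:
        max_value = total
    if total == 0:
        yield ()
        return
    if max_parts == 0:
        return
    # Try the largest possible first part first, then recurse on the remainder.
    for first in range(min(total, max_value), 0, -1):
        for rest in partitions(total - first, max_parts - 1, first):
            yield (first,) + rest


def as_histogram(partition, categories):
    """Pad a partition with zeros so that it has one count per category."""
    return list(partition) + [0] * (categories - len(partition))


# Every histogram allowed by the rule in the question (20 occurrences, 10 categories).
histograms = [as_histogram(p, 10) for p in partitions(20, 10)]

print(len(histograms), "allowed histograms")
print("first:", histograms[0])
print("last: ", histograms[-1])
```

With this enumeration order the list starts at the completely ordered case and ends at the evenly spread one, so the exhaustive set is small enough to compute any candidate measure on all of it exactly.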

7.8 years ago

Histograms represent probability distributions (once the counts are divided by their total), so you're looking at comparing distributions. A commonly used measure of distance between probability distributions is the Kullback-Leibler divergence (or its derivative, the Jensen-Shannon divergence, which is symmetric and stays finite when a category is empty in one of the histograms). It's available in the R entropy package and an example of use is available here. Another one is the earth mover's distance (R package emdist).
However, there may be simpler ways to deal with your problem depending on your goal. For example, you could count overlaps or, as already mentioned, compare measures of spread if you don't care about central tendencies.
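
To make the two distances concrete, here is a minimal Python sketch using only the standard library (not the R packages named above). It assumes both histograms share the same ordered categories and that neighbouring categories are one unit apart for the earth mover's distance. The function names and the example histogram `x` are invented for illustration.

```python
import math


def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one occurrence")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.
    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    p, q = normalize(counts_a), normalize(counts_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def earth_movers_distance(counts_a, counts_b):
    """1-D earth mover's distance: total mass carried across each category boundary."""
    p, q = normalize(counts_a), normalize(counts_b)
    carried, work = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi   # surplus that must move on to the next category
        work += abs(carried)
    return work


ordered = [20] + [0] * 9                 # bordering case a)
uniform = [2] * 10                       # bordering case b)
x = [8, 4, 3, 2, 1, 1, 1, 0, 0, 0]       # made-up histogram between the two

for name, ref in (("ordered", ordered), ("uniform", uniform)):
    print(name, "JS:", js_divergence(x, ref), "EMD:", earth_movers_distance(x, ref))
```

The plain Kullback-Leibler divergence against the ordered case would be infinite for any histogram with occurrences outside the first category, which is why the sketch reports the Jensen-Shannon variant instead.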

7.8 years ago

I have never had to deal with such a problem, but I suppose one possible approach is to measure the variability of the counts across the categories. A classic measure of variability is the standard deviation: in the random scenario (2 occurrences in every category) sd = 0, and it gets higher as the histogram becomes more unbalanced.

However, other metrics might be more appropriate here, such as skewness.
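
As a rough Python illustration of the standard deviation idea, the sketch below computes the population standard deviation of the counts and rescales it by its value in the completely ordered case. The rescaling and the names `count_spread` and `relative_spread` are assumptions added for this example, not part of the suggestion above.

```python
import statistics


def count_spread(counts):
    """Population standard deviation of the per-category counts."""
    return statistics.pstdev(counts)


def relative_spread(counts):
    """Spread rescaled to [0, 1]: 0 for the even histogram, 1 when
    all occurrences sit in a single category (assumed rescaling)."""
    total, k = sum(counts), len(counts)
    if total <= 0 or k < 2:
        raise ValueError("need at least one occurrence and two categories")
    most_ordered = [total] + [0] * (k - 1)
    return count_spread(counts) / count_spread(most_ordered)


ordered = [20] + [0] * 9
uniform = [2] * 10
x = [8, 4, 3, 2, 1, 1, 1, 0, 0, 0]   # made-up histogram between the two

for name, h in (("ordered", ordered), ("uniform", uniform), ("x", x)):
    print(name, count_spread(h), relative_spread(h))
```

Because the standard deviation ignores the order of the categories, this score only says how uneven the counts are, not where the occurrences sit.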
