Question

Why does ranking count data lead to a uniform distribution of the features?

9.8 years ago

jack ▴ 960

One of the normalization methods for count data is ranking: you rank the expression of each gene across different samples, and this leads to a uniform distribution for every gene.

Does anyone know why it leads to a uniform distribution?

RNA-Seq R rna-seq next-gen sequencing • 2.7k views

updated 2.4 years ago by Ram 43k • written 9.8 years ago by jack ▴ 960

Ram · Answer 1 · 2014-07-25

I guess it doesn't. I guess you do not mean 'uniform'? Please try to read a bit on the topic before you make such claims.

Edit: your claim is not (generally) true when there are ties in the ranks! (Proof trivial.)

Please read "3.2.9 Ranks" of the paper you cite:

The ranks are uniformly distributed from zero to the sample size. Hence, the ranks lead to exactly the same distribution for all genes, which directly leads to exactly equal means and variances for all genes.

While you claim:

you rank the expression of each gene across different samples, and this leads to a uniform distribution for every gene.

These do not have the same meaning. Your sentence could be interpreted to mean that the variation in counts across samples for each gene follows a uniform distribution, whereas what the authors state is that, for each sample, the ranks of the counts are uniformly distributed.

The reason is simple: if you rank N items, you get a different permutation of 1..N as ranks for each sample, but each value from the domain 1..N is present exactly once, which makes the distribution uniform (identical probability for each value). This disregards ties, though! If there are many ties, then the resulting distribution of ranks is no longer uniform.
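To make the permutation argument concrete, here is a minimal sketch in plain Python. The count vectors are made up for illustration, and the helper names `ordinal_ranks` and `average_ranks` are hypothetical, not taken from the paper or from any package. It only shows that tie-free ranks hit every value in 1..N exactly once, while average ranks for ties do not.

```python
from collections import Counter

def ordinal_ranks(values):
    """Rank values 1..N; any ties would be broken by original position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def average_ranks(values):
    """Rank values, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Made-up counts, purely for illustration.
distinct_counts = [12, 0, 57, 3, 140]  # no ties
tied_counts = [0, 0, 0, 3, 140]        # three tied zeros

distinct_freq = Counter(ordinal_ranks(distinct_counts))
tied_freq = Counter(average_ranks(tied_counts))

print(distinct_freq)  # every rank 1..5 occurs exactly once
print(tied_freq)      # rank 2.0 occurs three times, so not uniform
```

With many zero counts, as is common in RNA-Seq data, the tied case is the one you would actually see.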

That is true for any set of N distinct integers, by the way, because a set contains each value exactly once, yielding probability 1/N for each value.

(Side note: one can also work out from a different angle how this sentence was meant, that is, whether the ranking is across repeated measurements or samples (rows) or within samples (columns):

Imagine counts for several genes: which ranks make sense to use? If we ranked the count values within each row (that is, ranked across the columns for each gene), we would have no normalization for library size whatsoever, and therefore the ranks would not be very informative.

If you compared ranks of genes within columns (ranking all genes in each sample), the library size effect would be removed, achieving effective normalization. In this case, the means of the row ranks cannot be equal for each pair of genes unless your ranked counts are totally random. Instead, the more reproducible your samples are, the better the rank agreement will be between all samples.)
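The row-versus-column point can be illustrated with a toy matrix. This is only a sketch under stated assumptions: the counts are invented, the second sample is assumed to be the first one sequenced exactly 10x deeper, and `rank_within` is a hypothetical helper that ignores ties.

```python
def rank_within(values):
    """Return ranks 1..N for a list of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# Rows are genes, columns are samples. The second column is the first
# one multiplied by 10, mimicking a deeper library (made-up numbers).
counts = [
    [5, 50],
    [20, 200],
    [1, 10],
    [90, 900],
]
n_samples = len(counts[0])

# Rank all genes within each sample (column-wise).
column_ranks = [
    rank_within([row[s] for row in counts]) for s in range(n_samples)
]

# Rank each gene across the samples (row-wise).
row_ranks = [rank_within(row) for row in counts]

print(column_ranks)  # identical rank vectors: library size no longer matters
print(row_ranks)     # every gene ranks higher in the deeper sample
```

In this toy case the column-wise ranks agree perfectly between the two samples, whereas the row-wise ranks only reflect which sample was sequenced more deeply.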