Question: Why does ranking count data lead to a uniform distribution of the features?
jack790 wrote:

One of the normalization methods for count data is ranking: you rank the expression of each gene across different samples, and this leads to a uniform distribution for every gene.

Does anyone know why it leads to a uniform distribution?

sequencing rna-seq next-gen R • 1.6k views
modified 5.3 years ago by Michael Dondrup (46k) • written 5.3 years ago by jack790
Michael Dondrup (46k) wrote:

I guess it doesn't. I guess you do not mean 'uniform'? Please try to read up a bit on the topic before you make such claims.

Edit: your claim is not (generally) true when there are ties in the ranks! (Proof trivial.)

The ranks are uniformly distributed from zero to the sample size. Hence, the ranks lead to exactly the same distribution for all genes, which directly leads to exactly equal means and variances for all genes.

While you claim:

"you rank the expression of each gene across different samples, and this leads to a uniform distribution for every gene."

These do not have the same meaning. Your sentence could be read as saying that the variation in counts across samples for each gene follows a uniform distribution, whereas what the authors state is that, for each sample, the ranks of the counts are uniformly distributed.

The reason is simple: if you rank N items, you get a different permutation of 1..N as ranks for each sample; however, each value from the domain 1..N is present exactly once, making the distribution uniform (identical probability for each value). This disregards ties, though! If there are many ties, then the resulting distribution of ranks is no longer uniform.

That is true for any set of N integers, by the way, because a set contains each value exactly once, yielding a probability of 1/N for each value.
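To make the point about ties concrete, here is a minimal Python sketch. The count values are made up for illustration, and `average_ranks` is a hypothetical helper that gives tied values the mean of their positions (one common convention, not necessarily the one used in the paper).

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Hypothetical counts for 6 genes in one sample, all distinct.
no_ties = [120, 5, 33, 980, 47, 0]
# Hypothetical counts with many zeros, which is common in count data.
with_ties = [0, 0, 0, 980, 47, 0]

ranks_no_ties = average_ranks(no_ties)
ranks_with_ties = average_ranks(with_ties)

# How often does each rank value occur?
freq_no_ties = Counter(ranks_no_ties)
freq_with_ties = Counter(ranks_with_ties)

print(sorted(freq_no_ties.items()))    # every rank 1..6 occurs exactly once
print(sorted(freq_with_ties.items()))  # rank 2.5 occurs four times
```

Without ties, each of the six ranks occurs once, so each has relative frequency 1/6. With four tied zeros, the rank 2.5 takes up four of the six positions and the distribution is no longer uniform.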

(Side note: one can also work out from a different angle how this sentence was meant, that is, whether the ranking is across repeated measurements or samples (rows) or within samples (columns):

Imagine counts of several genes: which ranks make sense to use? If we ranked the count values within each row (that is, ranked each gene across the samples in the columns), we would have no normalization for library size whatsoever, and therefore the ranks would not be very informative.

If you compared ranks of genes within columns (rank all genes in each sample), the library size effect would be removed, achieving effective normalization. Now, in this case, the means of the row ranks cannot be equal for each pair of genes unless your ranked counts are totally random. Instead, the more reproducible your samples are, the better the rank agreement will be between all samples.)
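The two ranking directions in the side note can be compared on a small example. This is only a sketch: the count matrix is invented, with sample B constructed as sample A sequenced ten times deeper and sample C as a slightly noisy replicate, and `ranks` is a hypothetical helper that assumes there are no ties.

```python
def ranks(values):
    """Ranks 1..N for a list of distinct values (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

# Hypothetical count matrix: rows are genes, columns are samples A, B, C.
counts = [
    [10, 100, 12],      # gene 1
    [50, 500, 45],      # gene 2
    [3, 30, 5],         # gene 3
    [200, 2000, 180],   # gene 4
]

n_genes, n_samples = len(counts), len(counts[0])

# Rank all genes within each sample (column-wise).
columns = [[counts[g][s] for g in range(n_genes)] for s in range(n_samples)]
within_sample = [ranks(col) for col in columns]

# Rank each gene across the samples (row-wise).
within_gene = [ranks(row) for row in counts]

# Mean rank of each gene over the samples, using the within-sample ranks.
mean_rank_per_gene = [
    sum(within_sample[s][g] for s in range(n_samples)) / n_samples
    for g in range(n_genes)
]

print(within_sample)       # the same gene ordering in every sample
print(within_gene)         # sample B always gets the top rank
print(mean_rank_per_gene)  # differs from gene to gene
```

In this toy data, ranking within samples gives the same ranks in all three columns even though sample B has ten times the depth. Ranking within genes mostly picks up the depth difference, since sample B gets the highest rank for every gene. The per-gene mean ranks differ because the samples agree with each other.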

Indeed it does! So do you have a reason why not? ;-) Look at Figure 2 of this paper.


In science, you need to use very precise language. For example, "leads to a uniform distribution for each gene" means something different from "leads to a uniform distribution over all genes".