Question: A Rank-Weighted Similarity Score
5
a3cel2 (Canada) wrote 8.1 years ago:

I want to compare two runs of a similar experiment that ranks genes by an arbitrary score. Any standard correlation metric (e.g. Spearman rank correlation) shows that the two runs are not very similar. However, this is because the experiment is set up in such a way that it gives very confident measurements for the top scores, but further down the ranking the results become mostly noise. So let's say I measured 1000 genes, and only the top ~30 or so from each run are the ones I care about.

Is there any similarity metric I could use that gives a higher penalty for differences in rank between high-confidence genes (so I would punish it a lot if a gene is rank 10 in one experiment and rank 100 in the other), with the penalty dropping off as the ranks being compared get lower (so I don't particularly care if something is rank 500 in one experiment and rank 1000 in the other)?

I could simply do a top-N overlap approach, but this seems a little simplistic, and I don't know how to pick N in an unbiased fashion.

similarity • 2.6k views
modified 3.4 years ago by Biostar • written 8.1 years ago by a3cel2
2
matted (Boston, United States) wrote 8.1 years ago:

One approach would be to find an empirical significance threshold, identify "positive" genes by that criterion, and assess overlap between positive gene sets in a standard way (e.g. hypergeometric test).
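To make the overlap test concrete, here is a minimal sketch in plain Python of the hypergeometric tail probability for the overlap between two "positive" gene sets. The function name `overlap_pvalue` and the example numbers (1000 genes, 30 positives per run, 12 shared) are made up for illustration; how you pick the significance threshold is a separate decision.

```python
from math import comb

def overlap_pvalue(total, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn at random from `total`
    genes, of which size_a are 'positive' (hypergeometric upper tail)."""
    max_k = min(size_a, size_b)
    denom = comb(total, size_b)
    tail = sum(
        comb(size_a, k) * comb(total - size_a, size_b - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / denom

# Hypothetical numbers: 1000 genes, 30 positives in each run, 12 in common.
p = overlap_pvalue(1000, 30, 30, 12)
print(f"P(overlap >= 12 by chance) = {p:.3g}")
```

A small p-value here only says the two positive sets overlap more than expected by chance; it does not say anything about agreement in rank order within the sets.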

If you're feeling more technical, you could look into IDR ("irreproducible discovery rate"), a concept/statistical framework developed by the ENCODE project. It deals with this exact question of how to compare ranked lists of scores (assigned to genomic regions). See here and here for more details.

written 8.1 years ago by matted
1
Ryan Thompson (TSRI, La Jolla, CA) wrote 8.1 years ago:

You can make CAT plots using the Bioconductor package matchBox. These plots essentially plot top-N overlap for all values of N.

http://www.bioconductor.org/packages/devel/bioc/html/matchBox.html
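For anyone who wants to see the idea without installing the package, here is a rough sketch of "top-N overlap for all values of N" in plain Python. This is not the matchBox API, and the function name `cat_curve` and the toy gene lists are invented for the example; it assumes each list is ordered from best to worst and contains no duplicate genes.

```python
def cat_curve(ranking_a, ranking_b, max_n=None):
    """Return the proportion of genes shared by the top-n of both
    rankings, for n = 1 .. max_n (one value per n)."""
    if max_n is None:
        max_n = min(len(ranking_a), len(ranking_b))
    return [
        len(set(ranking_a[:n]) & set(ranking_b[:n])) / n
        for n in range(1, max_n + 1)
    ]

# Toy rankings (hypothetical gene names), best gene first.
run_a = ["g1", "g2", "g3", "g4"]
run_b = ["g2", "g1", "g4", "g3"]
curve = cat_curve(run_a, run_b)
for n, proportion in enumerate(curve, start=1):
    print(n, round(proportion, 3))
```

Plotting the proportion against n gives the curve, so you can see where agreement starts to fall off instead of committing to a single N up front.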

written 8.1 years ago by Ryan Thompson

I was just wondering: is the correlation for dataSetA.t.and.dataSetB.t higher than that for dataSetA.t.vs.dataSetC in the vignette (http://www.bioconductor.org/packages/release/bioc/vignettes/matchBox/inst/doc/matchBox.pdf)?

written 6.1 years ago by Zhilong Jia
1
Leonor Palmeira (Liège, Belgium) wrote 8.1 years ago:

I don't think there is much of a difference between: having to define N arbitrarily (which is like giving confidence 1 to the N top genes and confidence 0 to all the lower genes), and having to define a confidence measure (or noise measure) that is very high for the top genes (confidence around 1), and lower when you get to the bottom of your experiment (confidence decreasing to 0).

Correct me if I'm wrong, but I am under the impression that you do not have such a score available, so you would have to define it arbitrarily.

If you do have such a confidence score (let's say a p-value of some sort), I think you are looking for some kind of weighted rank correlation.
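As one possible reading of "weighted rank correlation", here is a small sketch: a weighted Pearson correlation computed on the ranks, where each gene carries a confidence weight. The function name `weighted_correlation` and the choice of weights (1/rank in the first run) are assumptions for illustration only; with p-values you might instead derive the weights from something like -log10(p), which is also an assumption and not something stated in this thread.

```python
from math import sqrt

def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w.
    Passing ranks as x and y gives a weighted Spearman-style coefficient."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(var_x * var_y)

# Toy example: six genes, ranks in run A, and two hypothetical versions of run B.
ranks_a = [1, 2, 3, 4, 5, 6]
swap_top = [2, 1, 3, 4, 5, 6]      # disagreement among the top genes
swap_bottom = [1, 2, 3, 4, 6, 5]   # disagreement among the bottom genes
weights = [1 / r for r in ranks_a] # assumed weighting: confidence decays with rank

r_top = weighted_correlation(ranks_a, swap_top, weights)
r_bottom = weighted_correlation(ranks_a, swap_bottom, weights)
print(round(r_top, 3), round(r_bottom, 3))
```

With these weights a swap among the top genes lowers the score more than the same swap near the bottom, which is the behaviour asked for in the question. If you would rather use something off the shelf, SciPy has `scipy.stats.weightedtau`, a weighted Kendall tau that gives more importance to the top-ranked items.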

modified 8.1 years ago • written 8.1 years ago by Leonor Palmeira

Actually, I do have a p-value available, so this is perfect. Thank you!

written 8.1 years ago by a3cel2