How to calculate the connectivity score for a compound in Connectivity Map 02?
9.4 years ago
Zhilong Jia ★ 2.2k

In Connectivity Map 02 (build 2), the connectivity score for a compound is derived from the scores of the multiple instances of that compound. I could not find any documentation of this in the original paper or on the website. Thank you.

For instance, the drug H-7 has an average_score of 0.596 and an enrichment score of 0.940; the scores of its four instances are 0.629, 0.593, 0.585, and 0.580. How do you get 0.940 from these four scores?

connectivity-map cmap connectivity-score • 6.0k views
5.8 years ago

For anyone else interested:

Here is a link to documentation of connectivity scores for the old CMap: https://portals.broadinstitute.org/cmap/help_topics_linkified.jsp (also nicely explained in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3868238/)

And here is the link to a description of the scores in the new CMap (v 2.1, clue.io): https://clue.io/connectopedia/cmap_algorithms. Mind you, the algorithms differ.

5.6 years ago

For an instance $i$ (i.e. a perturbagen under specific conditions: cell line, dose, time), the final score $S_i$ depends on the "preliminary" scores $s_k$ of all instances, because it is normalized by their extremes.

For positively connected perturbagens (instances with $s_i > 0$), $s_i$ is divided by the score of the most positively connected instance:

$$S_i = \frac{s_i}{\max_k s_k}$$

For negatively connected perturbagens (instances with $s_i < 0$), it is divided by minus the score of the most negatively connected one:

$$S_i = \frac{s_i}{-\min_k s_k}$$

where $k$ runs over all instances and

$$s_i = \mathrm{up}_i - \mathrm{down}_i$$

For the four H-7 instances in your example:

• $S_{5941}$ = 0.629,
• $S_{5968}$ = 0.593,
• $S_{5963}$ = 0.585,
• $S_{5936}$ = 0.580
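The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize_scores` is made up here, the input values are invented (they are not CMap data), and a raw score of exactly zero is assumed to stay zero.

```python
def normalize_scores(raw):
    """Scale raw scores s_i into connectivity scores S_i in [-1, 1]."""
    top = max(raw)      # score of the most positively connected instance
    bottom = min(raw)   # score of the most negatively connected instance
    scaled = []
    for s in raw:
        if s > 0:
            scaled.append(s / top)
        elif s < 0:
            scaled.append(s / -bottom)
        else:
            scaled.append(0.0)  # assumption: a zero raw score stays zero
    return scaled


# Invented raw scores, for illustration only.
raw = [0.8, 0.4, 0.0, -0.25, -0.5]
print(normalize_scores(raw))  # [1.0, 0.5, 0.0, -0.5, -1.0]
```

After scaling, the most positively connected instance gets +1 and the most negatively connected one gets -1.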

Enrichment score is based on permutations

The connectivity scores $S_i$ are used to sort the list of all instances (perturbagens); if two instances have the same $S_i$, the one with the higher $\mathrm{up}_i$ is placed higher. This gives us the ranks. In your example, the H-7 instances got ranks 174, 305, 339, and 368 (the higher the connectivity score, the higher the position on the list, i.e. the lower the rank).

This list would have a total length of 6100 (the number of instances in the old CMap).
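A minimal sketch of this ordering step, assuming each instance is held as a `(name, S, up)` tuple. The names and values below are invented for illustration and the tuple layout is my own choice, not something taken from CMap.

```python
# Hypothetical instances: (name, S_i, up_i); values are invented.
instances = [
    ("A", 0.60, 0.30),
    ("B", 0.90, 0.50),
    ("C", 0.60, 0.45),
    ("D", -0.20, -0.10),
]

# Sort by S_i descending; break ties by up_i descending.
ordered = sorted(instances, key=lambda inst: (-inst[1], -inst[2]))

# Rank 1 is the top of the list.
ranks = {name: position for position, (name, _, _) in enumerate(ordered, start=1)}
print(ranks)  # {'B': 1, 'C': 2, 'A': 3, 'D': 4}
```

Here "A" and "C" share the same score, so "C" is placed higher because its up value is larger.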

Once the ordering is ready, we can pose the following question:

are the chosen instances accumulated near the top of the sorted list of all instances?

and use the Kolmogorov-Smirnov (KS) statistic to assess that. A slightly simplified version is to look at the maximum of the absolute differences between:

• a hypothetical, uniform distribution along the list (let's call it $j$), and
• the real distribution of the analyzed perturbagens (let's call it $V_j$)

As there are four instances considered, the distribution j would simply be:

1/4, 2/4, 3/4 and 4/4 (or [0.25, 0.50, 0.75, 1.00])

while the real distribution $V_j$ of ranks is [174/6100, 305/6100, 339/6100, 368/6100], or [0.0285, 0.0500, 0.0556, 0.0603].

When we subtract the two cumulative distributions (NB it is a nice property of ranks that they give us cumulative distributions), $|j - V_j|$, we get:

[0.2215, 0.4500, 0.6944, 0.9397]

The maximum of those is 0.9397 ≈ 0.94. This is your enrichment score!
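The arithmetic above can be reproduced with a short Python sketch. It implements only the simplified calculation from this answer (maximum absolute difference), using the H-7 ranks and the list length of 6100 quoted above; the variable names are my own.

```python
ranks = [174, 305, 339, 368]  # ranks of the four H-7 instances
n = 6100                      # total number of instances in the sorted list
t = len(ranks)                # number of instances of the compound

# Uniform reference distribution: 1/t, 2/t, ..., t/t.
expected = [(j + 1) / t for j in range(t)]

# Observed distribution: each rank as a fraction of the list length.
observed = [r / n for r in sorted(ranks)]

# Simplified enrichment: the largest absolute gap between the two.
diffs = [abs(e - o) for e, o in zip(expected, observed)]
enrichment = max(diffs)

print([round(d, 4) for d in diffs])  # [0.2215, 0.45, 0.6944, 0.9397]
print(round(enrichment, 4))          # 0.9397
```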

As I mentioned earlier, this is a simplification, as the proper KS calculation would subtract one when considering "negative" values. For detailed formulas, see this chapter of the documentation.
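For completeness, here is a hedged sketch of the signed KS calculation as I understand it from the old CMap documentation: $a = \max_j [j/t - V(j)/n]$, $b = \max_j [V(j)/n - (j-1)/t]$, and the result is $a$ if $a > b$, otherwise $-b$. Please check these formulas against the documentation before relying on them; the function name `ks_enrichment` is my own.

```python
def ks_enrichment(ranks, n):
    """Signed KS-style enrichment of `ranks` within a list of length `n`.

    Assumed formulas (verify against the CMap documentation):
      a = max_j [ j/t - V(j)/n ]
      b = max_j [ V(j)/n - (j-1)/t ]
      result = a if a > b else -b
    """
    t = len(ranks)
    v = sorted(ranks)
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    return a if a > b else -b


# H-7 instances near the top of the list: positive enrichment.
print(round(ks_enrichment([174, 305, 339, 368], 6100), 4))     # 0.9397

# Invented ranks near the bottom of the list: negative enrichment.
print(round(ks_enrichment([5900, 6000, 6050, 6100], 6100), 4))  # -0.9672
```

For the H-7 example the signed version gives the same 0.94 as the simplified one, because the instances sit near the top of the list and $a$ dominates.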

Ps. This plot may help to understand the KS:


Do you happen to know what happens if we score a query signature in which all genes have the same sign? In that case the calculation of what you refer to as up_i (and the authors refer to as ks_i) is not possible, for example in a signature consisting only of negatives. Do we simply substitute zero? To clarify in the language of the original paper: the up tag list would be empty in such a case, so the a and b calculations for it are not possible.