Dear all,

I have a list of biological entities (say, genes) and I would like to compute all unique pairs (e.g. (A,B,C) -> (A,B), (A,C), (B,C)) and then calculate their similarity based on their GO annotations. So far so good, but the list has about 2000 entities, so the number of unique pairs (2000 choose 2) is about 2 million, and computing the similarity for all of them would take months. I started computing the similarity for all pairs, and it took 3 weeks to get through 68000 pairs. As the similarity measure I use the GAPGOM method.
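
For scale, the figures above can be checked with a short Python sketch. It assumes the observed rate (68000 pairs in 3 weeks) stays constant, which is only an approximation, and the variable names are illustrative.

```python
from itertools import combinations
from math import comb

# Unique unordered pairs for the small example list.
genes = ["A", "B", "C"]
pairs = list(combinations(genes, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Number of unique pairs for 2000 entities.
n_genes = 2000
n_pairs = comb(n_genes, 2)
print(n_pairs)  # 1999000

# Extrapolate from the observed rate, assuming it stays constant.
weeks_per_pair = 3 / 68000
estimated_weeks = n_pairs * weeks_per_pair
print(round(estimated_weeks, 1))  # 88.2
```

At that rate the full set would take roughly 88 weeks.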

Can you suggest a sound technique for sampling pairs so that the results are still significant?
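
One simple option is to draw a uniform random sample of distinct pairs without enumerating all of them. The sketch below is only an illustration: the function name `sample_unique_pairs`, the sample size of 5000 and the gene names are made up, and it assumes the list contains no duplicate entries. It uses rejection sampling, which works well when the sample is much smaller than the total number of pairs.

```python
import random
from math import comb

def sample_unique_pairs(items, k, seed=None):
    """Draw k distinct unordered pairs uniformly at random from items."""
    n = len(items)
    if k > comb(n, 2):
        raise ValueError("k exceeds the number of unique pairs")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        i, j = rng.sample(range(n), 2)      # two distinct positions
        chosen.add((min(i, j), max(i, j)))  # canonical order, so (i, j) == (j, i)
    return [(items[i], items[j]) for i, j in sorted(chosen)]

# Hypothetical usage: 5000 pairs out of 2000 placeholder gene names.
genes = [f"gene{i}" for i in range(2000)]
sample = sample_unique_pairs(genes, 5000, seed=0)
print(len(sample))  # 5000
```

Fixing the seed makes the sample reproducible, so the same pairs can be scored by both similarity functions.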

Thank you in advance!

Why are you doing this with so many pairs? I don't think the GAPGOM package was designed with this in mind.

Because I am working on a similarity function and I want to test how well it correlates with GO similarity. I can't really do that with a small number of gene pairs, because then the result would not be significant.

On the point of comparing gene similarity with GO similarity, you may be interested in this paper, on which I collaborated.

How are you running the computation? Typical semantic similarity measures, such as Resnik's, computed across the GO biological process domain should take a few hours for all ~20000 human protein-coding genes, without parallelization.
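
To illustrate why such measures can be fast: Resnik similarity between two terms is the information content (IC) of their most informative common ancestor, so once ancestor sets and IC values are precomputed, each pair reduces to a set intersection and a maximum. The sketch below is a toy, not GAPGOM or any real GO library. The five-term ontology and the annotation counts are made up, and the gene-level score (maximum over term pairs) is just one common choice.

```python
from math import log

# Hypothetical toy ontology: term -> parents (not real GO terms).
PARENTS = {
    "root": [],
    "t1": ["root"],
    "t2": ["root"],
    "t3": ["t1"],
    "t4": ["t1", "t2"],
}

# Hypothetical annotation counts, already propagated up to ancestors.
COUNTS = {"root": 100, "t1": 40, "t2": 30, "t3": 10, "t4": 5}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Precompute once; every pairwise lookup afterwards is cheap.
ANCESTORS = {t: ancestors(t) for t in PARENTS}
IC = {t: log(COUNTS["root"] / COUNTS[t]) for t in PARENTS}

def resnik(term_a, term_b):
    """IC of the most informative common ancestor of two terms."""
    common = ANCESTORS[term_a] & ANCESTORS[term_b]
    return max(IC[t] for t in common)

def gene_similarity(terms_a, terms_b):
    """Gene-level score: best Resnik score over all term pairs."""
    return max(resnik(a, b) for a in terms_a for b in terms_b)

# t3 and t4 share the ancestors t1 and root; t1 is the more informative one.
print(resnik("t3", "t4") == IC["t1"])  # True
```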