I'm looking for a way to find similar pairs of protein sequences within a large set. (10,000-100,000 sequences, and maybe more) The sequence lengths are between 8-15 amino acids.
I would like similarity to be defined as score under a substituion matrix. It would be great to be able to handle pairs of different lengths as well.
I've come accross SlideSort. It can do all pairs similarity search for millions of sequences but only in terms of edit distance not score under a substitution matrix.
I've also come accross a few papers that give a method for creating a hamming space embedding of protein sequences under a substitution matrix:
Detecting protein sequence conservation via metric embeddings
Provably sensitive indexing strategies for biosequence similarity search.
Efficient large-scale sequence comparison by locality-sensitive hashing
The embedding is a mapping from protein sequence to bit vector such that the hamming distance between two bit vectors is related to the score of the two protein sequences that map to those two bit vectors.
However these bit vectors representations can be very large and are far apart in distance. So in order to find pairs that are similar the author creates a filter by sampling a few bits (let's say 100) from random positions in the hamming space embedding of all the sequences. Those sequences that have the same sample bits get binned together and get thoroughly compared using the substituion matrix.
The filter lets in some false positives and screens out some false negatives. But its not a problem because the false positives get screened when you do the expensive comparison using the substitution matrix. And the false negative rate can be controlled with the number of iterations and the number of sample bits.
The down side of this approach is that I can only compare sequences of equal length. And its not an exhuastive search.
Thanks a megabase
I should clarify that I only want pairs of sequences from the set that have a similarity score above a certain threshold. And I want to avoid doing an all by all comparison to find these pairs.
To determine the scores so you can apply a threshold, you will have to perform the actual comparisons. As noted in my answer there are ways of reducing the number of comparisons that are required, which may help, and depending on the available compute performing the comparisons may not take too long.