I'm looking for a way to find similar pairs of protein sequences within a large set. (10,000-100,000 sequences, and maybe more) The sequence lengths are between 8-15 amino acids.
I would like similarity to be defined as score under a substituion matrix. It would be great to be able to handle pairs of different lengths as well.
I've come accross SlideSort. It can do all pairs similarity search for millions of sequences but only in terms of edit distance not score under a substitution matrix.
I've also come accross a few papers that give a method for creating a hamming space embedding of protein sequences under a substitution matrix:
The embedding is a mapping from protein sequence to bit vector such that the hamming distance between two bit vectors is related to the score of the two protein sequences that map to those two bit vectors.
However these bit vectors representations can be very large and are far apart in distance. So in order to find pairs that are similar the author creates a filter by sampling a few bits (let's say 100) from random positions in the hamming space embedding of all the sequences. Those sequences that have the same sample bits get binned together and get thoroughly compared using the substituion matrix.
The filter lets in some false positives and screens out some false negatives. But its not a problem because the false positives get screened when you do the expensive comparison using the substitution matrix. And the false negative rate can be controlled with the number of iterations and the number of sample bits.
The down side of this approach is that I can only compare sequences of equal length. And its not an exhuastive search.
Thanks a megabase