Question: All Pairs Similarity Search Under A Substitution Matrix For A Large Number Of Short Protein Sequences
gravatar for Harish Tella
8.5 years ago by
Harish Tella20
United States
Harish Tella20 wrote:

I'm looking for a way to find similar pairs of protein sequences within a large set. (10,000-100,000 sequences, and maybe more) The sequence lengths are between 8-15 amino acids.

I would like similarity to be defined as score under a substituion matrix. It would be great to be able to handle pairs of different lengths as well.

I've come accross SlideSort. It can do all pairs similarity search for millions of sequences but only in terms of edit distance not score under a substitution matrix.

I've also come accross a few papers that give a method for creating a hamming space embedding of protein sequences under a substitution matrix:

Detecting protein sequence conservation via metric embeddings

Provably sensitive indexing strategies for biosequence similarity search.

Efficient large-scale sequence comparison by locality-sensitive hashing

The embedding is a mapping from protein sequence to bit vector such that the hamming distance between two bit vectors is related to the score of the two protein sequences that map to those two bit vectors.

However these bit vectors representations can be very large and are far apart in distance. So in order to find pairs that are similar the author creates a filter by sampling a few bits (let's say 100) from random positions in the hamming space embedding of all the sequences. Those sequences that have the same sample bits get binned together and get thoroughly compared using the substituion matrix.

The filter lets in some false positives and screens out some false negatives. But its not a problem because the false positives get screened when you do the expensive comparison using the substitution matrix. And the false negative rate can be controlled with the number of iterations and the number of sample bits.

The down side of this approach is that I can only compare sequences of equal length. And its not an exhuastive search.

Thanks a megabase

similarity protein • 5.0k views
ADD COMMENTlink written 8.5 years ago by Harish Tella20

I should clarify that I only want pairs of sequences from the set that have a similarity score above a certain threshold. And I want to avoid doing an all by all comparison to find these pairs.

ADD REPLYlink written 8.5 years ago by Harish Tella20

To determine the scores so you can apply a threshold, you will have to perform the actual comparisons. As noted in my answer there are ways of reducing the number of comparisons that are required, which may help, and depending on the available compute performing the comparisons may not take too long.

ADD REPLYlink written 8.5 years ago by Hamish3.1k
gravatar for Hamish
8.5 years ago by
Hamish3.1k wrote:

This sounds like you want to perform an all-against-all comparison (also known as all-vs-all). Using the more generic pairwise sequence alignment tools will address the size issue and the scoring matrix issue.

As I first step I recommend identifying identical sequences and ensuring each unique sequence occurs only once when performing the alignment steps. Since the alignments you get a deterministic there is no point in repeating the same alignment multiple times. See "How to remove the same sequences in the FASTA files?" for details of various methods to perform this type of identification and identity collapse/removal.

ADD COMMENTlink modified 10 months ago by RamRS28k • written 8.5 years ago by Hamish3.1k

+1 for mentioning ssearch.

ADD REPLYlink written 8.5 years ago by lh332k

Thanks, but I should have mentioned I want to avoid doing an all by all comparison. Doing (100,000 x 100,000) alignments would take a really long time -- days, I would think.

ADD REPLYlink written 8.5 years ago by Harish Tella20
gravatar for Harish Tella
8.4 years ago by
Harish Tella20
United States
Harish Tella20 wrote:

I've found a method and implementation that already solves this problem. Using it I can quickly find similar pairs without doing an all by all comparison.

Check out these papers:

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

PFMFind: a system for discovery of peptide homology and function

The code for PFMFind is available at:

Note from Dr. Stojmirovic:

At this stage, I am not sure if you only need the indexing scheme (FSindex, written in C) or the entire PFMFind, which contains a GUI, a database schema and a protocol for communicating between the components. The database backend is Postgresql (I think it would be relatively easy to add some code for sqlite3 these days) and setting up the entire system can be tedious. On the other hand, constructing FSindexes is quite easy - all you need is a fasta file and a desired alphabet partition.

You only need the FSIndex and it's python bindings to solve the original problem.

There is also a way to allow a single gap when searching the index. Message me for more info.

Some instructions for getting going:

git clone

As soon as you clone it, you need to switch to the modern-deps branch

cd PFMFind

git checkout modern-deps

This repository contains all my code for FSindex/PFMFind. To construct the distribution for the Python package, you must edit the Makefile in the FSindex directory to set the PYDIR variable - it should point to the include directory containing your Python headers. Then do

make swig

and then change into FSindex/src and create a distribution with

cd FSindex/src

python sdist

Alternatively, you can copy (or symlink) FSindex/src/swig/ into FSindex/src/ShortFrags/Expt and place FSindex/src/ShortFrags into your Python path.

I am assuming you are using Unix and you know how to install and work with Python packages.

The executable scripts are in FSindex/src/ShortFrags/scripts - I think the docs could still be useful on how to use them.

As for the Python packages that are dependences, here is what I'm using Pmw==1.3.2 biopython==1.58 numpy==1.6.1 psycopg2==2.4.4 scipy==0.9.0

ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Harish Tella20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1672 users visited in the last hour