Question

All Pairs Similarity Search Under A Substitution Matrix For A Large Number Of Short Protein Sequences

2

Entering edit mode

12.2 years ago

Harish Tella ▴ 20

I'm looking for a way to find similar pairs of protein sequences within a large set. (10,000-100,000 sequences, and maybe more) The sequence lengths are between 8-15 amino acids.

I would like similarity to be defined as score under a substituion matrix. It would be great to be able to handle pairs of different lengths as well.

I've come accross SlideSort. It can do all pairs similarity search for millions of sequences but only in terms of edit distance not score under a substitution matrix.

I've also come accross a few papers that give a method for creating a hamming space embedding of protein sequences under a substitution matrix:

Detecting protein sequence conservation via metric embeddings

Provably sensitive indexing strategies for biosequence similarity search.

Efficient large-scale sequence comparison by locality-sensitive hashing

The embedding is a mapping from protein sequence to bit vector such that the hamming distance between two bit vectors is related to the score of the two protein sequences that map to those two bit vectors.

However these bit vectors representations can be very large and are far apart in distance. So in order to find pairs that are similar the author creates a filter by sampling a few bits (let's say 100) from random positions in the hamming space embedding of all the sequences. Those sequences that have the same sample bits get binned together and get thoroughly compared using the substituion matrix.

The filter lets in some false positives and screens out some false negatives. But its not a problem because the false positives get screened when you do the expensive comparison using the substitution matrix. And the false negative rate can be controlled with the number of iterations and the number of sample bits.

The down side of this approach is that I can only compare sequences of equal length. And its not an exhuastive search.

Thanks a megabase

protein similarity • 6.2k views

ADD COMMENT • link 12.2 years ago by Harish Tella ▴ 20

0

Entering edit mode

I should clarify that I only want pairs of sequences from the set that have a similarity score above a certain threshold. And I want to avoid doing an all by all comparison to find these pairs.

ADD REPLY • link 12.2 years ago by Harish Tella ▴ 20

0

Entering edit mode

To determine the scores so you can apply a threshold, you will have to perform the actual comparisons. As noted in my answer there are ways of reducing the number of comparisons that are required, which may help, and depending on the available compute performing the comparisons may not take too long.

ADD REPLY • link 12.2 years ago by Hamish ★ 3.2k

Ram · Answer 1 · 2012-02-21

4

Entering edit mode

12.2 years ago

Hamish ★ 3.2k

This sounds like you want to perform an all-against-all comparison (also known as all-vs-all). Using the more generic pairwise sequence alignment tools will address the size issue and the scoring matrix issue.

As I first step I recommend identifying identical sequences and ensuring each unique sequence occurs only once when performing the alignment steps. Since the alignments you get a deterministic there is no point in repeating the same alignment multiple times. See "How to remove the same sequences in the FASTA files?" for details of various methods to perform this type of identification and identity collapse/removal.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.2 years ago by Hamish ★ 3.2k

0

Entering edit mode

+1 for mentioning ssearch.

ADD REPLY • link 12.2 years ago by lh3 33k

0

Entering edit mode

Thanks, but I should have mentioned I want to avoid doing an all by all comparison. Doing (100,000 x 100,000) alignments would take a really long time -- days, I would think.

ADD REPLY • link 12.2 years ago by Harish Tella ▴ 20

score 0 · Answer 2 · 2012-03-08

I've found a method and implementation that already solves this problem. Using it I can quickly find similar pairs without doing an all by all comparison.

Check out these papers:

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

PFMFind: a system for discovery of peptide homology and function

The code for PFMFind is available at: http://git.stojmirovic.org/PFMFind.git

Note from Dr. Stojmirovic:

At this stage, I am not sure if you only need the indexing scheme (FSindex, written in C) or the entire PFMFind, which contains a GUI, a database schema and a protocol for communicating between the components. The database backend is Postgresql (I think it would be relatively easy to add some code for sqlite3 these days) and setting up the entire system can be tedious. On the other hand, constructing FSindexes is quite easy - all you need is a fasta file and a desired alphabet partition.

You only need the FSIndex and it's python bindings to solve the original problem.

There is also a way to allow a single gap when searching the index. Message me for more info.

Some instructions for getting going:

git clone http://git.stojmirovic.org/PFMFind.git

As soon as you clone it, you need to switch to the modern-deps branch

cd PFMFind

git checkout modern-deps

This repository contains all my code for FSindex/PFMFind. To construct the distribution for the Python package, you must edit the Makefile in the FSindex directory to set the PYDIR variable - it should point to the include directory containing your Python headers. Then do

make swig

and then change into FSindex/src and create a distribution with

cd FSindex/src

python setup.py sdist

Alternatively, you can copy (or symlink) FSindex/src/swig/FS.so into FSindex/src/ShortFrags/Expt and place FSindex/src/ShortFrags into your Python path.

I am assuming you are using Unix and you know how to install and work with Python packages.

The executable scripts are in FSindex/src/ShortFrags/scripts - I think the docs could still be useful on how to use them.

As for the Python packages that are dependences, here is what I'm using Pmw==1.3.2 biopython==1.58 numpy==1.6.1 psycopg2==2.4.4 scipy==0.9.0