Question: Get Pairwise Sequence Identity
gravatar for Chris
10.0 years ago by
Chris1.6k wrote:


for some project I need to run pairwise sequence alignments (local ones, i.e. Smith-Waterman) on protein sequences in order to produce a similarity matrix based on the percentage of identical aligned residues. Now, one approach could be to use blast for that: First create the blast database out of the fastafile that contains all sequences to be pairwisely compaired (using formatdb). Then, for each sequence in the file run blast against that database. But that wouldn't return me alignments against _all_ sequences but rather the ones below the given e-value. Setting the e-value to some huge number doesn't seem to be the proper solution to me (which e-value to use? Besides, when setting it too high, the returned alignments exceed the number of available sequences in the database.). Could there be another solution with blast? I'd rather stick with blast if possible before using some other not so popular aligner.

Thanks, Chris

alignment blast • 7.6k views
ADD COMMENTlink modified 9.4 years ago • written 10.0 years ago by Chris1.6k
gravatar for brentp
10.0 years ago by
Salt Lake City, UT
brentp23k wrote:

I think you will have to use a global sequence alignment per protein pair. BLAST is local, it will find smaller chucks that match within a pair of proteins--that is why you get more alignments than sequences in the database. You could use BLAST and set the e-value high as you suggest, but then you'll need some heuristic to decide which HSP / chunk to take from each pair.

The EMBOSS package has an executable called needle which does Needleman-Wunsch global sequence alignment. BioPython has a wrapper for that and other aligners. This also provides a command-line interface and a python API to Needleman-Wunsch so you can get and score alignments.

ADD COMMENTlink written 10.0 years ago by brentp23k
gravatar for Darked89
10.0 years ago by
Barcelona, Spain
Darked894.2k wrote:

Protein alignments of non-related / not similar proteins will give you most likely nonsensical values. Aligners almost never throw a towel, and will give you an output even if you use heamoglobine vs collagen. So if your set of proteins is not pre-screened, you should have some threshold for rejecting random noise comparisons. Blastp score is a good place to look first.

If you go with a non-blast aligners, then (according to my more phylogeny-oriented friend), the good rule is to perform two alignments, one regular (ABCDE vs PQXYZ) and one with reversed protein sequences (EDCBA vs ZYXQP). Only the residues aligning in both forward and reverse alignments are reliable.

EDIT: clarified wording

ADD COMMENTlink modified 10.0 years ago • written 10.0 years ago by Darked894.2k
gravatar for Stefano Berri
10.0 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

If I understood correctly, you want to make a matrix allProteins vs allProteins with "distances" in each cell. And you want to use identity %. Maybe you could use "blast two sequences" (individually aligning every pair of sequences) instead of aligning against a database.

Another suggestion. Why, instead, don't you produce a distance matrix using a phylogenetic program like Phylip? PROTDIST should do. With the advantage that instead of using similarity % you also take into account amminoacid substitution rates...

ADD COMMENTlink modified 10.0 years ago • written 10.0 years ago by Stefano Berri4.1k
gravatar for Giovanni M Dall'Olio
10.0 years ago by
London, UK
Giovanni M Dall'Olio27k wrote:

Are you sure your problem is well defined? Local pairwise alignments with blast may be not the solution you are looking for.

When you calculate the local alignment between two sequences, it means that you are looking for small regions which are similar between the two, and that doesn't give you a measure of how much two sequences are distant. For example have a look at these two sequences:


What will you obtain if you calculate the pairwise local alignment between the two? is it a measure of the distance between the two sequences?

I would try with global alignment instead. I am not sure of which option you can pass to blast to force it to calculate the global alignment between all the sequeces, but maybe it will be enough to give a small cutoff.

ADD COMMENTlink modified 20 months ago by RamRS27k • written 10.0 years ago by Giovanni M Dall'Olio27k
gravatar for Chris
10.0 years ago by
Chris1.6k wrote:

Thanks all for your contributions. It's indeed a local alignment I want. Imagine for example two proteins, each containing two domains: Protein 1 contains domains A,B, protein 2 contains B,C. Now, when using a global alignment that would try to align over the whole sequences which doesn't make sense in my specific case. A local alignment would align only both regions B. As for the similarity of both, this would only be the first step. My goal is to calculate the similarity by means of hssp values which basically relates sequence identity with alignment length, as proposed in some papers (e.g. Sander, Schneider 1991, Proteins or Rost 1999 Protein Engineering). I'm now using the 'water' aligner from the emboss suite.

ADD COMMENTlink written 10.0 years ago by Chris1.6k

as I was saying you in my answer, you may have the case of a protein in which one of the two has a duplicated domain; to continue your example, you could have A,B and A,A,B. In that case, the local alignment doesn't give you any useful information on the distance between proteins. In any case, I agree with darked89 in that you have to do a pre-filtering of the proteins before, otherwise your comparisons don't make sense. The distance you want to calculate is a measure of how many mutations are needed to one seq to anotehr, but if they don't have a common origin, you can't do that.

ADD REPLYlink written 10.0 years ago by Giovanni M Dall'Olio27k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1151 users visited in the last hour