How To Get Pair-Wise Sequence Similarity Score For > 1000 Protein.
12.8 years ago
User2718 ▴ 10

I know how to compare one query protein against a large database for sequence similarity using BLAST.

Is it possible to get sequence similarity scores for a set of proteins among themselves? I have sequences (FASTA format) for ~1000 proteins, and I want to run a similarity search of all of them against each other.

Can I do it with online BLAST or the EBI tools? Are there any other tools or software that could do it efficiently?

Many thanks.

protein sequence similarity • 7.5k views
12.8 years ago
Leszek 4.2k

And why not use standard blastall?

To create the database, just type:

formatdb -i your_sequences.fasta

To run the search (-a: number of cores to use; -m 8: tabular output):

blastall -p blastp -d your_sequences.fasta -i your_sequences.fasta -a no_of_threads -m 8 -o blast_output

Of course, you can use online bl2seq and download the results from there, but I suppose the command line would be much more convenient.
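
Once the search has finished, the tabular output can be reduced to one score per protein pair. The following is a minimal sketch, assuming the standard 12-column tab-separated layout of -m 8 (query, subject, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score). The function name `best_hits` and the protein IDs in the sample data are made up for illustration.

```python
import csv
import io


def best_hits(handle):
    """Return {(query, subject): (percent_identity, bit_score)}.

    Keeps only the highest-scoring hit for each ordered pair and
    skips self-hits. Assumes 12 tab-separated columns per line.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        if query == subject:
            continue  # a protein always matches itself
        percent_identity = float(row[2])
        bit_score = float(row[11])
        key = (query, subject)
        if key not in best or bit_score > best[key][1]:
            best[key] = (percent_identity, bit_score)
    return best


# Made-up sample lines in the same layout as the blast_output file.
sample = io.StringIO(
    "protA\tprotA\t100.00\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n"
    "protA\tprotB\t45.50\t110\t58\t2\t5\t114\t3\t110\t2e-20\t98.6\n"
    "protA\tprotB\t30.00\t40\t28\t0\t60\t99\t10\t49\t0.5\t30.0\n"
    "protB\tprotA\t45.50\t110\t58\t2\t3\t110\t5\t114\t3e-20\t97.8\n"
)
hits = best_hits(sample)
for (query, subject), (pident, bits) in sorted(hits.items()):
    print(query, subject, pident, bits)
```

For a real run, pass `open("blast_output")` instead of the sample. Note that BLAST scores are not guaranteed to be symmetric, so (A, B) and (B, A) are kept as separate entries here.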


the better way ;)

12.8 years ago

If you want to get the pairwise similarity between all 1000 protein sequences in your dataset, try this:

  1. Align your FASTA sequences using a standard sequence alignment program.
  2. Use alistat (written by Sean Eddy as part of SQUID, a C function library for sequence analysis) to get the pairwise information.

Read the alistat man pages here. You can get various alignment statistics.

A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively.
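
To make the definition above concrete, here is a minimal sketch that applies it to an already aligned set of sequences. It is not alistat itself: the function name `percent_identity`, the treatment of `-` and `.` as gap characters, and the three toy sequences are assumptions made for illustration.

```python
from itertools import combinations

GAP_CHARS = set("-.")  # assumed gap symbols


def percent_identity(aligned1, aligned2):
    """100 * idents / MIN(len1, len2) for two rows of one alignment.

    idents counts columns where both rows carry the same residue;
    len1 and len2 are the unaligned (gap-free) sequence lengths.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("rows of an alignment must have equal length")
    a, b = aligned1.upper(), aligned2.upper()
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    len1 = sum(1 for x in a if x not in GAP_CHARS)
    len2 = sum(1 for y in b if y not in GAP_CHARS)
    shortest = min(len1, len2)
    return 100.0 * idents / shortest if shortest else 0.0


# Toy alignment (made up); real input would come from the alignment step.
alignment = {
    "seqA": "MKT-AYIA",
    "seqB": "MKTQAYLA",
    "seqC": "MR--AYLA",
}

# All N(N-1)/2 unordered pairs.
pairs = {
    (n1, n2): percent_identity(alignment[n1], alignment[n2])
    for n1, n2 in combinations(sorted(alignment), 2)
}
average = sum(pairs.values()) / len(pairs)
most_related = max(pairs, key=pairs.get)
most_unrelated = min(pairs, key=pairs.get)
print(pairs, average, most_related, most_unrelated)
```

Because the denominator is the shorter unaligned length, a short sequence that matches well over its full length scores high even against a much longer partner.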

If you want to query a set of sequences against a custom database of 1000 sequences, try A: How To Get Pair-Wise Sequence Similarity Score For > 1000 Protein.

12.8 years ago
lh3 33k

Probably the best choice is swps3. It performs vectorized Smith-Waterman alignment, so 1000x1000 comparisons should not take long. You may also consider a multiple alignment tool (e.g. muscle) if these proteins are known to come from a gene family.
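
For readers unfamiliar with what a Smith-Waterman score is, the sketch below computes the best local alignment score for two sequences in plain Python. It is only an illustration of the recurrence, not swps3 and not vectorized. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for simplicity; real protein comparisons normally use a substitution matrix such as BLOSUM62 with affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no traceback).

    Uses two rows of the dynamic-programming matrix, so memory is
    proportional to len(b). Scoring parameters are illustrative.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(
                0,                  # start a new local alignment
                diag,               # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Toy sequences (made up) sharing the local region "ACDEF".
print(smith_waterman_score("XXACDEFYY", "ZZACDEFQQ"))
```

An all-against-all run would call this for every pair, which is why optimized implementations matter at 1000x1000.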


Note that swps3 is buggy; see http://diagonalsw.sourceforge.net/#swps3. It is not just a simple software bug: the algorithm itself is wrong.


In addition to SWPS3, there is also diagonalsw, which achieves a similar speed. A drawback of SWPS3 is that its algorithm is buggy and sometimes gives the wrong result. For details see diagonalsw.sourceforge.net/#swps3
