Question

How To Get Pair-Wise Sequence Similarity Score For > 1000 Protein.

1

12.8 years ago

User2718 ▴ 10

I know how to compare one query protein against a large database for sequence similarity using BLAST.

Is it possible to get sequence similarity scores for a set of proteins compared among themselves? I have sequences (FASTA format) for ~1000 proteins, and I want to run an all-against-all similarity search on them.

Can I do this with online BLAST or the EBI tools? Are there any other tools or software that could do it efficiently?

Many thanks.

protein sequence similarity • 7.5k views

ADD COMMENT • link updated 12.8 years ago by Khader Shameer 18k • written 12.8 years ago by User2718 ▴ 10

score 2 · Answer 1 · 2011-06-29

2

12.8 years ago

Leszek 4.2k

Why not use standard blastall?

To create the database, just type:

formatdb -i your_sequences.fasta

To run the search (-a: number of cores to use; -m 8: tabular output):

blastall -p blastp -d your_sequences.fasta -i your_sequences.fasta -a no_of_threads -m 8 -o blast_output

Of course, you could use the online bl2seq and download the results from there, but the command line would be much more convenient, I suppose.
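The tabular output can then be reduced to one score per protein pair. Below is a minimal Python sketch, assuming the standard 12-column tabular layout (query id, subject id, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score). The function name `best_pairwise_hits` and the three example rows are made up for illustration; in practice you would pass `open("blast_output")` instead of the in-memory example.

```python
import csv
from io import StringIO


def best_pairwise_hits(handle):
    """Return {(id_a, id_b): (percent_identity, bit_score)}.

    Keeps the highest-scoring hit for each unordered pair and skips
    self-hits. Assumes the 12-column BLAST tabular layout.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (some tabular variants add them).
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        if query == subject:
            continue
        identity = float(row[2])    # column 3: percent identity
        bit_score = float(row[11])  # column 12: bit score
        pair = tuple(sorted((query, subject)))
        if pair not in best or bit_score > best[pair][1]:
            best[pair] = (identity, bit_score)
    return best


# Made-up rows standing in for the contents of blast_output.
example = StringIO(
    "P1\tP1\t100.00\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250\n"
    "P1\tP2\t45.50\t110\t58\t2\t5\t114\t3\t110\t2e-25\t95.1\n"
    "P2\tP1\t45.50\t110\t58\t2\t3\t110\t5\t114\t2e-25\t96.3\n"
)
hits = best_pairwise_hits(example)
for (id_a, id_b), (identity, bit_score) in sorted(hits.items()):
    print(id_a, id_b, identity, bit_score)
```

Treating A-vs-B and B-vs-A as the same pair is a design choice for a symmetric similarity table; keep both directions if you need them separately. Pairs with no reported hit are simply absent from the result.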

ADD COMMENT • link 12.8 years ago by Leszek 4.2k

0

the better way ;)

ADD REPLY • link 12.8 years ago by Cjt ▴ 370

Ram · Answer 2 · 2011-06-30

If you want to get the pairwise similarity between all 1000 protein sequences in your dataset, try this:

Align your FASTA sequences using a standard sequence alignment program.
Use alistat (written by Sean Eddy as part of SQUID, a C function library for sequence analysis) to get pairwise information.

See the alistat man page; it reports various alignment statistics.

A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively.
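The definition quoted above can be written out as a short Python sketch. This is not alistat itself, only an illustration of the formula; the function name `percent_identity`, the gap characters `-` and `.`, the multiplication by 100, and the three toy sequences are all assumptions made for the example.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """100 * idents / MIN(len1, len2) for two rows of one alignment.

    idents counts columns where both rows carry the same residue;
    len1 and len2 are the unaligned (gap-free) sequence lengths.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("rows of an alignment must have equal length")
    a_row, b_row = aligned_a.upper(), aligned_b.upper()
    idents = sum(
        1 for a, b in zip(a_row, b_row) if a == b and a not in gap_chars
    )
    len_a = sum(1 for c in a_row if c not in gap_chars)
    len_b = sum(1 for c in b_row if c not in gap_chars)
    shortest = min(len_a, len_b)
    return 100.0 * idents / shortest if shortest else 0.0


# Toy alignment; real input would come from your alignment program.
alignment = {"seqA": "MKV-LAT", "seqB": "MKVQLGT", "seqC": "M-VQL-T"}

# All N(N-1)/2 unordered pairs.
pairwise = {
    (name_a, name_b): percent_identity(alignment[name_a], alignment[name_b])
    for name_a, name_b in combinations(sorted(alignment), 2)
}
average = sum(pairwise.values()) / len(pairwise)
most_related = max(pairwise, key=pairwise.get)
most_unrelated = min(pairwise, key=pairwise.get)
```

For 1000 sequences this loop visits 499,500 pairs, which is still manageable in plain Python for typical protein lengths.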

If you want to query a set of sequences against a custom database of 1000 sequences, try A: How To Get Pair-Wise Sequence Similarity Score For > 1000 Protein.

Ram · Answer 3 · 2011-06-29

0

12.8 years ago

lh3 33k

Probably the best choice is swps3. It performs vectorized Smith-Waterman alignment. 1000x1000 should not take long. Also, you may consider a multiple alignment tool (e.g. muscle) if these proteins are known to come from a gene family.
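For readers unfamiliar with what a Smith-Waterman score is, here is a minimal, unvectorized Python sketch of the recurrence. It is not swps3 and makes no attempt at speed. The scoring values (match +2, mismatch -1, linear gap -2) are placeholder assumptions; real protein comparisons normally use a substitution matrix such as BLOSUM62 with affine gap penalties.

```python
from itertools import combinations


def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences.

    Fills the dynamic-programming matrix row by row, keeping only the
    previous row in memory, and returns the largest cell value.
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap        # gap in seq_b
            left = current[j - 1] + gap   # gap in seq_a
            cell = max(0, diagonal, up, left)  # 0 restarts the local alignment
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Toy sequences; replace with the records parsed from your FASTA file.
proteins = {"p1": "HEAGAWGHEE", "p2": "PAWHEAE", "p3": "HEAGAWGHEE"}
scores = {
    (a, b): smith_waterman_score(proteins[a], proteins[b])
    for a, b in combinations(sorted(proteins), 2)
}
```

The all-against-all loop is the same pattern as for any pairwise measure: N(N-1)/2 calls, which is where a vectorized implementation pays off.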

ADD COMMENT • link 12.8 years ago by lh3 33k

0

Note that swps3 is buggy (see http://diagonalsw.sourceforge.net/#swps3 ). It is not just a simple software bug; the algorithm itself is plain wrong.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.1 years ago by Erik Sjölund • 0

0

In addition to SWPS3, there is also diagonalsw, which achieves a similar speed. A drawback of SWPS3 is that its algorithm is buggy and sometimes gives the wrong result. For details, see diagonalsw.sourceforge.net/#swps3

ADD REPLY • link 12.1 years ago by Erik Sjölund • 0