Question

Two Sequence Alignment And Alignment Quality Measure



11.0 years ago

epigene ▴ 590

I want to compare two sequences and see how similar they are. I'm thinking of doing a pairwise (two-sequence) alignment. I can do this one pair at a time using CLUSTALW, but there seems to be no score or measure of how good the alignment is. I also have hundreds of sequence pairs, so I need a tool that can handle all of them.

Does anyone have recommendations on the available tools to use?

Thanks!

sequence alignment comparison • 7.0k views

updated 11.0 years ago by Hamish ★ 3.2k • written 11.0 years ago by epigene ▴ 590

score 9 · Answer 1 · 2013-05-03

First off, using a multiple sequence alignment tool, such as ClustalW, to perform a pairwise sequence alignment (i.e. two sequences) is not a good idea. Multiple sequence alignment programs cannot produce optimal alignments in a reasonable time because of the nature of the multiple alignment problem, so they rely on heuristics to produce the best alignments they can. As a consequence, the alignment produced is not guaranteed to be the best possible alignment.

In contrast, pairwise alignment is a well-understood problem, and a number of methods exist that produce the best possible alignment for two sequences. So the options that come to mind for your case are:

Local/local pairwise alignment (local alignment in both sequences, identifies regions of similarity):
- EMBOSS water: Smith & Waterman optimal local alignment
- EMBOSS matcher: Waterman-Eggert optimal local alignment
- FASTA suite lalign: Huang & Miller optimal local alignment with FASTA statistics
- FASTA suite SSEARCH: Smith & Waterman optimal local alignment with FASTA statistics

Global/global pairwise alignment (global alignment in both sequences, gives an end-to-end alignment):
- EMBOSS needle or needleall: Needleman-Wunsch optimal alignment (higher memory usage)
- EMBOSS stretcher: Myers and Miller optimal alignment (less memory usage)
- FASTA suite GGSEARCH: Needleman-Wunsch optimal alignment with FASTA statistics

Global/local pairwise alignment (global in one sequence and local in the other):
- FASTA suite GLSEARCH: optimal global/local alignment with FASTA statistics

For an overview of the three types of pairwise alignment, see the Sequence alignment article on Wikipedia.
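To make the idea of an alignment score and an identity measure concrete, here is a minimal sketch of a global (Needleman-Wunsch style) alignment in plain Python. It is only an illustration: the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function names are assumptions chosen for this example, and the sketch uses a simple linear gap penalty, whereas tools such as EMBOSS needle use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Scoring values are illustrative assumptions, not the defaults of any tool.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACGT", "ACT")
    print(score, top, bottom, percent_identity(top, bottom))
```

Note that percent identity depends on the denominator chosen (alignment length here); other conventions divide by the shorter sequence length, so compare values only when they were computed the same way.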

Depending on the size of your sequence set and the nature of the comparisons you want to make, you could run each pair separately or run sets of alignments. For some suggestions about the mechanics of doing this see:
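As a rough illustration of running each pair separately (this is not one of the linked suggestions), the sketch below loops over a list of FASTA file pairs and calls EMBOSS needle once per pair. It assumes EMBOSS is installed and `needle` is on the PATH, that each file holds a single sequence, and that the output is in needle's default pair format, whose header contains `# Identity:` and `# Score:` lines. The helper names, the output file naming, and the gap penalty values are assumptions made for this example.

```python
import re
import subprocess
from pathlib import Path


def needle_command(a_fasta, b_fasta, out_path, gapopen=10.0, gapextend=0.5):
    """Build the argument list for one EMBOSS needle run (assumed on PATH)."""
    return [
        "needle",
        "-asequence", str(a_fasta),
        "-bsequence", str(b_fasta),
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", str(out_path),
        "-auto",  # suppress interactive prompts
    ]


def parse_needle_summary(text):
    """Pull identity and score from the header of a pair-format report.

    Assumes header lines such as '# Identity: 36/149 (24.2%)' and
    '# Score: 45.5'; returns only the fields that were found.
    """
    summary = {}
    found = re.search(r"^# Identity:\s+(\d+)/(\d+)", text, re.MULTILINE)
    if found:
        summary["identity_pct"] = 100.0 * int(found.group(1)) / int(found.group(2))
    found = re.search(r"^# Score:\s+([-\d.]+)", text, re.MULTILINE)
    if found:
        summary["score"] = float(found.group(1))
    return summary


def run_pairs(pairs, out_dir, runner=subprocess.run):
    """Run needle for every (a_fasta, b_fasta) pair; return the report paths.

    `runner` is injectable so the loop can be exercised without EMBOSS.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, (a_fasta, b_fasta) in enumerate(pairs):
        out_path = out_dir / f"pair_{index:04d}.needle"  # hypothetical naming
        runner(needle_command(a_fasta, b_fasta, out_path), check=True)
        outputs.append(out_path)
    return outputs
```

With the reports written, something like `parse_needle_summary(path.read_text())` for each returned path would give one row of identity and score per pair. For large sets, a tool that aligns sets of sequences in one invocation (such as needleall) avoids the per-process overhead of this loop.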

If you are only looking for scores, you might want to see whether any of the projects listed for "Precalculated Sequence Identities" would help, since it seems reasonable that at least some of your sequences are present in other databases. To quickly identify sequences that appear in the public databases, you can use services such as:

- Protein Identifier Cross-Reference Service (PICR): map protein sequences to database identifiers using UniParc
- SeqCksum: get checksums for a sequence, which can then be used to query databases such as UniParc or EMBL-Bank for identical sequences
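The checksum idea can be sketched locally with Python's standard library. This is not SeqCksum itself: the function name and the normalisation (strip whitespace, uppercase) are assumptions made for this example, and which checksum types a given database accepts for lookup should be checked against that database's documentation.

```python
import hashlib


def sequence_checksums(sequence):
    """Return MD5 and SHA-1 hex digests of a normalised sequence string.

    Normalisation here (remove whitespace, uppercase) is an assumption;
    a database may define its checksum over a different normal form.
    """
    normalised = "".join(sequence.split()).upper().encode("ascii")
    return {
        "md5": hashlib.md5(normalised).hexdigest(),
        "sha1": hashlib.sha1(normalised).hexdigest(),
    }


if __name__ == "__main__":
    # Two spellings of the same sequence give the same checksums.
    print(sequence_checksums("acgt acgt\n") == sequence_checksums("ACGTACGT"))
```

Identical checksums only flag exact matches after normalisation; sequences that differ by a single residue will have unrelated checksums, so this complements alignment rather than replacing it.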