Question: Two Sequence Alignment And Alignment Quality Measure
7.1 years ago by
epigene490
United States
epigene490 wrote:

I want to compare two sequences and see how similar they are. I'm thinking of doing a pairwise (two-sequence) alignment. I can do this one pair at a time using CLUSTALW, but there seems to be no score/measure of how good the alignment is. I also have hundreds of sequence pairs, so I need a tool that can handle all of them.

Does anyone have recommendations on available tools to use?

Thanks!

sequence alignment comparison • 5.4k views
modified 7.1 years ago by Hamish3.1k • written 7.1 years ago by epigene490
7.1 years ago by
Hamish3.1k
UK
Hamish3.1k wrote:

First off, using a multiple sequence alignment tool, such as ClustalW, to perform a pairwise sequence alignment (i.e. two sequences) is not a good idea. Multiple sequence alignment programs cannot provide optimal alignments in a reasonable time due to the nature of the multiple alignment problem, and so employ a number of techniques to produce the best alignments they can. As a consequence, the alignment produced is not guaranteed to be the best possible alignment.

In contrast, pairwise alignment is a well-understood problem, and a number of methods exist that produce the best possible alignment for two sequences. The options that come to mind for your case are:

• Local/local pairwise alignment (local alignment in both sequences, identifies regions of similarity):
  • EMBOSS water: Smith & Waterman optimal local alignment
  • EMBOSS matcher: Waterman-Eggert optimal local alignment
  • FASTA suite lalign: Huang & Miller optimal local alignment with FASTA statistics
  • FASTA suite SSEARCH: Smith & Waterman optimal local alignment with FASTA statistics
• Global/global pairwise alignment (global alignment in both sequences, gives an end-to-end alignment):
  • EMBOSS needle or needleall: Needleman-Wunsch optimal alignment (higher memory usage)
  • EMBOSS stretcher: Myers and Miller optimal alignment (less memory usage)
  • FASTA suite GGSEARCH: Needleman-Wunsch optimal alignment with FASTA statistics
• Global/local pairwise alignment (global in one sequence and local in the other):
  • FASTA suite GLSEARCH: optimal global/local alignment with FASTA statistics

For an overview of the three types of pairwise alignment, see the Sequence alignment article in Wikipedia.
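
To make "optimal alignment with a score" a bit more concrete, here is a minimal sketch of Needleman-Wunsch global alignment in Python. It is not what EMBOSS needle or GGSEARCH actually run: it assumes a simple match/mismatch score and a linear gap penalty (the real tools use substitution matrices and affine gap penalties), and the function names and default values are made up for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scoring).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


score, top, bottom = needleman_wunsch("ACGT", "ACT")
print(score, top, bottom, percent_identity(top, bottom))
```

Note that percent identity depends on the denominator you pick (alignment length here); other tools may divide by the shorter sequence or by aligned columns only, so compare numbers from the same tool.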

Depending on the size of your sequence set and the nature of the comparisons you want to make, you could run each pair separately or run sets of alignments. For some suggestions about the mechanics of doing this see:
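
As a rough sketch of the "run each pair separately" route, the Python below loops over a list of file pairs and calls EMBOSS needle once per pair. It assumes needle is installed and on your PATH and that each sequence sits in its own FASTA file; the file names, the output naming scheme and the helper names are hypothetical, and the gap penalties shown are just example values.

```python
import os
import shutil
import subprocess


def needle_command(a_fasta, b_fasta, out_path, gapopen=10.0, gapextend=0.5):
    """Build the argument list for one EMBOSS needle run."""
    return ["needle",
            "-asequence", a_fasta,
            "-bsequence", b_fasta,
            "-gapopen", str(gapopen),
            "-gapextend", str(gapextend),
            "-outfile", out_path,
            "-auto"]  # -auto suppresses interactive prompts


def run_pairs(pairs, out_dir="."):
    """Run needle on each (a_fasta, b_fasta) pair; returns the output paths."""
    if shutil.which("needle") is None:
        raise RuntimeError("EMBOSS needle was not found on PATH")
    outputs = []
    for idx, (a_fasta, b_fasta) in enumerate(pairs, start=1):
        # Hypothetical naming scheme: pair_0001.needle, pair_0002.needle, ...
        out_path = os.path.join(out_dir, f"pair_{idx:04d}.needle")
        subprocess.run(needle_command(a_fasta, b_fasta, out_path), check=True)
        outputs.append(out_path)
    return outputs


if __name__ == "__main__":
    # Hypothetical input files, one sequence per file.
    run_pairs([("seq1.fasta", "seq2.fasta"), ("seq3.fasta", "seq4.fasta")])
```

Each output file contains the alignment together with its score and identity/similarity figures, which you can then collect into a table.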

If you are only looking for scores you might want to see if any of the projects listed for "Precalculated Sequence Identities" would help, since it seems reasonable that at least some of your set of sequences are present in other databases. To quickly identify sequences which appear in the public databases you can use services such as:

Thanks for those tool recommendations!

I have found FASTA and tried it. It outputs a similarity score and percentage identity and a visual alignment. I think it's good enough for my purpose.

I do have a question on your answer. My sequences are kind of random sequences. How do I decide if it's local/local or global/global or global/local pairwise alignment?

Thanks!


The choice between local/local, global/global and global/local is driven by the nature of the sequences.

Global/global (or just plain 'global') aligns the sequences from end-to-end, and so suits cases where you expect the sequences to be similar over their whole length and they are co-linear (i.e. little to no rearrangement). This works well with closely related sequences.

Local/local (or just plain 'local') finds regions of similarity, and thus copes with rearrangements and with alignments between sub-sequences and super-sequences, since it does not require end-to-end similarity. This flexibility is why general purpose sequence similarity search methods, such as BLAST and FASTA, use local pairwise alignments (and some nifty statistics) to find database sequences which are similar to a query sequence. The downside of local alignments is that they are local. While gapping, drop-off and HSPs mitigate this, there are still cases where the local alignment provides insufficient overlap between the two sequences.

Global/local provides a hybrid option which gives end-to-end coverage in one sequence (for GLSEARCH this is the query) while the other sequence need only provide a local region. This is great when searching with sequence fragments, since it provides complete coverage of the query while allowing for the hit to be much longer than the query. This approach is also used in sequence mapping tools, where the aim is to map a short(er) sequence on to a long(er) sequence.
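
The difference between the three modes comes down to which end gaps are free and whether the score may reset to zero. The toy Python function below shows this with score-only dynamic programming. It is only a sketch under simplified assumptions (match/mismatch scoring and a linear gap penalty, with made-up default values), not how GGSEARCH, SSEARCH or GLSEARCH are implemented.

```python
def align_score(query, subject, mode="global", match=1, mismatch=-1, gap=-2):
    """Optimal alignment score for mode 'global', 'local' or 'glocal'.

    'glocal' is global in the query and local in the subject.
    """
    if mode not in ("global", "local", "glocal"):
        raise ValueError("mode must be 'global', 'local' or 'glocal'")
    n, m = len(query), len(subject)
    # Row 0: skipping a subject prefix is only penalised in global mode.
    prev = [j * gap if mode == "global" else 0 for j in range(m + 1)]
    best = 0
    for i in range(1, n + 1):
        # Column 0: skipping a query prefix is only free in local mode.
        cur = [0 if mode == "local" else i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            cell = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            cur[j] = cell
        prev = cur
    if mode == "local":
        return best          # best cell anywhere in the matrix
    if mode == "glocal":
        return max(prev)     # whole query used, subject suffix is free
    return prev[m]           # both sequences used end to end


fragment = "GATTACA"
longer = "CCCCGATTACACCCC"
for mode in ("global", "local", "glocal"):
    print(mode, align_score(fragment, longer, mode))
```

With a short fragment against a longer sequence, the global score is dragged down by the end gaps, while local and global/local both recover the embedded match. The latter two diverge once the query has extra residues that do not match, because global/local still has to account for the whole query.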

So for a rule of thumb:

• Global alignment works best for sequences of similar length. Because of this, and to improve performance, GGSEARCH only performs alignments for sequences between 80% and 120% of the length of the query.
• Local alignment works well for the general case, but can cause problems if alignment coverage matters.
• Global/local alignment is ideal for cases where you know that one sequence must meet the global criteria, but are not sure about the other.
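
If you want to pre-screen your pairs along these lines, a length check like the one below is enough to flag pairs where a global alignment is unlikely to be meaningful. This is just a sketch in Python: the function name is hypothetical, and the 80%/120% defaults are simply the window quoted above, not values read from GGSEARCH itself.

```python
def within_length_window(query_len, subject_len, low_pct=80, high_pct=120):
    """True if subject_len lies between low_pct% and high_pct% of query_len."""
    # Integer arithmetic avoids floating point surprises at the boundaries.
    return low_pct * query_len <= 100 * subject_len <= high_pct * query_len


pairs = [(100, 95), (100, 300), (250, 200)]
for query_len, subject_len in pairs:
    ok = within_length_window(query_len, subject_len)
    print(query_len, subject_len, "global is reasonable" if ok else "prefer local or global/local")
```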

For a first pass, you are probably best off going with local alignments. Then, depending on those results, you may want to examine a subset using global or global/local alignment.

modified 7.1 years ago • written 7.1 years ago by Hamish3.1k