Question

Software for evaluating multiple sequence alignments before phylogenetic analysis

3

10.0 years ago

Andrzej Zielezinski 11k

Since phylogenetic analysis depends largely on the quality of the primary sequence data, I usually curate alignments manually and exclude phylogenetically uninformative or misleading sites.

But do you know of any tool that could tell me how good my alignment is before I send it to a phylogenetic program? Are there any programs or methods for assessing the phylogenetic signal in a given alignment?

phylogenetics msa tree phylogeny alignment • 4.5k views

Updated 2.7 years ago by Ram 44k • written 10.0 years ago by Andrzej Zielezinski 11k

1

10.0 years ago

arnstrm ★ 1.9k

T-Coffee, a collection of alignment tools, has a utility called M-Coffee that evaluates the alignments produced by different aligners and combines them into a single alignment. Maybe you can try that.

From their documentation:

One of the most common situation when building multiple sequence alignments is to have several alignments produced by several alternative methods, and not knowing which one to choose. In this section, we show you that you can use M-Coffee to combine your many alignments into one single alignment. We show you here that you can either let T-Coffee compute all the multiple sequence alignments and combine them into one, or you can specify the methods you want to combine. M-Coffee is not always the best methods, but extensive benchmarks on BaliBase, Prefab and Homstrad have shown that it delivers the best alignment 2 times out of 3. If you do not want to use the methods provided by M-Coffee, you can also combine pre-computed alignments.

Updated 2.7 years ago by Ram 44k • written 10.0 years ago by arnstrm ★ 1.9k

Ram · Accepted Answer · 2014-10-03

The 'quality' of an alignment is somewhat ambiguous given that an alignment is an inference of homology; we may or may not have a good idea of which sites are homologous in any given sequence set. Alignment algorithms compute alignment scores by assigning certain values to matches, mismatches, gap openings (insertions/deletions), and gap extensions. These scores are then used to judge whether one alignment is better than another by simply comparing them. However, the scoring scheme is arbitrary, so a score is only meaningful relative to other alignments of the same sequences scored under the same scheme.

If you are concerned with the quality of your hand-curated alignment (and you may not need to be - expert 'by eye' alignments are often considered acceptable!), I would use your aligner of choice (MAFFT, perhaps?), compute the score of your alignment, and compare it to the score of the alignment produced or refined by the program.
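To make the score comparison concrete, here is a minimal sketch of a sum-of-pairs score with affine gap costs. It is not how MAFFT or any other aligner scores internally; the function names, the toy sequences, and the default values for `match`, `mismatch`, `gap_open`, and `gap_extend` are all made up for illustration, which is exactly the point about the scheme being arbitrary.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score one pairwise projection of an alignment with affine gap costs.

    The default values are illustrative assumptions, not a standard scheme.
    """
    score = 0
    prev_gap = None  # which row ("a" or "b") had a gap in the last scored column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is empty for this pair, so skip it
        if x == "-" or y == "-":
            cur = "a" if x == "-" else "b"
            # Continuing a gap in the same row costs less than opening one.
            score += gap_extend if cur == prev_gap else gap_open
            prev_gap = cur
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


def sum_of_pairs(seqs, **scheme):
    """Sum the pairwise scores over every pair of rows in the alignment."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(pair_score(a, b, **scheme) for a, b in combinations(seqs, 2))


# Two hypothetical alignments of the same three toy sequences.
curated = ["ACG-TA", "ACGGTA", "AC--TA"]
automatic = ["ACGT-A", "ACGGTA", "ACT--A"]

print("curated:  ", sum_of_pairs(curated))
print("automatic:", sum_of_pairs(automatic))
# Rescoring with different (equally arbitrary) values changes the numbers.
print("curated, free gaps:", sum_of_pairs(curated, gap_open=0, gap_extend=0))
```

Only compare scores that were computed with the same scheme on the same set of sequences; for protein data you would normally replace the match/mismatch values with a substitution matrix.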

One other concern: excluding misleading sites is one thing (the program Gblocks will remove regions thought to result from spurious alignment), but removing non-informative sites can impact your analysis. For model-based phylogenetic approaches, invariant or slowly evolving sites are included in the model either as a proportion of invariable sites or as part of the gamma distribution modeling among-site rate heterogeneity. If you are only selecting variable sites, it may be inappropriate to concatenate them and apply a model or a single evolutionary history. If this is the case, I would recommend SNAPP by David Bryant and others, which estimates species trees from SNP data while treating individual gene trees as nuisance parameters.
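If you want to see how many sites of each kind you have before deciding to drop anything, a rough tally like the one below may help. This is only a sketch under assumptions: it assumes nucleotide data, treats `-`, `?`, and `N` as missing, and uses the usual parsimony definition of an informative site (at least two states, each present in at least two sequences). The function names and toy sequences are hypothetical.

```python
from collections import Counter


def classify_column(column, ignore="-?N"):
    """Classify one alignment column.

    Assumes nucleotide data; characters in `ignore` are treated as missing.
    A column with no usable characters is reported as "constant".
    """
    counts = Counter(c.upper() for c in column if c.upper() not in ignore)
    if len(counts) <= 1:
        return "constant"
    # States seen in at least two sequences can support a grouping.
    shared = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if shared >= 2 else "variable-uninformative"


def summarize_sites(seqs):
    """Count how many columns fall into each class."""
    return Counter(classify_column(col) for col in zip(*seqs))


# Hypothetical toy alignment.
alignment = ["ACGTA", "ACGTA", "ATGCA", "ATGCT"]
for kind, n in sorted(summarize_sites(alignment).items()):
    print(f"{kind}: {n}")
```

The tally is meant for inspection only: constant and singleton sites still inform the invariable-sites and gamma parameters of a model-based analysis, so counting them is safer than deleting them.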

Hope this helps.