Question

Should I Remove All Positions Containing A Gap In A Multiple Alignment Prior To Creating A Phylogenetic Tree?

7

13.3 years ago

Panos ★ 1.8k

On the one hand, if I remove all gap-containing positions then I lose a lot of "information". On the other hand, if I don't, I'm not really sure that it's correct.

What do you usually do?

phylogenetics multiple • 15k views

ADD COMMENT • link updated 13.3 years ago by Cmzmasek ▴ 10 • written 13.3 years ago by Panos ★ 1.8k

0

Thank you all guys for all your answers and comments! It was really helpful for me (hopefully for others, too)! I'm choosing Dave's as the correct answer, just because that's what I'm thinking to do! Thank you all!

ADD REPLY • link 13.3 years ago by Panos ★ 1.8k

score 8 · Answer 1 · 2011-07-06

13.3 years ago

scapella ▴ 390

Hi Panos,

I understand your concerns about the gaps' influence on phylogenetic tree reconstruction. As has been said before, I recommend keeping the gaps, but not all of them. You could use one of the programs specifically designed for this: Gblocks, BMGE, trimAl, etc. As for which parameters to use for trimming your alignment, I'd suggest trimAl (trimal.cgenomics.org) because it automatically detects the gap distribution in your alignment and then selects the most appropriate parameters to discriminate conflicting columns (those with a really low phylogenetic signal-to-noise ratio) from informative ones. There are two or three different automated algorithms for trimming your alignment in terms of gaps, residue similarity, etc.
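For anyone who wants to script this, here is a minimal sketch of calling trimAl's automated modes from Python. It assumes the `trimal` executable is installed and on your PATH; the file names are placeholders, and the helper functions are hypothetical wrappers, not part of trimAl itself.

```python
import subprocess


def build_trimal_command(in_path, out_path, mode="automated1"):
    """Build a trimAl command line using one of its automated trimming modes."""
    allowed = {"automated1", "gappyout", "strict", "strictplus"}
    if mode not in allowed:
        raise ValueError(f"unsupported mode: {mode}")
    return ["trimal", "-in", in_path, "-out", out_path, f"-{mode}"]


def run_trimal(in_path, out_path, mode="automated1"):
    """Run trimAl on an alignment file; needs the `trimal` binary on PATH."""
    subprocess.run(build_trimal_command(in_path, out_path, mode), check=True)


# Show the command that would be run (file names are placeholders).
print(" ".join(build_trimal_command("aligned.fasta", "trimmed.fasta")))
```

Running the same alignment through more than one mode and comparing the resulting trees is probably more informative than trusting any single setting.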

I hope this might help you.

ADD COMMENT • link 13.3 years ago by scapella ▴ 390

2

Vitis, it depends on the alignment. If we are talking about a single alignment of reasonable size, then manual masking is still the de facto "gold standard"; the problem becomes one of reproducibility. Many of the automated programs are, in some ways, better, because our intuition about substitution patterns isn't always correct. A combination of the two is usually the best bet.

ADD REPLY • link 13.3 years ago by DG 7.3k

1

I'm usually a little skeptical about those automatic algorithms. In terms of alignment, I'm still very old-fashioned and always worried about making any decisions. Homology is a very strong statement implying common ancestry, and I think neither human nor computer is good enough to make the right assignment of homology.

ADD REPLY • link 13.3 years ago by Vitis ★ 2.5k

0

I agree that algorithms would help, especially now that we are (somewhat) no longer limited by the amount of sequence data; manual masking will soon be impossible for the size of alignments we'll get. Maybe it's another topic, but I really wish to see some innovations in phylogenetic analyses using new 'synapomorphies' to figure out clades, such as transposon insertion events, gene duplications, and structural variations like indels.

ADD REPLY • link 13.3 years ago by Vitis ★ 2.5k

score 6 · Answer 2 · 2011-07-06

13.3 years ago

Dave Lunt ★ 2.0k

Leave the gaps Panos, it will be just fine. Either the tree building algorithm will use the information (great), or it will treat it as missing data (no harm done). Actually, its preferred solution may be to do a bit of both.

The only harm that can come is if the gap regions are just very badly aligned sequence and make up a lot of your alignment, in which case you are just feeding it nonsense. If you are confident that you have a pretty decent alignment, then just go for it, gaps and all.

ADD COMMENT • link 13.3 years ago by Dave Lunt ★ 2.0k

2

Not all positions should be retained though. Only columns that you believe, with high confidence, are correctly aligned should be included in the phylogenetic analysis. Otherwise, potentially non-homologous positions are treated as homologous, which gives an incorrect phylogenetic signal.

ADD REPLY • link 13.3 years ago by DG 7.3k

1

I agree Dan. Exploring the effects of using trimAl, Gblocks, etc. to delete ambiguously aligned regions could help to find out whether they are introducing conflicting signal or, perhaps more likely, noise. It really depends on what sort of alignment it is and how many ambiguously aligned regions it has, which we just don't know. I think a good strategy is to run it with gaps and be reasonably confident you are fine. Then test those assumptions before you publish.

ADD REPLY • link 13.3 years ago by Dave Lunt ★ 2.0k

score 4 · Answer 3 · 2011-07-06

I am NOT as profound an expert as some of the other repliers, but I routinely remove gaps and their flanking regions that appear to be ambiguously aligned. I also tend to remove other columns where the alignment quality is dubious. The reason might be that I am typically working with divergent proteins that are hard to align.

For the same reason, an approach like BLOCKS mentioned by Michael Kuhn doesn't help me because my sequences are too divergent. If I am really serious about including only the reliably aligned regions, I use M-Coffee, which takes alignments from multiple good programs, compares them, and scores the agreement of the different methods.
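To make the "agreement between methods" idea concrete, here is a rough sketch. It is not M-Coffee's actual algorithm, just a much cruder check: a column of one alignment counts as supported by another alignment only if that alignment contains exactly the same column (the same residue of every sequence). The toy sequences are invented, and the function names are my own.

```python
def column_signatures(aln):
    """Describe each column by which residue of each sequence it contains.

    aln: list of aligned strings, in the same sequence order for every
    alignment being compared. Gaps become None; residues are numbered
    from 0 within each sequence.
    """
    counters = [0] * len(aln)
    signatures = []
    for col in zip(*aln):
        sig = []
        for row, ch in enumerate(col):
            if ch == "-":
                sig.append(None)
            else:
                sig.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(sig))
    return signatures


def column_support(reference, others):
    """Per reference column: fraction of other alignments with an identical column."""
    other_sets = [set(column_signatures(a)) for a in others]
    return [
        sum(sig in s for s in other_sets) / len(other_sets)
        for sig in column_signatures(reference)
    ]


# Two toy alignments of the same three sequences, as if from two programs.
aln_a = ["ACG-T", "ACGGT", "A-GGT"]
aln_b = ["ACG-T", "ACGGT", "AG-GT"]
print(column_support(aln_a, [aln_b]))
```

Columns with low support are the ones where the programs disagree, and so are candidates for removal.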

score 3 · Answer 4 · 2011-07-06

In brief, I suggest:

Align your sequences to one or more sequences with a solved structure and use that structure to guide and add quality to the alignment. Yes, a high-quality alignment is necessary in order to give you a tree in which you have higher confidence, as Michael stated.

Trim the alignment to the portion of the sequences that align well with the solved structure. This can help to remove columns of dubious quality, as mentioned by Lyco.

Keep the gaps - I agree with Dave.

Build your tree.

Repeat the above using different trimming, aligning algorithms. In the lab, you'd run the experiment a few times to see that results are repeated. Same holds for bioinformatics - and it is often easy to run and rerun.
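A quick way to check whether repeated runs agree is to compare the clades of the resulting trees. The sketch below is a simplification: it treats the trees as rooted and uses a minimal Newick reader that ignores branch lengths and support values and does not handle quoted labels. The two trees are invented examples; for real work, a phylogenetics library that computes Robinson-Foulds distances would be a safer choice.

```python
def clades(newick):
    """Return the set of leaf-name sets, one per internal node of a rooted tree."""
    stack = []           # one set of leaf names per currently open parenthesis
    found = set()
    token = ""           # text collected since the last structural character
    after_close = False  # True while token is an internal-node label, not a leaf

    for ch in newick.strip().rstrip(";"):
        if ch in "(),":
            name = token.split(":")[0].strip()
            if name and not after_close and stack:
                stack[-1].add(name)
            token = ""
            after_close = False
            if ch == "(":
                stack.append(set())
            elif ch == ")":
                leaves = stack.pop()
                found.add(frozenset(leaves))
                if stack:
                    stack[-1].update(leaves)
                after_close = True
        else:
            token += ch
    return found


def clade_differences(newick_a, newick_b):
    """Clades found only in the first tree, and only in the second."""
    a, b = clades(newick_a), clades(newick_b)
    return a - b, b - a


# Invented trees, e.g. from two different trimming settings.
tree_untrimmed = "(((A:0.1,B:0.2)0.9:0.3,C:0.4),D:0.5);"
tree_trimmed = "(((A,C),B),D);"
only_first, only_second = clade_differences(tree_untrimmed, tree_trimmed)
print(sorted(sorted(c) for c in only_first))
print(sorted(sorted(c) for c in only_second))
```

If both sets are empty, the two runs recovered the same rooted topology; any clade listed is one that depends on your trimming or alignment choice.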

score 2 · Answer 5 · 2011-07-06

A phylogenetic tree is only meaningful if the alignment you use as input contains only homologous residues. One heuristic to achieve this is the BLOCKS program, which basically looks for regions of very high similarity to determine the regions that are probably homologous. When you download the program, you can also set a threshold for how many gaps to retain... perhaps the default values are a good indication.

score 2 · Answer 6 · 2011-07-06

Some columns with gaps you want to retain, some you want to remove. It isn't the presence of gaps itself that is the problem; indels are informative when they are present in more than one taxon, after all. What you want to do is remove ambiguous and poorly aligned regions. If you are in doubt as to the homology of one character in a position of your alignment to another character in another sequence, it is best not to treat them as homologous.

A few solutions have been proposed, such as Gblocks and trimAl. In my (and my group's) experience, Gblocks sort of sucks except in the obvious case.

One program that is worth trying is MANUEL. MANUEL is an SVM-based masking tool where the training set is an expertly curated set of multiple alignments in which manual masking of the alignments was done by two different researchers. Very good performance. If you use FSA or HMMER3 to generate alignments, they output confidence scores for columns, which can be used in combination with gap percentage as guides to which columns to remove if you are hand masking.
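As a sketch of combining the two guides, the snippet below flags columns whose confidence is low or whose gap fraction is high. It assumes the per-column confidence scores have already been parsed into a list of floats between 0 and 1; the parsing step, the toy data, and both thresholds are hypothetical and would need tuning for a real alignment.

```python
def columns_to_mask(seqs, column_confidence, min_confidence=0.9, max_gap_fraction=0.5):
    """Return indices of columns that look unreliable.

    seqs: list of aligned sequences of equal length.
    column_confidence: one score per column (assumed to be in [0, 1]).
    A column is flagged if its score is below min_confidence or if more
    than max_gap_fraction of its characters are gaps.
    """
    columns = list(zip(*seqs))
    if len(columns) != len(column_confidence):
        raise ValueError("need exactly one confidence value per column")
    flagged = []
    for i, (col, conf) in enumerate(zip(columns, column_confidence)):
        gap_frac = col.count("-") / len(col)
        if conf < min_confidence or gap_frac > max_gap_fraction:
            flagged.append(i)
    return flagged


# Invented protein alignment and invented confidence scores.
seqs = ["MKT-LV", "MKTALV", "MR--LV", "MK--LI"]
confidence = [0.99, 0.95, 0.60, 0.40, 0.97, 0.92]
print(columns_to_mask(seqs, confidence))
```

The flagged indices are only suggestions for hand masking; you would still look at each region before deleting it.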

score 1 · Answer 7 · 2011-07-06

13.3 years ago

Vitis ★ 2.5k

Usually, yes. You want to use the actual character evolution to build a phylogeny. Then you can code the gaps into binary codes to use them as additional information if you really want to incorporate them.

ADD COMMENT • link 13.3 years ago by Vitis ★ 2.5k

1

Thanks for the answer vitis! I don't understand, however, what "code the gaps into binary codes" really means... Can you explain a little bit more please?

ADD REPLY • link 13.3 years ago by Panos ★ 1.8k

1

For example, you can code the presence of a gap shared by several taxa as '1's and its absence as '0's. Sometimes, they are 'synapomorphies' that can specify clades.
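A minimal sketch of what such coding could look like, assuming each distinct gap (same start and end position) is treated as one presence/absence character. This is a simplification: published indel-coding schemes also deal with overlapping gaps and inapplicable states, which are ignored here. The toy alignment is invented.

```python
import re


def gap_intervals(seq):
    """Return the set of (start, end) index pairs of gap runs in seq."""
    return {(m.start(), m.end()) for m in re.finditer(r"-+", seq)}


def code_gaps_as_binary(seqs):
    """Build a presence/absence matrix with one character per distinct gap.

    seqs: dict mapping taxon name -> aligned sequence.
    Returns (sorted list of gap intervals, dict taxon name -> '0'/'1' string).
    """
    per_taxon = {name: gap_intervals(seq) for name, seq in seqs.items()}
    all_gaps = sorted(set().union(*per_taxon.values()))
    matrix = {
        name: "".join("1" if gap in gaps else "0" for gap in all_gaps)
        for name, gaps in per_taxon.items()
    }
    return all_gaps, matrix


toy = {
    "taxonA": "ATG---CAT",
    "taxonB": "ATG---CAT",
    "taxonC": "ATGGCACAT",
    "taxonD": "AT-GCACAT",
}
gaps, matrix = code_gaps_as_binary(toy)
print(gaps)
for name, row in matrix.items():
    print(name, row)
```

Here the three-column gap shared by taxonA and taxonB becomes a character that groups those two taxa, while the single-taxon gap in taxonD carries no grouping information.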

ADD REPLY • link 13.3 years ago by Vitis ★ 2.5k

score 1 · Answer 8 · 2011-12-09

I usually don't remove all "gap columns" but only those which have a gap in more than 50% of all sequences. Furthermore, as others have pointed out, it is best to play with different options (e.g. remove all gap columns, those with gaps in more than 30%, 50%, 70%, or none). If you get different answers to the question you are trying to answer with the phylogenetic analysis, this might be considered a "red flag".
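A small sketch of that threshold sweep, assuming the alignment is already loaded as a dict of equal-length strings (the toy alignment below is invented). A column is dropped when its gap fraction is above the threshold, so 0.0 removes every gap-containing column and 1.0 removes none.

```python
def gap_fraction(column):
    """Fraction of characters in a column that are gap symbols."""
    return sum(ch in "-." for ch in column) / len(column)


def drop_gappy_columns(seqs, max_gap_fraction):
    """Keep only columns whose gap fraction is <= max_gap_fraction.

    seqs: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(seqs)
    columns = list(zip(*(seqs[n] for n in names)))
    kept = [col for col in columns if gap_fraction(col) <= max_gap_fraction]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


toy = {
    "taxonA": "ATG-CA-T",
    "taxonB": "ATG-CAGT",
    "taxonC": "AT--CA-T",
    "taxonD": "ATGACA-T",
}

for threshold in (0.0, 0.3, 0.5, 0.7, 1.0):
    trimmed = drop_gappy_columns(toy, threshold)
    length = len(next(iter(trimmed.values())))
    print(f"max gap fraction {threshold:.1f}: {length} columns kept")
```

Each trimmed alignment can then go through the same tree-building step, and the resulting trees compared for the clades you care about.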