Question

How are indels and 'unknown' bases treated in ML phylogenetic tree programs?

9.0 years ago

SemiQuant ▴ 80

I'm trying to build ML phylogenetic trees, mainly with RAxML (GTR model), and I can't find much information on how this and other programs treat indels, 'unknown' (N) bases, or low-confidence (lower-case) bases.

I know that if a column in the alignment consists entirely of missing data (-), RAxML will discard it. But what about regions where one isolate out of, say, 100 has an insertion, shown as "-" in the other isolates? And if two isolates shared that insertion, would the program then regard it as 'correct'? I have read some papers that show

If anyone could offer insight into this it would be appreciated.

RaxMl phylogenetics • 3.3k views

updated 14 months ago by Ram 43k • written 9.0 years ago by SemiQuant ▴ 80

score 2 · Answer 1 · 2015-05-05

Strictly speaking, missing data is denoted by '?', whereas gaps (i.e., indels) are denoted by '-'. Most major phylogenetic software packages treat the two as equivalent.

Handling varies among programs, but the most common treatments include:

1) Removing the site completely. This is uncommon, but some programs have per-site missing data cutoffs you can specify.

2) Ignoring them. My understanding is that the affected characters then contribute nothing to that site's likelihood.

3) Treating the character as ambiguous (i.e., an N) and averaging over all possible states. This is how some programs handle other IUPAC ambiguity codes as well.

4) Treating gaps as a fifth character state.
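
To make treatments 1 and 3 concrete, here is a minimal Python sketch. It is not how RAxML or any other package is actually implemented. It assumes a JC69 model, a two-taxon "tree" with a single branch of length t, and that '-', '?', and N are all read as fully ambiguous; the function names (`tip_vector`, `site_likelihood`, `drop_sparse_columns`) and the `max_missing` cutoff are made up for illustration.

```python
import math

STATES = "ACGT"

# IUPAC code -> compatible nucleotides. Reading '-', '?' and N as fully
# ambiguous is an assumption of this sketch (treatment 3 in the list above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "ACGT", "?": "ACGT",
}


def tip_vector(char):
    """Conditional likelihoods at a tip: 1.0 for every compatible state."""
    # Lower-case bases are read as upper-case (also an assumption here).
    allowed = IUPAC[char.upper()]
    return [1.0 if s in allowed else 0.0 for s in STATES]


def jc69_prob(i, j, t):
    """JC69 probability of state i changing to state j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(char_a, char_b, t):
    """Likelihood of one alignment column for two taxa at distance t."""
    va, vb = tip_vector(char_a), tip_vector(char_b)
    total = 0.0
    for i in range(4):
        for j in range(4):
            # 0.25 is the JC69 equilibrium frequency of state i.
            total += 0.25 * va[i] * jc69_prob(i, j, t) * vb[j]
    return total


def drop_sparse_columns(seqs, max_missing=1.0):
    """Treatment 1: drop columns whose share of '-', '?', N reaches max_missing."""
    keep = [
        c for c in range(len(seqs[0]))
        if sum(s[c].upper() in "-?N" for s in seqs) / len(seqs) < max_missing
    ]
    return ["".join(s[c] for c in keep) for s in seqs]
```

Under these assumptions, a gap paired with an A gives `site_likelihood("A", "-", t) == 0.25` for any t, so the gapped sequence says nothing about the branch length at that site. A column made up entirely of gaps has likelihood 1.0 (log-likelihood 0), which is why such a column can be dropped without changing the result. Treatment 4 would instead need a five-state model with '-' as its own state, which this sketch does not attempt.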

With respect to RAxML, this question has been brought up a couple times on the Google group. I would consult the manual or the paper for your program of interest to really nail down what's going on under the hood.