Question: Building A Phylogenetic Tree
8.5 years ago, Mark wrote:

Dear All,

What I am stating here is pretty much a textbook problem so please do not be offended if you find it too trivial. I have never been into phylogenetics but at the moment, I need to make a phylogenetic tree for a gene family of interest.

I extracted the sequences for this gene from different bacterial genomes, aligned the sequences and now I am ready to start. I looked at the available programs and there are just too many of them. For some reason, I liked PHYLIP. So I started to use it.

What I propose to do is the following: use the alignment to build a tree with distance-based methods (both NJ and UPGMA) and see whether they produce the same tree. If they do, I will be happy to use that tree. If not, I might need to explore parsimony- or likelihood-based methods. However, from what I have read so far, I understand that these methods take longer to run and sometimes do not even converge. I did try the dnapenny program from the PHYLIP package and got the message "search broken off" after the program had been running for at least a couple of hours. Any feedback or suggestions on my problem will be highly appreciated.

Thanks and regards, Mark.

phylogenetics tree • 4.1k views
8.5 years ago, Leonor Palmeira (Liège, Belgium) wrote:

Here is a little background on phylogenetic inference methods. First of all, I would say there are four big types of inference methods (roughly in their historical order):

  • Maximum parsimony methods
  • Distance-based methods
  • Maximum likelihood methods
  • Bayesian methods

Nowadays, the gold standard is maximum likelihood, along with Bayesian methods, especially because they implement probabilistic models of evolution that can be made more complex to model evolution more accurately (there are many papers on this; I could link you to a few if you are interested). The main issues with the other methods are (i) consistency and (ii) robustness. See this paper or this website for some insights. I would, for instance, never use UPGMA (inconsistencies) or parsimony (problems with high substitution rates or with long branches).

You can use PhyML (maximum likelihood), which is very fast. It can be run from within SeaView, which I find quite useful.


Thanks Leonor. I will look at PhyML and see if I need to run it. Cheers

written 8.5 years ago by Mark

I always go for RAxML for ML and PhyloBayes for Bayesian inference. FastTree is excellent too if you just want a general idea of the topology.

written 6.9 years ago by 5heikki
8.5 years ago, Stefano Berri (Cambridge, UK) wrote:

Some information you may want to provide:

Nucleic or protein sequences? How many sequences do you have? How long are they? What is the typical similarity? Do they contain a particular motif/domain?

Regardless of the method you use, it is very likely that any two methods will give you two different results, unless there are very few sequences or they are very "easy". In any case, you won't be able to say which one is correct. The usual approach is to do some bootstrapping (PHYLIP has a program that does it): produce 100 or 1000 resampled datasets, run the same programs on each of them, and then see what the consensus is. You will then be able to say how much you "trust" your phylogenetic tree.
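To make the resampling step concrete, here is a minimal sketch of what one bootstrap replicate of an alignment looks like: alignment columns are drawn with replacement until the replicate has the original length. This is only an illustration in plain Python, not PHYLIP's own code; the function name `bootstrap_alignment` and the toy sequences are made up for the example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    `alignment` maps sequence names to aligned strings of equal length.
    Columns are sampled with replacement, so every sequence in the
    replicate is rebuilt from the same list of column indices.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "AGGTTC"}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each replicate would then go through the same tree-building program, and the resulting trees are summarised into a consensus. In PHYLIP, the seqboot and consense programs cover the resampling and consensus steps.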

Also, a lot of care and "manual" work needs to go into "cleaning" the multiple alignment: remove columns that are gaps in most of the sequences, and limit the alignment to regions of relative similarity. If there are MANY sequences (say > 100), you usually make a rough tree to find groups and then run the analysis within each group.
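A first pass at the gap cleaning can be automated before any manual inspection. The sketch below drops every column in which more than a chosen fraction of the sequences has a gap; the function name `drop_gappy_columns`, the 0.5 default threshold, and the toy sequences are assumptions for illustration, and it does not replace looking at the alignment by eye.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds the threshold.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    n = len(names)
    columns = zip(*(alignment[name] for name in names))
    # Keep a column only if gaps make up at most max_gap_fraction of it.
    kept = [
        col for col in columns
        if sum(ch in gap_chars for ch in col) / n <= max_gap_fraction
    ]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "AC-GT-", "seqB": "AC-GTA", "seqC": "A--GT-"}
cleaned = drop_gappy_columns(toy)
# Columns 3 and 6 are gaps in more than half of the sequences and are removed.
```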

Hope this helps


Thanks, Stefano, for your rather quick response. My alignment is a set of nucleotide sequences: it has ~500 sequences and is 1500 nt long. The similarity (at the protein level, since I did a BLASTP to find the homologs in the first place) is > 35%. I have started to look into the bootstrapping and consensus programs in PHYLIP. I suppose I can generate 1000 replicates, get a consensus tree for them, and see if the two results (NJ vs UPGMA) match. The overall alignment is OK, without too many "gapped regions", so I suppose I might not need to clean the alignment manually.

If there is something else that I should be wary of, please let me know. Cheers
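Checking whether two trees "match" can be made precise by comparing the splits (bipartitions of the taxa) that each tree implies. The sketch below counts the splits found in only one of the two trees, which is the Robinson-Foulds distance; a value of 0 means the unrooted topologies are identical. It assumes trees are written as nested Python tuples of leaf names rather than read from Newick files, and the function names and example trees are made up for illustration.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Return the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Skip splits that separate a single leaf or nothing at all.
        if len(clade) > 1 and len(other) > 1:
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))

# Hypothetical five-taxon trees: an unrooted NJ-style tree, a rooted
# UPGMA-style tree with the same branching pattern, and a different tree.
nj_tree = (("A", "B"), ("C", "D"), "E")
upgma_tree = ((("A", "B"), "E"), ("C", "D"))
other_tree = ((("A", "C"), "E"), ("B", "D"))

same = rf_distance(nj_tree, upgma_tree)       # 0: same unrooted topology
different = rf_distance(nj_tree, other_tree)  # 4: no shared splits
```

Because only splits are compared, where the root sits is ignored, which matters here since UPGMA produces a rooted tree and NJ an unrooted one.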

written 8.5 years ago by Mark

May I ask whether it is possible that a given conserved motif does not align at the required position for some of the sequences in a multiple sequence alignment? What is the solution in that case? I hope this is not too much of a digression. Thanks

written 8.5 years ago by Olivier