Question: Phylogenetic Analysis Of Whole Genomes
10.5 years ago by
Aparna120 wrote:

Hi, can anyone tell me the name of software for aligning whole genomes and constructing a phylogenetic tree from them? Thanks in advance.

phylogenetics tree • 21k views
Modified 21 months ago by rm.umayal240 • written 10.5 years ago by Aparna120

I think you need to elaborate on what exactly you are trying to accomplish. Are you trying to make a species tree or gene trees? How many genomes are you starting from? Are they prokaryotic or eukaryotic genomes?

Written 10.5 years ago by Lars Juhl Jensen11k

I need a species tree containing 24 species, all of them prokaryotes.

Written 10.5 years ago by Aparna120
10.5 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

I am not aware of an easy way to construct reliable species trees based on complete genomes. The general approach is to pick one or more genes on which to base your phylogeny. This could be 16S rRNA, all ribosomal-protein-coding genes, or other highly conserved genes that are universally present and rarely subject to gene duplication or lateral gene transfer.

Once you have picked the genes, you need to make a multiple sequence alignment for each gene that you want to use for your phylogeny. For this I would tend to use either muscle or mafft. After that I would use Gblocks to extract the conserved blocks from each alignment, so that potentially misaligned regions are not used as the basis for tree building.
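To make the idea of trimming unreliable alignment columns concrete, here is a minimal Python sketch. It is not the Gblocks algorithm (which also considers conservation and block length); it is a much simpler stand-in that only drops columns with too many gaps. The function name, the dictionary layout and the 0.5 threshold are assumptions made for illustration.

```python
from typing import Dict


def drop_gappy_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.5) -> Dict[str, str]:
    """Keep only columns whose fraction of gap characters is <= max_gap_fraction.

    `alignment` maps a sequence name to its aligned sequence (hypothetical layout).
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        gaps = sum(1 for name in names if alignment[name][col] == "-")
        if gaps / len(names) <= max_gap_fraction:
            keep.append(col)

    # Rebuild every sequence from the retained columns only.
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


# Toy protein alignment: the third column is a gap in two of three sequences.
example = {
    "sp1": "MK-TAYI",
    "sp2": "MK-TAYI",
    "sp3": "MKQ-AYL",
}
filtered = drop_gappy_columns(example, max_gap_fraction=0.5)
```

For real data a dedicated trimming tool is the safer choice; the sketch only shows what "removing poorly aligned columns" means in practice.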

If you decided to use multiple genes as the basis for your phylogeny, you now have to make a big decision, namely whether to go for a concatenated alignment approach or a supertree approach. In the first case, you would concatenate all of the multiple alignments and use the resulting big alignment as input for a phylogenetic tree reconstruction program, for example PhyML. In the second case, you would use such a program to make a separate tree for each of the genes of interest, and subsequently use one of several supertree programs to derive a consensus tree based on these. If you went for just using a single gene as the basis for your tree, you obviously just build a tree for that one gene and you are done.
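The concatenation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each per-gene alignment is held as a dictionary from species name to aligned sequence, and a species missing from one gene is padded with gaps. The function name and the toy sequences are made up for the example.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments side by side into one alignment per species."""
    taxa = sorted({name for aln in alignments for name in aln})
    parts: Dict[str, List[str]] = {name: [] for name in taxa}

    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for name in taxa:
            # Pad species that lack this gene with gaps of the right width.
            parts[name].append(aln.get(name, "-" * width))

    return {name: "".join(chunks) for name, chunks in parts.items()}


gene_a = {"sp1": "ATGC", "sp2": "ATGA", "sp3": "ATGT"}
gene_b = {"sp1": "GG-C", "sp3": "GGAC"}  # sp2 lacks this gene
supermatrix = concatenate_alignments([gene_a, gene_b])
```

The resulting dictionary can be written out as FASTA or PHYLIP and passed to the tree-building program. The supertree route has no equally short equivalent, since it needs a separate tree per gene first.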

I hope this helps, although it is certainly very far from a "push of a button" solution.

Modified 2.4 years ago by _r_am32k • written 10.5 years ago by Lars Juhl Jensen11k

Depends a bit on what you want to do, but as long as the 24 genomes are not too far apart, I agree that 16S rRNA is a good choice. If one wants to attempt to resolve very deep-branching parts of the tree, I believe you need a multi-locus approach to get enough information to be able to do much. But in that case using just 24 genomes would be unlikely to work anyway.

Written 10.5 years ago by Lars Juhl Jensen11k

Just don't use any of the alignment software suggested; try something with a "profile"-based alignment or something geared to rRNA.

Written 10.5 years ago by Paulo Nuin3.7k

I would recommend ssu-align for 16S multiple sequence alignment. It uses a 16S HMM.

Written 3.0 years ago by Eli Korvigo180

+1 for 16S rRNA instead of whole genome

Written 10.5 years ago by Michael Schubert7.0k

+1 and agree on 16S, all other genes will lead to a sort of 'non-standard' approach.

Written 10.5 years ago by Michael Dondrup48k

@Paulo, good point. I completely agree that if you want to do rRNA alignment you should use dedicated, profile-based tools. The alignment tools were meant as suggestions for how to make multiple alignments of protein-coding genes.

Written 10.5 years ago by Lars Juhl Jensen11k

I would try the "fasttree" program; it gives results comparable to PhyML but is much faster, which would be beneficial on a genome-wide scale. In any case, if you use multiple loci from whole genomes for phylogeny reconstruction, there should be only a very small difference between programs. And if you have whole genome sequences available, do not rely on 16S rRNA alone but take as much sequence data as possible into account.

Written 10.0 years ago by Peter90

Could you give a recommendation for a "supertree" program? I have trees built from genotypes from individual chromosomes and I want to generate a consensus tree.

Written 9.6 years ago by User 387550
10.5 years ago by
Gainesville, FL
Science_Robot1.1k wrote:

RAxML: I'm not sure, but I think this program works for entire genomes and is supposed to be very fast:

Results: In this paper we present the latest release of our program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees which allows for computation of 1.000-taxon trees in less than 24 hours on a single PC processor. We compare RAxML-III to the currently fastest implementations for maximum likelihood and bayesian inference: PHYML and MrBayes. Whereas RAxML-III performs worse than PHYML and MrBayes on synthetic data it clearly outperforms both programs on all real data alignments used in terms of speed and final likelihood values.

Modified 2.4 years ago by _r_am32k • written 10.5 years ago by Science_Robot1.1k

Like PhyML and MrBayes, RAxML takes a multiple sequence alignment as input and uses maximum-likelihood to infer an evolutionary tree. It is thus not a tool to which you can just give a bunch of genomes and get a tree; you would first have to make, for example, a 16S rRNA alignment or a concatenated ribosomal protein alignment.

Written 10.5 years ago by Lars Juhl Jensen11k

MrBayes is not an ML method. It's based on Bayesian inference.

Written 3.0 years ago by Eli Korvigo180
3.0 years ago by
Russian Federation
Eli Korvigo180 wrote:

Genome-scale multiple sequence alignments are not well suited to phylogenies: they take a lot of time to compute and are never accurate. Moreover, it's hard to imagine a general-purpose sequence evolution model that would be equally adequate for protein-coding genes, rRNA and tRNA genes, repeats and other regions. Picking a subset of genes manually is not a good option either, because you will lose a lot of phylogenetic resolution. I would thus recommend building a tree based on all orthologous genes, which is the most common thing to do as far as I can tell. Here is a general pipeline:

  1. Annotate your genomes using Prokka (for prokaryotes) or another tool;
  2. Find one-to-one protein-coding orthologs using OrthoFinder or OrthoMCL;
  3. Run multiple sequence alignments (MSAs) for each group (any MSA tools will do, but I prefer mafft);
  4. Filter each MSA using Gblocks;
  5. Merge filtered alignments (I use Python for that, but I'm pretty sure there are some tools that don't require programming skills);
  6. Use RAxML (maximum likelihood) or BEAST (Bayesian inference) to infer the phylogeny.
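As an illustration of what "one-to-one orthologs" means in step 2, the following Python sketch keeps only the orthologous groups that contain exactly one gene from every genome. The input layout (group ID, then genome name, then a list of gene IDs) is a hypothetical in-memory structure chosen for the example; it is not the output format of OrthoFinder or OrthoMCL, whose result files would have to be parsed into it first.

```python
from typing import Dict, List


def single_copy_groups(orthogroups: Dict[str, Dict[str, List[str]]],
                       genomes: List[str]) -> Dict[str, Dict[str, str]]:
    """Return groups with exactly one gene in each of the listed genomes."""
    selected: Dict[str, Dict[str, str]] = {}
    for group_id, members in orthogroups.items():
        # A missing genome counts as zero genes, so the group is rejected.
        if all(len(members.get(genome, [])) == 1 for genome in genomes):
            selected[group_id] = {genome: members[genome][0] for genome in genomes}
    return selected


genomes = ["A", "B", "C"]
orthogroups = {
    "OG1": {"A": ["a1"], "B": ["b1"], "C": ["c1"]},        # one-to-one
    "OG2": {"A": ["a2", "a3"], "B": ["b2"], "C": ["c2"]},  # duplicated in A
    "OG3": {"A": ["a4"], "B": ["b3"]},                     # absent from C
}
core = single_copy_groups(orthogroups, genomes)
```

Each retained group then gets its own alignment in step 3, and the filtered alignments are merged in step 5.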
Modified 3.0 years ago • written 3.0 years ago by Eli Korvigo180
10.5 years ago by
Hull, UK
Dave Lunt2.0k wrote:

Hi, the approach at MicrobesOnline looks interesting. If the 24 species genomes are public and high quality, their phylogenetic positions may already be there for you (click on "Species Tree"). If they are unpublished genomes, the site also allows you to host data privately, although I am only assuming that you would then be able to add them to the existing data sets; I don't know for sure.

The trees are made from 78 protein-coding loci, so not "whole genomes", but the difference is probably trivial for most species.

Modified 2.4 years ago by _r_am32k • written 10.5 years ago by Dave Lunt2.0k
10.0 years ago by
São Paulo, Brazil
Jarretinha3.3k wrote:

Aligning whole genomes is quite a delicate task, and it is a pain to parse a lot of different output formats until a measure of distance/similarity emerges. Good aligners are MUMmer and Mauve. I really like Mauve; I used it to play with a lot of genomes from different strains of E. coli. That's the advantage of whole-genome comparison: you can find a "species" tree even when 16S says that the distance is zero.

For the phylogeny part of the work, you can use RAxML, as some folks here have said. For a high number of taxa it is the fastest one on the road. In your case a more precise approach is feasible, so you can use ERATE, which is Sean Eddy's version of DNAML from Phylip. It can deal with indels, and I recommend it even in the 16S case.

But, if you really don't wanna suffer, just check the Genome-To-Genome Distance Calculator service and choose your own setup. After getting the distances, just use Clearcut to generate a NJ tree. Fast and cheap! Not very accurate if you work with very divergent species.
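For readers who want to see what happens between "distances" and "NJ tree", here is a minimal Python sketch of the textbook neighbor-joining algorithm. It is not Clearcut's own code, and it is far slower than a dedicated tool; the function name, the matrix layout and the toy distances are assumptions made for illustration. Neighbor joining yields an unrooted tree, so the placement of the final split in the Newick string is arbitrary.

```python
from itertools import combinations
from typing import Dict, List


def neighbor_joining(names: List[str], dist: List[List[float]]) -> str:
    """Build a Newick string from a symmetric distance matrix (textbook NJ)."""
    # d[a][b] is the current distance between clusters a and b,
    # where each cluster is identified by its Newick label.
    d: Dict[str, Dict[str, float]] = {
        a: {b: float(dist[i][j]) for j, b in enumerate(names) if i != j}
        for i, a in enumerate(names)
    }

    while len(d) > 2:
        n = len(d)
        totals = {a: sum(row.values()) for a, row in d.items()}

        def q(pair):
            # Q(a, b) = (n - 2) * d(a, b) - r_a - r_b; the minimum is joined.
            a, b = pair
            return (n - 2) * d[a][b] - totals[a] - totals[b]

        a, b = min(combinations(d, 2), key=q)
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"

        # Distances from the new internal node to every remaining cluster.
        others = [c for c in d if c not in (a, b)]
        new_row = {c: 0.5 * (d[a][c] + d[b][c] - d[a][b]) for c in others}
        del d[a], d[b]
        for c in others:
            del d[c][a], d[c][b]
            d[c][new] = new_row[c]
        d[new] = new_row

    # Two clusters remain; split their connecting branch evenly (arbitrary).
    a, b = d
    half = d[a][b] / 2
    return f"({a}:{half:.4f},{b}:{half:.4f});"


# Toy distances between five genomes (made-up values).
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(names, matrix)
```

With 24 genomes the cubic running time of this sketch is no problem; for thousands of taxa a dedicated implementation is the practical choice.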

Modified 2.4 years ago by _r_am32k • written 10.0 years ago by Jarretinha3.3k
10.0 years ago by
United Kingdom
Adam Witney10 wrote:

Which program would be a more modern and better alternative to Phylip PARS for clustering 0/1 data representing presence/absence of genes amongst multiple strains of bacteria?

Written 10.0 years ago by Adam Witney10

Adam: don't open new questions inside another discussion. Open a new thread instead, otherwise nobody will be able to answer you.

Written 10.0 years ago by Giovanni M Dall'Olio27k

I was actually following on from Dave Lunt's comment that said there are better alternatives to Phylip now, but maybe I put the question in the wrong place (should have been a comment on his comment). Thanks

Written 10.0 years ago by Adam Witney10
6.1 years ago by
Wellcome Sanger Institute
Chrispin Chaguza260 wrote:

It's a very old post but I thought I could add to it to help others who might want to do a similar analysis i.e. create phylogenies from whole genomes for prokaryotic species. I have created a basic analysis pipeline that tries to simplify the process of creating phylogenetic trees at species level using only the conserved (otherwise known as the core) genomic content of all the 'bacterial' species. The steps used are described and the script is available at

Modified 15 months ago by _r_am32k • written 6.1 years ago by Chrispin Chaguza260

Hi. Why don't you put it on GitHub?

Written 6.1 years ago by geek_y11k

At first glance, it appears that your pipeline is what community microbiologists/metagenomics people do as a day-to-day part of a standard analysis. How does yours differ from established pipelines/workflows in the currently published literature?

(Also, to respond to the other comment: It's on SourceForge as an SVN repository.)

Modified 6.1 years ago • written 6.1 years ago by Brice Sarver3.6k

Sounds like a nice tool. If I align 100 genomes with Mauve and build a tree from the whole-genome alignment, and on the other hand use your tool, how much do you think the results would differ?

Modified 15 months ago by _r_am32k • written 6.0 years ago by HG1.1k
3.0 years ago by
ofanoyi120 wrote:

This tool may work.

Written 3.0 years ago by ofanoyi120
21 months ago by
rm.umayal240 wrote:

The VCF2PopTree software would be helpful if you are constructing a phylogenetic tree from a VCF or SNP file. It can read even human genome data. It is so cool, and it does not need any dependencies.

The software link is as follows:

Written 21 months ago by rm.umayal240