I am not aware of an easy way to construct reliable species trees from complete genomes. The general approach is to pick one or more genes on which to base your phylogeny. This could be the 16S rRNA gene, all ribosomal-protein-coding genes, or other highly conserved genes that are universally present and rarely subject to gene duplication or lateral gene transfer.
Once you have picked the genes, you need to make a multiple sequence alignment for each gene that you want to use for your phylogeny. For this I would tend to use either MUSCLE or MAFFT. After that I would use Gblocks to extract the conserved blocks from each alignment, so that potentially misaligned regions are not used as the basis for tree building.
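To make the idea of trimming unreliable alignment regions concrete, here is a minimal Python sketch. It is not the Gblocks algorithm, which selects conserved blocks using several criteria; it only drops columns in which too many sequences have a gap. The function name `filter_gappy_columns` and the `max_gap_fraction` parameter are hypothetical names chosen for illustration.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    This is a simplified stand-in for block-based filters, not Gblocks itself.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Indices of the columns that are kept.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment (made-up sequences): column 4 is a gap in two of three taxa.
example = {
    "taxonA": "ATG-CGT",
    "taxonB": "ATG-CGA",
    "taxonC": "ATGACG-",
}
filtered = filter_gappy_columns(example, max_gap_fraction=0.5)
for name, seq in filtered.items():
    print(name, seq)
```

In practice you would read and write FASTA files around this step and let a dedicated tool do the filtering; the sketch only shows what "removing poorly aligned columns" means.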
If you decided to use multiple genes as the basis for your phylogeny, you now have to make a big decision, namely whether to go for a concatenated alignment approach or a supertree approach. In the first case, you would concatenate all of the multiple alignments and use the resulting big alignment as input for a phylogenetic tree reconstruction program, for example PhyML. In the second case, you would use such a program to make a separate tree for each of the genes of interest, and subsequently use one of several supertree programs to derive a consensus tree based on these. If you went for just using a single gene as the basis for your tree, you obviously just build a tree for that one gene and you are done.
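The concatenation step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each gene alignment is already loaded as a dict from taxon name to aligned sequence, and a taxon missing from a gene is padded with gaps. The function name `concatenate_alignments` is hypothetical; the returned partition coordinates (1-based, inclusive) are the kind of information a tree-building program would need if you want a separate model per gene.

```python
def concatenate_alignments(alignments, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts, each mapping taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds the 1-based,
    inclusive (start, end) column range of each gene in the supermatrix.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are padded with the missing character.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Two toy gene alignments (made-up data); taxon B lacks gene 2, C lacks gene 1.
gene1 = {"A": "ATG", "B": "ATC"}
gene2 = {"A": "GG-A", "C": "GGTA"}
supermatrix, partitions = concatenate_alignments([gene1, gene2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```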
I hope this helps, although it is certainly very far from a "push of a button" solution.
RAxML: I'm not sure, but I think this program works for entire genomes and is supposed to be very fast. From the paper's abstract:
Results: In this paper we present the latest release of our program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees which allows for computation of 1.000-taxon trees in less than 24 hours on a single PC processor. We compare RAxML-III to the currently fastest implementations for maximum likelihood and bayesian inference: PHYML and MrBayes. Whereas RAxML-III performs worse than PHYML and MrBayes on synthetic data it clearly outperforms both programs on all real data alignments used in terms of speed and final likelihood values.
Genome-scale multiple sequence alignments are not well suited to phylogenies: they take a long time to compute and are rarely accurate along their whole length. Moreover, it's hard to imagine a general-purpose sequence evolution model that would be equally adequate for protein-coding genes, rRNA and tRNA genes, repeats and other regions. Picking a subset of genes manually is not a great option either, because you will lose a lot of phylogenetic resolution. I would thus recommend building a tree based on all orthologous genes, which is the most common thing to do as far as I can tell. Here is a general pipeline:
- Annotate your genomes using Prokka (for prokaryotes) or another tool;
- Find one-to-one protein-coding orthologs using OrthoFinder or OrthoMCL;
- Run multiple sequence alignments (MSAs) for each group (any MSA tool will do, but I prefer MAFFT);
- Filter each MSA using Gblocks;
- Merge filtered alignments (I use Python for that, but I'm pretty sure there are some tools that don't require programming skills);
- Use RAxML (maximum likelihood) or BEAST (Bayesian inference) to infer the phylogeny.
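As an illustration of the "one-to-one orthologs" step in the pipeline above, here is a small Python sketch that keeps only the orthogroups with exactly one gene in every genome. The input layout (a dict of per-species gene counts) and the function name `single_copy_orthogroups` are assumptions made for the example; the real output files of OrthoFinder or OrthoMCL would have to be parsed into this shape first.

```python
def single_copy_orthogroups(gene_counts, species):
    """Return orthogroups that have exactly one gene in every species.

    gene_counts: dict mapping orthogroup ID -> {species: number of genes}.
    species: list of all species that must be represented.
    A species missing from an orthogroup's counts is treated as 0 genes.
    """
    return sorted(
        orthogroup
        for orthogroup, counts in gene_counts.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


# Made-up counts for three genomes (hypothetical IDs and names).
species = ["genome1", "genome2", "genome3"]
gene_counts = {
    "OG1": {"genome1": 1, "genome2": 1, "genome3": 1},  # single copy everywhere
    "OG2": {"genome1": 2, "genome2": 1, "genome3": 1},  # duplicated in genome1
    "OG3": {"genome1": 1, "genome2": 1},                # absent from genome3
    "OG4": {"genome1": 1, "genome2": 1, "genome3": 1},
}
selected = single_copy_orthogroups(gene_counts, species)
print(selected)
```

Only the selected groups would then go on to the alignment, filtering and merging steps.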
Hi, the approach at MicrobesOnline looks interesting. If the 24 species genomes are public and high quality, their phylogenetic positions may already be there for you (click on "Species Tree"). If they are unpublished genomes, MicrobesOnline also allows you to host data privately, although I am only assuming that you would then be able to add them to the existing data sets; I don't know for sure.
The trees are made from 78 protein-coding loci, so not "whole genomes", but the difference is probably trivial for most species.
Aligning whole genomes is quite a delicate task, and it is a pain to parse many different output formats before a measure of distance/similarity emerges. Good aligners are MUMmer and Mauve. I really like Mauve; I used it to play with a lot of genomes from different strains of E. coli. That's the advantage of whole-genome comparison! You can find a "species" tree even when 16S says that the distance is zero.
For the phylogeny part of the work, you can use RAxML, as others here have said. For a high number of taxa it is the fastest one on the road. In your case a more precise approach is feasible, so you can use ERATE, which is Sean Eddy's version of DNAML from PHYLIP. It can deal with indels, and I recommend it even in the 16S case.
But if you really don't want to suffer, just check the Genome-To-Genome Distance Calculator service and choose your own setup. After getting the distances, just use Clearcut to generate an NJ (neighbor-joining) tree. Fast and cheap! It is not very accurate if you work with very divergent species, though.
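To show what building an NJ tree from a distance matrix involves, here is a minimal Python sketch of the textbook neighbor-joining algorithm. It is not Clearcut's implementation and makes no assumption about the output format of the distance service: the distances are passed in as a plain dict keyed by pairs of taxon names, and the function name `neighbor_joining` and the toy values are made up for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor joining; returns a Newick string.

    names: list of taxon labels (at least two).
    dist: dict mapping frozenset({a, b}) -> distance between a and b.
    NJ trees are unrooted; the final edge is split in half only so that
    a valid Newick string can be written.
    """
    if len(names) < 2:
        raise ValueError("need at least two taxa")
    nodes = list(names)
    d = dict(dist)  # work on a copy; joined nodes are added under their Newick label

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all others.
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising the Q criterion (first pair wins ties).
        best = None
        for a, b in combinations(nodes, 2):
            q = (n - 2) * get(a, b) - r[a] - r[b]
            if best is None or q < best[0]:
                best = (q, a, b)
        _, a, b = best
        dab = get(a, b)
        # Branch lengths from the new internal node to a and b.
        la = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        lb = dab - la
        new = f"({a}:{la:.4f},{b}:{lb:.4f})"
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (get(a, c) + get(b, c) - dab)
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    a, b = nodes
    half = get(a, b) / 2
    return f"({a}:{half:.4f},{b}:{half:.4f});"


# Toy additive distances for four taxa (made-up values).
pairs = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
dist = {frozenset(pair): value for pair, value in pairs.items()}
tree = neighbor_joining(["A", "B", "C", "D"], dist)
print(tree)
```

For real data sets a dedicated program is the better choice; the sketch just makes the distance-to-tree step less of a black box.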
It's a very old post, but I thought I could add to it to help others who might want to do a similar analysis, i.e. create phylogenies from whole genomes for prokaryotic species. I have created a basic analysis pipeline that tries to simplify the process of creating species-level phylogenetic trees using only the conserved (otherwise known as core) genomic content of all the 'bacterial' species. The steps used are described and the script is available at http://mcgp.sourceforge.net/
The VCF2PopTree software would be helpful if you are constructing a phylogenetic tree from a VCF or SNP file. It can even handle human genome data. It is so cool, and it has no dependencies.
The software link is as follows: http://sankarsubramanian.net/dat/index.html