Question: Automated Phylogenetic Tree Construction From Every Gene In A Genome?
5.3 years ago by sentausa (620) wrote:

Hi there :)

My colleague has been asked by his boss to construct a phylogenetic tree for each gene in the genome of a new bacterium that was sequenced in our lab. One simple way to do that, my colleague believes, is to BLAST each gene against GenBank to find similar sequences, build a multiple alignment from these similar sequences for each gene, and construct a tree for each gene. The problem is that the genome has around 2,000 genes, so we wonder if there is a way to automate this process.

I don't really understand why my colleague's boss wants those trees, but my colleague told me that they would like to know the exact phylogenetic position of this new bacterium. Do you think this is a wise way to do that? I don't see what they can conclude from 2,000 phylogenetic trees that might not share a topology. I thought about a consensus tree, but what if the trees contain different organisms? Or does he want to find lateral gene transfers?

If you have any ideas, comments, or experience with these kinds of problems, we would be really grateful to hear from you, especially about the tree construction part (it's what my colleague's boss asked for anyway).

phylogeny genome • 3.2k views
modified 2.5 years ago by Biostar ♦♦ (20) • written 5.3 years ago by sentausa (620)
5.3 years ago by qiyunzhu (420) wrote:

It's difficult to know the "exact phylogenetic position" of a bacterium. Because of massive horizontal gene transfer and other events, the relationships between the major bacterial groups are still uncertain. Many authors have proposed using a network instead of a tree to represent bacterial phylogeny. Simply making 2,000 trees won't give you much insight into the phylogeny of bacteria as a whole.

If the purpose is to study horizontal gene transfer, then making many trees may be a reasonable choice. You have the right idea: the best approach so far is to run batch BLAST and build trees from the BLAST results.

There are web databases of pre-computed orthologous groups. If your bacterium has already been well annotated, simply look it up and you will get an answer. Of course, for a well-studied bacterium the phylogeny has already been explored, too. I guess the bacterium your colleague is interested in may be a new one.

There are a few programs that perform batch BLAST and batch tree reconstruction, including PyPhy, PhyloGenie, and PhyloGena. You can check them out, although they may not be quite up to date. I am also currently writing a program for exactly the purpose you describe, and it is already working. If your colleague is interested, he/she may contact me for a direct solution.

==== update ====

If your colleague's boss really wants to build an accurate phylogeny, at the cost of considerable labor and time (which I think is necessary), the proper way is not to make 2,000 trees and eyeball them. Instead, an older solution is to merge these 2,000 trees into one master tree (called a "supertree"). A more recent and popular solution is to concatenate multiple genes and build a tree from this giant sequence alignment (called a "supermatrix").

You can read this paper for an example; the authors used 478 genes to build an accurate tree of Bartonella.
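As a rough illustration of the supermatrix idea, the sketch below concatenates per-gene alignments into one long alignment and records where each gene starts and ends, so that partitions can be defined later. The function name, the input layout (a dict of gene name to taxon-to-sequence dicts), and the choice to fill genes missing from a taxon with gap characters are all assumptions made for this example, not part of any particular tool.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: {gene_name: {taxon: aligned_sequence}} (hypothetical layout).
    Returns (supermatrix, partitions), where supermatrix is {taxon: sequence}
    and partitions is a list of (gene_name, start, end) with 1-based,
    inclusive column coordinates.
    """
    # Every taxon seen in any gene gets a row in the supermatrix.
    taxa = sorted({taxon for aln in gene_alignments.values() for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon lacking this gene is padded with the missing-data symbol.
            rows[taxon].append(aln.get(taxon, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in rows.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    genes = {
        "geneA": {"sp1": "MK-L", "sp2": "MKAL"},
        "geneB": {"sp1": "GG", "sp3": "GA"},
    }
    matrix, parts = concatenate_alignments(genes)
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

The partition list is kept because each gene may need its own substitution model when the concatenated alignment is analyzed.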

==== update ====

Here's a good review of the current methods of handling difficult phylogenies.

modified 5.3 years ago • written 5.3 years ago by qiyunzhu (420)

Thank you, and how can my colleague contact you?

written 5.3 years ago by sentausa (620)

I'm sorry, I couldn't understand what you mean by batch BLAST. Can those programs (as well as yours) run BLAST with multiple queries (the 2,000 genes/proteins) at the same time? I've checked PyPhy and PhyloGena, but I couldn't find any hint that they can do that. Thanks.

modified 5.3 years ago • written 5.3 years ago by sentausa (620)

They should be able to do batch BLAST. Check out the PhyloGena paper here, which contains a description of the pipeline. PyPhy may be a bit old, and the code isn't freely downloadable.

My program does precisely the job you describe. It takes a list of proteins (genes) as input, BLASTs each protein against the NCBI database, retrieves the returned hits and labels each with taxonomic information, and then runs batch phylogenetic reconstruction locally using the program of your choice (I use Neighbor-Joining because it's fast, but I also include a Maximum Likelihood option). You can also choose to realign and trim the sequences before making trees. In the end, the trees are labeled with organism names (instead of accession numbers), so users can simply open each tree and eyeball it. The results (2,000 analyses) can be summarized according to your specific requirements.
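For readers who want to see what such a batch workflow might look like in general, here is a minimal sketch. It is not the program described above. It assumes that the NCBI BLAST+ tools (`blastp`, `blastdbcmd`), MAFFT, and FastTree are installed and that a local protein database is available. FastTree (approximate maximum likelihood) is swapped in here purely for illustration. The function names, file names, and default parameters are hypothetical, and gene IDs are assumed to be safe to use as file names.

```python
import subprocess
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}; the ID is the first word of the header."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def plan_gene(gene_id, workdir, db="nr", evalue="1e-5", max_hits=50):
    """Return the steps for one gene as (argv, stdout_path) pairs; nothing is executed."""
    work = Path(workdir)
    query = work / f"{gene_id}.faa"
    hit_ids = work / f"{gene_id}.hits.txt"
    homologs = work / f"{gene_id}.homologs.faa"
    aligned = work / f"{gene_id}.aln.faa"
    tree = work / f"{gene_id}.nwk"
    return [
        # 1. Search the database and keep only the subject IDs of the hits.
        (["blastp", "-query", str(query), "-db", db, "-evalue", evalue,
          "-max_target_seqs", str(max_hits), "-outfmt", "6 sseqid",
          "-out", str(hit_ids)], None),
        # 2. Pull the hit sequences out of the same local database.
        (["blastdbcmd", "-db", db, "-entry_batch", str(hit_ids),
          "-out", str(homologs)], None),
        # 3. Align (MAFFT writes the alignment to stdout).
        (["mafft", "--auto", str(homologs)], aligned),
        # 4. Build a tree from the protein alignment (FastTree writes Newick to stdout).
        (["FastTree", str(aligned)], tree),
    ]


def plan_genome(fasta_text, workdir):
    """Build one plan per gene in a multi-FASTA file of predicted proteins."""
    return {gene_id: plan_gene(gene_id, workdir) for gene_id in read_fasta(fasta_text)}


def run_plan(steps):
    """Execute the planned steps in order, redirecting stdout where a path is given."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, stdout=handle, check=True)
```

A real run would also need to write each query sequence to its `.faa` file, remove duplicate hit IDs, and add the query itself to the homolog file so that the new bacterium appears in every tree.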

Your colleague may reach me by email: my information is available through my Google Scholar link.

modified 5.3 years ago • written 5.3 years ago by qiyunzhu (420)

Don't simply concatenate all of the genes together. My old PhD lab works fairly extensively with supermatrices, and it is important to consider the congruence/incongruence of the genes in your dataset.
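One rough way to look at congruence between two gene trees is to count the clades they do not share, a rooted, clade-based variant of the Robinson-Foulds distance. The sketch below assumes rooted trees written as nested Python tuples over the same set of taxa. The function names are hypothetical, and this is only an illustration of the idea, not the approach used by any lab mentioned in this thread.

```python
def clades(tree):
    """Return (all_leaves, internal_clades) for a rooted tree given as nested tuples.

    Leaves are strings; every tuple is an internal node. Each clade is a
    frozenset of the leaf names below that node. The root clade is excluded
    because it is shared by every tree on the same taxa.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves_below(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)
    return all_leaves, found


def rooted_rf(tree_a, tree_b):
    """Count clades present in exactly one of the two trees (0 means identical topology)."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa to be compared")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    gene1 = ((("A", "B"), "C"), "D")
    gene2 = ((("A", "C"), "B"), "D")
    print(rooted_rf(gene1, gene1))
    print(rooted_rf(gene1, gene2))
```

Genes whose trees sit far from most of the others by a measure like this are candidates for closer inspection before they go into a concatenated alignment.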

written 5.3 years ago by Dan Gaston (6.9k)

I agree with you. When building a supermatrix, it is important to ensure that all included genes form orthologous groups, which is difficult. One also has to do proper partitioning and model testing to ensure the quality of the phylogenetic analysis. Plus, one really needs a great deal of computational resources to handle such a big data set.

written 5.3 years ago by qiyunzhu (420)
5.3 years ago by Joseph Hughes (2.5k), Scotland, UK, wrote:

However, if you are interested in horizontal gene transfer, it might make sense to build 2,000 trees. Rather than using BLAST to pull out genes for the phylogenies, I would use one of the databases that already has homologous sets of bacterial genes, such as HOGENOM. This will provide you with a good starting matrix for your phylogenies.

written 5.3 years ago by Joseph Hughes (2.5k)

Thank you for your answer. Do you mean that my colleague should BLAST against this HOGENOM database instead? But how can that (and the following steps to construct the trees) be done automatically?

written 5.3 years ago by sentausa (620)
5.3 years ago by SaggiSardar (20) wrote:

Is this genome a publicly available genome? If so, the SUPERFAMILY resource has a whole-genome phylogenetic tree of all fully-sequenced cellular genomes.

The problem with using BLAST to find all homologs is that it neglects the domain assignment problem, meaning that your sequence clusters and resulting alignments are often poor. The Wikipedia page has a good description of the matter. As a result, BLAST is good at finding a single close evolutionary homolog but very poor at finding more distant evolutionary relationships.

The tree that I link to above is a tree constructed using protein domains as morphological traits. It's already built and you could likely answer your question very quickly.

I would stay away from building huge supertrees of all gene homologs in bacteria, or at least think carefully about what question you are asking before you start. Building that many trees will not be computationally cheap (you will need some serious computing power), and merging them using supertree methods is not trivial, particularly when the trees conflict, which may or may not be the result of HGT.

Another option would be to see where it sits in the NCBI Taxonomy tree, which is a mixture of 16S rRNA and manual curation. It should offer an answer to your question.

written 5.3 years ago by SaggiSardar (20)

Thanks for your answer. The genome is publicly available as WGS data (in contigs), but my colleague hasn't published the annotation, which I think explains why we won't find it in the SUPERFAMILY database.

written 5.3 years ago by sentausa (620)
5.3 years ago by vijay (1.3k) wrote:

My first question: has your bacterium been characterized?

If yes, then a phylogenetic tree based on the 16S rRNA gene will be absolutely fine. It will give you a clear idea of where your bacterium stands. I don't see the point of a phylogenetic study of the entire set of genes.

written 5.3 years ago by vijay (1.3k)

Yes, it has been characterized, and my colleague already has the 16S rRNA gene tree. But his boss asked him to build these trees, so that is the main problem now. Thanks for answering anyway.

written 5.3 years ago by sentausa (620)