Hi there :)
My colleague has been asked by his boss to construct a phylogenetic tree for each gene in the genome of a new bacterium that was sequenced in our lab. One simple way to do that, my colleague believes, is to BLAST each gene against GenBank for similar sequences, build a multiple alignment from those similar sequences for each gene, and construct a tree for each gene. The problem is that the genome has around 2,000 genes, so we wonder if there is a way to automate this process.
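As a rough sketch of what automating this could look like, the snippet below splits a multi-FASTA file into genes and builds the three command lines (BLAST, alignment, tree) for each gene. The file layout, the `plan_gene` helper, and the choice of MAFFT and FastTree are assumptions for illustration, not something prescribed above; fetching the hit sequences between the BLAST and alignment steps is left out.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {sequence id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def plan_gene(gene_id, workdir="work", db="nr"):
    """Return command lines for one gene (hypothetical file layout)."""
    base = f"{workdir}/{gene_id}"
    query = f"{base}.faa"              # one protein per file
    hits = f"{base}.blast.tsv"         # tabular BLAST output
    homologs = f"{base}.homologs.faa"  # hit sequences, fetched separately
    aligned = f"{base}.aln.faa"        # alignment captured from MAFFT stdout
    return [
        ["blastp", "-query", query, "-db", db, "-remote",
         "-outfmt", "6", "-evalue", "1e-10",
         "-max_target_seqs", "50", "-out", hits],
        ["mafft", "--auto", homologs],  # writes the alignment to stdout
        ["FastTree", aligned],          # writes a Newick tree to stdout
    ]
```

Looping `plan_gene` over the IDs returned by `read_fasta` gives one set of commands per gene, which can then be run with `subprocess.run` or handed to a job scheduler.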
I myself don't really understand why my colleague's boss wants those trees, but my colleague told me that they would like to know the exact phylogenetic position of this new bacterium. Do you think this is a wise way to do that? I mean, I don't understand what they can conclude in the end if they have 2,000 phylogenetic trees that might not have similar topologies. I thought about a consensus tree, but what if the organisms differ from one tree to the next? Or does he want to find lateral gene transfers?
So if you have any ideas, comments, or experience with these kinds of problems, we would be really grateful to hear from you, especially about the tree construction part (it's what my colleague's boss asked for anyway).
Thank you, and how can my colleague contact you?
I'm sorry, I couldn't understand what you mean by batch BLAST. Can those programs (as well as yours) run BLAST with multiple queries (the 2,000 genes/proteins) at the same time? I've checked PyPhy and PhyloGena, but I couldn't find any hint that they can do that. Thanks.
They should be able to do batch BLAST. Check out the PhyloGena paper here, which contains a description of the pipeline. PyPhy may be a bit old, and the code isn't freely downloadable.
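To illustrate what batch BLAST means in practice: the BLAST+ command-line programs accept a FASTA file with many records as `-query`, and with `-outfmt 6` the first column of each output row is the query ID. The sketch below groups such tabular output by query; the function name and the E-value cutoff are assumptions, and this is not necessarily how PhyloGena or PyPhy handle it internally.

```python
from collections import defaultdict


def group_hits_by_query(tabular_text, max_evalue=1e-10):
    """Group BLAST tabular (-outfmt 6) hits by query ID.

    Default columns: qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        # Keep each subject once per query, in order of first appearance.
        if evalue <= max_evalue and subject not in hits[query]:
            hits[query].append(subject)
    return dict(hits)
```

Each key of the returned dictionary is one gene, and its list of subject IDs is the set of sequences to fetch and align for that gene's tree.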
My program can do precisely the job you want: it takes a list of proteins (genes) as input, runs BLAST for each protein against the NCBI database, and pulls down the returned hits, each labeled with taxonomic information. It then does batch phylogenetic reconstruction locally, using the program of your choice (I use Neighbor-Joining because it's fast, but I also include a Maximum Likelihood option). One can also choose to realign and trim the sequences before making trees. Finally, the trees are labeled with organism names (instead of accession numbers) so that users can simply open each tree and eyeball it. The results (2,000 analyses) can be summarized according to your specific requirements.
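The labeling step can be sketched on its own. The snippet below is only an illustration of the idea, not the code from my program: it assumes a Newick string with unquoted tip labels and no whitespace, plus a hypothetical `names` dictionary mapping accession numbers to organism names.

```python
import re


def relabel_newick(newick, names):
    """Replace tip labels found in `names` with organism names.

    Tip labels are the tokens that directly follow '(' or ','; labels
    after ')' (internal node names or support values) are left alone.
    """
    def swap(match):
        prefix, label = match.group(1), match.group(2)
        if label not in names:
            return match.group(0)
        # Replace whitespace and Newick punctuation so the tree stays valid.
        safe = re.sub(r"[\s():;,\[\]']+", "_", names[label])
        # Keep the accession as a suffix so tips remain unique.
        return f"{prefix}{safe}_{label}"

    return re.sub(r"([(,])([^(),:;]+)", swap, newick)
```

Keeping the accession as a suffix matters when several hits come from the same organism, since duplicate tip names confuse most tree viewers.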
Your colleague may reach me by email: qiyunzhu@gmail.com. My information is available through my Google Scholar link.
Don't simply concatenate all of the genes together. My old PhD lab works fairly extensively with supermatrices, and it is important to consider the congruence/incongruence of the genes in your dataset.
I agree with you. When building a supermatrix, it is important to ensure that all genes included are orthologous groups, which is difficult. One also has to do proper partitioning and model testing to ensure the quality of the phylogenetic analysis. Plus, one really needs a great deal of computational resources to handle such a big data set.
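For what it's worth, the bookkeeping side of partitioning is easy to script. The sketch below concatenates per-gene alignments while recording where each gene starts and ends, so that a separate model can later be assigned to each partition. The input layout (gene name mapped to a dictionary of taxon and aligned sequence) is an assumption, and the code does nothing about orthology or congruence, which still have to be checked beforehand.

```python
def build_supermatrix(alignments):
    """Concatenate per-gene alignments and record partition ranges.

    `alignments` maps gene name -> {taxon: aligned sequence}. Taxa that
    are missing from a gene are padded with '-' so every row keeps the
    same total length. Returns (matrix, partitions), where partitions
    is a list of (gene, start, end) with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    rows = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, "-" * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    matrix = {taxon: "".join(parts) for taxon, parts in rows.items()}
    return matrix, partitions
```

The partition list can then be written out in whatever format the tree-building program expects.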