Question: Phylogenetic tree for very large numbers of bacterial assemblies
0
11 months ago by
yairgatt10
yairgatt10 wrote:

Hello all, I am currently trying to construct a phylogenetic tree for a large number of bacterial assemblies (hundreds to thousands) derived from the same bacterial species. I am quite willing to lose a lot of the data and to use only the 16S sequences or a small subset of genes. Unfortunately, it seems many phylogenetic and phylogenomic methods cannot handle such a large number of sequences. Does anyone know of a method that might be able to construct the tree I am looking for? We have a large cluster available and can spare up to several weeks for the construction.

Many thanks, Yair

ADD COMMENTlink modified 11 months ago by Mensur Dlakic7.2k • written 11 months ago by yairgatt10
1

Can you quantify "large" by providing some ballpark numbers? If you are only going to use a small subset of genes, then you could remove redundancy and use one representative sequence for each group of assemblies whose sequences are identical. Programs like MAFFT should be able to handle a large number of sequences.
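
To illustrate the redundancy-removal idea, here is a minimal sketch in Python. It assumes the per-assembly marker sequences are already loaded into a dictionary; the function name `group_identical` and the toy sequences are hypothetical and only for illustration.

```python
from collections import defaultdict


def group_identical(seqs):
    """Group assembly IDs that share an identical sequence.

    seqs: dict mapping assembly ID -> sequence string (hypothetical input).
    Returns a dict mapping one representative ID -> list of all member IDs.
    """
    by_seq = defaultdict(list)
    for asm_id, seq in seqs.items():
        # Compare case-insensitively so soft-masked copies still match.
        by_seq[seq.upper()].append(asm_id)
    # The first assembly seen for each distinct sequence is the representative.
    return {members[0]: members for members in by_seq.values()}


# Toy data: asm1 and asm2 carry the same sequence, asm3 differs by one base.
example = {"asm1": "ATGCCGTA", "asm2": "atgccgta", "asm3": "ATGCCGTT"}
groups = group_identical(example)
print(groups)
```

Only the representatives would then need to be aligned and placed in the tree; the other members of each group can be attached to their representative afterwards.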

ADD REPLYlink written 11 months ago by genomax92k

Thank you for your reply. Ideally, we are looking in the ballpark of 5,000 assemblies. If that is not possible we have a smaller set of about 1,500 assemblies. I really like the idea about removing the redundancy! But do you think there is any way to do it with a set of several dozen well-conserved genes that would ideally together not have much redundancy between the assemblies?

ADD REPLYlink written 11 months ago by yairgatt10
1

Since these are same species assemblies there should be plenty of redundancy. Have you done any preliminary exploration?

ADD REPLYlink written 11 months ago by genomax92k

I haven't done preliminary exploration yet, as I haven't determined the subset of genes I will use to perform the analysis. It seems I might need to include a large number of genes for this analysis, since I understand that the redundancy could be a problem. 16S is definitely not possible at this resolution.

ADD REPLYlink written 11 months ago by yairgatt10
1

Yairgat,

As genomax said, you should remove redundant genomes. Use pyani or a similar ANI-based method to reduce the dataset, then use bcgTree for the phylogenetic analysis.
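
One way to reduce the dataset once pairwise identities are available is a simple greedy pass, sketched below in Python. This is an assumption about how the reduction could be done, not the method of any particular tool; the function name, the `identity` callback and the 0.999 threshold are all hypothetical.

```python
def greedy_representatives(ids, identity, threshold=0.999):
    """Greedy dereplication sketch.

    ids: assembly IDs in the order they should be considered.
    identity: function (id_a, id_b) -> pairwise identity in [0, 1] (hypothetical).
    threshold: identity at or above which two assemblies count as redundant.
    Returns (representatives, assignment of every ID to its representative).
    """
    reps = []
    assignment = {}
    for asm in ids:
        for rep in reps:
            if identity(asm, rep) >= threshold:
                assignment[asm] = rep  # redundant with an existing representative
                break
        else:
            reps.append(asm)  # nothing close enough yet: new representative
            assignment[asm] = asm
    return reps, assignment


# Toy identities for three assemblies (made-up values for illustration only).
pairs = {
    frozenset(("a", "b")): 0.9995,
    frozenset(("a", "c")): 0.98,
    frozenset(("b", "c")): 0.981,
}
reps, assignment = greedy_representatives(
    ["a", "b", "c"], lambda x, y: pairs[frozenset((x, y))]
)
print(reps, assignment)
```

Each assembly is compared only against the current representatives, so the number of comparisons shrinks as redundancy grows.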

ADD REPLYlink written 11 months ago by andres.firrincieli1.0k

Many thanks, I was not familiar with these methods!

ADD REPLYlink written 11 months ago by yairgatt10
1
11 months ago by
Mensur Dlakic7.2k
USA
Mensur Dlakic7.2k wrote:

ezTree will do what you want, though I do question the information one could get from building trees for thousands of closely related assemblies. It is very likely that many of them will be identical, so removing redundancy should help. Even those that are non-identical will be > 99% identical, and at that point trees are unlikely to provide fine enough resolution to meaningfully separate your (sub)species.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Mensur Dlakic7.2k

Thank you for the helpful comment! I am hoping to use the constructed tree as a null hypothesis of sorts to compare against the phylogenetic profiles of several sequences. Do you think that using a core set of conserved genes will not have enough resolution to clearly separate the different strains into a few clades? Is there any other way I could construct a phylogenetic tree to sufficiently separate such close strains? I am afraid using programs like kSNP would not be possible with this number of assemblies.

ADD REPLYlink written 11 months ago by yairgatt10
1

Do you think that using a core set of conserved genes will not have enough resolution to clearly separate the different strains to a few clades?

Several things should be pointed out here. I assume that with that many assemblies it is unlikely that they will all be complete. That is both good and bad. If they were all complete, you would have thousands of shared genes to concatenate, which would make tree construction very difficult. On the other hand, the presence or absence of "core" proteins in various strains will likely be determined by randomness or sequencing rather than by their true conservation across different strains. Assuming that is the case, ezTree might end up with a random collection of "core" proteins that may or may not be informative.

I don't know enough about the setup of your experiment and the differences between strains to make an educated guess as to whether you will be able to clearly separate them. Regardless, if the strains are very closely related, that tree in my estimation can't be anything more than a convenient way to catalog your strains. By the way, have you ever looked at a tree with thousands of branches? I've done my share of looking at hundreds of branches, and can't imagine that I would ever want to sift through thousands.

Lastly, these two programs may give you some indication of the distances between your strains, and they will be much faster than tree building.

https://github.com/marbl/Mash

https://github.com/dib-lab/sourmash
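
For intuition about what these tools compute, here is a toy MinHash sketch in Python. It is not Mash's or sourmash's implementation (this sketch uses MD5 and ignores reverse complements, and the tiny `k` and sketch size are chosen only for the toy example); it merely illustrates estimating the Jaccard index from bottom-s sketches and converting it to a distance with the Mash formula D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, size=50):
    """Bottom-`size` MinHash sketch: the smallest k-mer hash values."""
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=50):
    """Estimate the Jaccard index from two bottom sketches."""
    merged = sorted(set(a) | set(b))[:size]  # bottom sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance for k-mer size k."""
    if j <= 0:
        return 1.0
    if j >= 1:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k


# Toy sequences (made up): s2 is s1 with a single substitution.
s1 = "ATGGCGTACGTTAGCCTAGGATCCGATAGCTAGCTAGGCTTAACG"
s2 = "ATGGCGTACGTTAGCCTAGGATCCGATAGCTAGCTAGGCTTTACG"
j_same = jaccard_estimate(sketch(s1), sketch(s1))
j_diff = jaccard_estimate(sketch(s1), sketch(s2))
print(mash_distance(j_same, 5), mash_distance(j_diff, 5))
```

Because only the fixed-size sketches are compared, the cost per pair does not grow with genome length, which is why these tools are much faster than alignment-based tree building.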

ADD REPLYlink written 11 months ago by Mensur Dlakic7.2k

Regarding the experimental bias, it is definitely true: the difference in quality between assemblies is tremendous. Using any "core" genes will probably require keeping only those assemblies that have a contig with a hit covering the complete length of each gene, or something along those lines.
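
A filter along those lines could look like the sketch below, written in Python. The input layout (a list of `(assembly, gene, aligned_length)` tuples, one per single-contig hit), the gene names and lengths, and the 0.95 coverage cutoff are all hypothetical and would need to be adapted to whatever search tool produces the hits.

```python
def complete_assemblies(hits, gene_lengths, min_cov=0.95):
    """Keep assemblies with a near-full-length single-contig hit for every gene.

    hits: iterable of (assembly_id, gene, aligned_length) tuples (hypothetical layout).
    gene_lengths: dict mapping gene -> reference length.
    min_cov: minimum fraction of the gene that one hit must cover.
    """
    best = {}
    for asm, gene, aln_len in hits:
        if gene not in gene_lengths:
            continue  # ignore hits to genes outside the chosen core set
        cov = aln_len / gene_lengths[gene]
        key = (asm, gene)
        best[key] = max(best.get(key, 0.0), cov)  # best single hit per gene
    asms = {asm for asm, _, _ in hits}
    return sorted(
        a for a in asms
        if all(best.get((a, g), 0.0) >= min_cov for g in gene_lengths)
    )


# Toy data (made-up lengths): only asm1 has both genes at near full length.
gene_lengths = {"rpoB": 4000, "gyrB": 2400}
hits = [
    ("asm1", "rpoB", 4000), ("asm1", "gyrB", 2390),
    ("asm2", "rpoB", 4000), ("asm2", "gyrB", 1200),
    ("asm3", "rpoB", 3990),
]
kept = complete_assemblies(hits, gene_lengths)
print(kept)
```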

Regarding the visual inspection, it won't be necessary, since we are hoping to utilize the resulting tree in a computational pipeline regarding conservation patterns.

Regarding Mash, I love it and it was also my first choice (creating a distance matrix with Mash and using it with something like kSNP). Unfortunately, judging from my previous tests, computing the full 5,000 × 5,000 matrix of Mash distances would take too long.
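
As a rough back-of-the-envelope check, the number of unique pairs can be computed directly. The sketch below, in Python, uses only the two dataset sizes mentioned earlier in the thread (5,000 and 1,500 assemblies) and assumes a symmetric distance, so each pair is counted once.

```python
import math

# Unique unordered pairs for the two dataset sizes discussed in this thread.
pairs_full = math.comb(5000, 2)
pairs_small = math.comb(1500, 2)

print(pairs_full)   # 12497500
print(pairs_small)  # 1124250
print(round(pairs_full / pairs_small, 1))  # 11.1
```

Since the pair count grows quadratically, any reduction in the number of assemblies (for example by removing redundant ones first) cuts the number of distance calculations disproportionately.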

Thanks again for the helpful comments!

ADD REPLYlink written 11 months ago by yairgatt10