Question

How many genomes from BLAST hits should be considered for phylogenetic tree analysis?

4.5 years ago

DanielC ▴ 170

Dear Friends,

I am trying to perform a phylogenetic analysis on an assembled genome. A BLASTn search returned 250 genomes. I would like to generate a phylogenetic tree from these genomes together with the assembled genome. However, all the programs I have tried so far (Mugsy, Clustal Omega, MAFFT) failed to produce an alignment for these genomes. The genomes are about 80,000 to 100,000 bases long. Could you please let me know how I can perform a phylogenetic analysis using these genomes?

I have performed a gene-level phylogenetic analysis using the terminase subunit of these genomes and got results, but I am interested in a genome-level phylogenetic analysis. Thanks!

phylogeny tree taxonomy genome • 1.2k views

ADD COMMENT • link updated 4.5 years ago by Biostar 20 • written 4.5 years ago by DanielC ▴ 170

What if one of the genomes is poorly assembled? Will that not influence the tree?

ADD REPLY • link 4.5 years ago by gb ★ 2.2k

All validations have been done for the assembled genome, and it is highly accurate. Regarding the other genomes from the BLASTn hits, I cannot say, but the sequence identity and query coverage are > 92% for most of the BLAST hits. Also, that is not the concern for me at the moment. Could you please let me know if you have any suggestions on the question I asked? Thanks!

ADD REPLY • link 4.5 years ago by DanielC ▴ 170

"Also, that is not the concern for me at the moment."

Yes it is, otherwise the tree is useless. If the genomes are from the same organism, you could base it on the SNPs in the exons or ORFs, but that requires some work. For evolutionary distance, using marker genes like 16S, COI, ITS, etc. is the easiest way, I guess.

ADD REPLY • link 4.5 years ago by gb ★ 2.2k

Sounds like these are phages, so finding 'markers' like 16S will likely not be an option.

There will be some reasonably well-conserved proteins (potentially like the terminase OP mentioned -- I don't know), but it depends how wide you want to cast the net. Things get weird quick when you study viruses.

ADD REPLY • link 4.5 years ago by Joe 21k

Ah, thanks for the heads-up. Indeed, especially with phages and their replication rate. Would it be an option for the OP to find some genes that are present in all the genomes (so more than just the one terminase), paste them together as one sequence, and use that to make an alignment?

ADD REPLY • link 4.5 years ago by gb ★ 2.2k

That's one option I think, yeah, but it would need some good literature backup, ideally with experimental proof that those genes are decently conserved and don't recombine too much. Even then, that probably only gets you out to the family level if you're lucky. If you're comparing between phage families, life is really hard (mainly because phage families have historically been determined morphologically, rather than genetically/evolutionarily).

There are tools available to simply cluster all of the orthologues in the genome though, so that should give you maximal evolutionary signal, and you might as well use as many genes as possible rather than trying to cherry pick a few and risk making dodgy assumptions.

ADD REPLY • link 4.5 years ago by Joe 21k

The trick is not to use whole genomes. It's impossible to fully align (at all, let alone accurately) that many genomes of even that comparatively small size.

I would suggest you look at using concatenated orthologue alignments (i.e., cluster all the genes, do multiple alignments with them, then concatenate the alignments and use that to calculate a tree), or use mash distances as a surrogate for sequence identity and draw a tree using those as your sequence metric.
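The concatenation step described above can be sketched as follows. This is a minimal illustration, not a full pipeline: it assumes each orthologue cluster has already been aligned by an external tool and loaded as a dictionary of genome name to aligned sequence. The function name `concatenate_alignments` and the phage names are hypothetical, and padding a genome that lacks a gene with gaps is an assumption made here, not something prescribed in the thread.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments into one concatenated alignment.

    gene_alignments: list of dicts mapping genome name -> aligned sequence.
    Sequences inside one dict must all have the same length.
    Genomes missing from a gene alignment are padded with gap characters
    (an assumption of this sketch).
    """
    # Every genome that appears in at least one gene alignment.
    genomes = sorted({name for aln in gene_alignments for name in aln})
    parts = {name: [] for name in genomes}

    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for name in genomes:
            parts[name].append(aln.get(name, gap * width))

    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical toy alignments for two gene clusters.
terminase = {"phageA": "ATG-CA", "phageB": "ATGGCA", "phageC": "ATGACA"}
portal = {"phageA": "GGT", "phageB": "GGA"}

supermatrix = concatenate_alignments([terminase, portal])
for name, seq in supermatrix.items():
    print(f">{name}\n{seq}")
```

The printed FASTA-style records could then be passed to whichever tree-building program you prefer.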

What you're currently trying to do is simply never going to work.

ADD REPLY • link 4.5 years ago by Joe 21k

Thanks! I have performed a gene-level analysis (using the terminase subunit of the genomes) and generated a tree from that. Do you think that analysis is good enough to predict the family or order the assembled genome belongs to? I am working on phages.

"or use mash distances as a surrogate for sequence identity and draw a tree using those as your sequence metric."

Could you please guide me on how to do this? Thanks!

ADD REPLY • link 4.5 years ago by DanielC ▴ 170

I would suggest that a tree from a single gene, in something as highly recombinant as a phage, is unlikely to be enough. You probably need to do this a couple of times with other genes. If there is good evidence in the literature that that terminase is a reliable marker gene (akin to a 'housekeeping' gene in bacteria) then you might be fine (but I don't know for your specific case).

The best thing to do would be to find several genes (ideally as many as possible) and then try to reconcile the trees from each gene alignment.
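Before reconciling gene trees, it can help to see how much they disagree. The sketch below is one simple way to do that, assuming the trees are rooted and written as nested tuples of leaf names. It counts the clades found in one tree but not the other (a rooted Robinson-Foulds-style distance). The function names and the toy trees are hypothetical, and real analyses would normally read Newick files with a dedicated phylogenetics library.

```python
def clades(tree):
    """Return (leaves, clades) for a rooted tree given as nested tuples.

    A leaf is a string; an internal node is a tuple of child subtrees.
    Each clade is the frozenset of leaf names under one internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same genomes")
    return len(clades_a ^ clades_b)


# Hypothetical gene trees for the same four genomes.
terminase_tree = (("phageA", "phageB"), ("phageC", "phageD"))
portal_tree = (("phageA", "phageC"), ("phageB", "phageD"))

print(clade_distance(terminase_tree, terminase_tree))  # 0: identical topology
print(clade_distance(terminase_tree, portal_tree))     # 4: the two trees conflict
```

A distance of zero means the two genes support the same groupings; larger values suggest conflicting signal, for example from recombination.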

For using mash distances, it's pretty trivial; take a look at: https://github.com/lskatz/mashtree

Note, mash distances are not true evolutionary distances - they are more like an approximation. This would probably be sufficient to give you a reasonably accurate topology, given sufficient data, but you may not want to read too much into branch lengths etc.
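To make the "approximation" point concrete, here is a rough sketch of the distance that Mash reports, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. It is a simplification under stated assumptions: it compares full k-mer sets rather than MinHash sketches and ignores reverse complements, so it will not reproduce Mash's numbers exactly. The function names and the toy sequences are hypothetical.

```python
import math


def kmers(seq, k):
    """Set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_like_distance(seq_a, seq_b, k=21):
    """Mash-style distance from the exact Jaccard index of two k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    jaccard = len(a & b) / len(union)
    if jaccard == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return -math.log(2 * jaccard / (1 + jaccard)) / k


# Hypothetical toy sequences; a small k is used only because they are short.
d_same = mash_like_distance("ACGTACGT", "ACGTACGT", k=3)
d_close = mash_like_distance("ACGTACGT", "ACGTACGA", k=3)
d_far = mash_like_distance("AAAAAAAA", "CCCCCCCC", k=3)
print(d_same, d_close, d_far)
```

Because the value depends only on shared k-mers, it tracks overall sequence similarity without modelling substitutions along branches, which is why the topology is usually more trustworthy than the branch lengths.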

ADD REPLY • link 4.5 years ago by Joe 21k