Question

many taxa and many seqs phylogeny

3 months ago

sapuizait ▴ 10

Hi all

I am trying to build a phylogeny of 1500 bacterial genomes from a concatenated alignment of 1100 genes (amino acid sequences).

Finding the orthologs and building the alignment was not (too) difficult, but building a tree out of this monster has become a bit complicated... I have always used RAxML and have now switched to RAxML-NG almost exclusively. Usually, with these concatenated alignments, I first run ModelTest to find the appropriate model for each gene/protein alignment, then concatenate the alignments, and then feed the result to RAxML to run bootstraps and find the best-fitting tree.

I am using a cluster with 64 threads and 500 GB of RAM. However, RAxML-NG has now been stuck at the very first step for almost a whole day (23 h): "Starting ML tree search with 20 distinct starting trees".

I understand that trying to build a tree with that amount of data may be wishful thinking, but what is the alternative? How do people with even bigger alignments manage? Should I switch to FastTree or IQ-TREE? Does anyone have good experience with those?

For the record, here is the command I used:

raxml-ng --all --data-type AA --threads 64 --msa concatenated.phy --model partitions3 --bs-trees 100

Thank you in advance for the advice

raxML fasttree phylogeny • 412 views

score 2 · Answer 1 · 2024-01-12

We don't know how many columns you have in the alignment, but let's say 150 per protein. With 1100 proteins, that would make a tree based on a 1500 x 165,000 matrix. To give you a sense of your challenge, on a fast computer with 20 CPUs it takes about half a day for a 150 x 15,000 matrix. I am not even sure that this scales linearly, or that you have enough memory for it. I don't mean this in an insulting way, but what you are trying to do is crazy. And unnecessary.
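To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. The 150 columns per protein is the guess made above, and the runtime extrapolation assumes that time grows linearly with the number of matrix cells, which is an optimistic assumption rather than a measured property of RAxML-NG.

```python
# Figures taken from the thread; columns per protein is a guess, not a measurement.
n_genomes = 1500
n_genes = 1100
cols_per_gene = 150  # assumed average alignment length per protein

total_cols = n_genes * cols_per_gene
big_cells = n_genomes * total_cols

# Reference run quoted above: a 150 x 15,000 matrix, about half a day on 20 CPUs.
ref_cells = 150 * 15_000
ref_days = 0.5

ratio = big_cells / ref_cells
# Naive extrapolation: assumes runtime is linear in matrix cells (likely optimistic).
naive_days = ref_days * ratio

print(f"alignment: {n_genomes} x {total_cols:,} = {big_cells:,} cells")
print(f"size ratio vs reference: {ratio:.0f}x")
print(f"naive linear estimate: {naive_days:.0f} days on 20 CPUs")
```

Even under the linear assumption the estimate is on the order of weeks, and tree search generally gets harder as taxa are added, so the real figure could be worse.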

Have you ever looked at a tree with more than 200-300 branches? It is almost impossible to fit even 1/4 of such a tree on the screen at once, let alone a tree that has 1500 branches.

The question I'd be asking myself: do I really need 1500 bacterial genomes? It is a safe bet that many of them are closely related, or even strains of the same species. Why not use a single representative for every 10 or so related strains/species, knowing that all the others would sit right next to it in the tree?
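One way to pick such representatives is a simple greedy pass over pairwise genome distances. The sketch below is only an illustration: the strain names, the distances, and the 0.05 threshold are all made up, and in practice the distances would come from whatever genome-comparison tool you already use.

```python
from itertools import combinations


def pick_representatives(names, dist, threshold):
    """Greedy dereplication: keep a genome only if it is farther than
    `threshold` from every representative kept so far."""
    reps = []
    for name in names:
        if all(dist[frozenset((name, rep))] > threshold for rep in reps):
            reps.append(name)
    return reps


# Toy input (hypothetical strain names and distances, not real data).
names = ["strainA1", "strainA2", "strainA3", "strainB1", "strainB2"]
dist = {}
for a, b in combinations(names, 2):
    same_group = a[:7] == b[:7]  # "strainA" vs "strainB"
    dist[frozenset((a, b))] = 0.01 if same_group else 0.20

reps = pick_representatives(names, dist, threshold=0.05)
print(reps)
```

The result depends on input order, because the first genome seen in each cluster becomes its representative. Sorting the input by assembly quality beforehand would be one way to favor the best genome per cluster.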

Next thing I would ask myself: is the tree placement going to be materially different when I concatenate 1100 proteins instead of 50-100 carefully chosen single-copy genes? Not sure what kind of resolution you are hoping to achieve, but most large-scale concatenated alignments are based on a small number of single-copy genes - see how they do it at GTDB:

https://gtdb.ecogenomic.org/
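If you already have an ortholog table, narrowing 1100 proteins down to a small single-copy set can be sketched as below. This is not how GTDB selects its markers. It is a hypothetical filter, and the `copy_counts` layout, the function name, and the thresholds are assumptions made for illustration.

```python
def select_single_copy(copy_counts, min_fraction=0.95, max_genes=100):
    """copy_counts maps gene -> {genome: number of copies}.

    Return up to `max_genes` genes that occur in exactly one copy in at
    least `min_fraction` of all genomes, most widespread first.
    """
    genomes = {g for counts in copy_counts.values() for g in counts}
    scored = []
    for gene, counts in copy_counts.items():
        single = sum(1 for g in genomes if counts.get(g, 0) == 1)
        fraction = single / len(genomes)
        if fraction >= min_fraction:
            scored.append((fraction, gene))
    # Highest single-copy fraction first; ties broken by gene name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gene for _, gene in scored[:max_genes]]


# Toy table (hypothetical gene and genome names).
toy = {
    "geneA": {"g1": 1, "g2": 1, "g3": 1},
    "geneB": {"g1": 1, "g2": 2, "g3": 1},  # duplicated in g2
    "geneC": {"g1": 1, "g3": 1},           # missing from g2
}
selected = select_single_copy(toy, min_fraction=1.0)
print(selected)
```

Loosening `min_fraction` trades completeness for a larger marker set. Either way, the concatenated alignment ends up far smaller than one built from all 1100 proteins.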

If I didn't dissuade you from doing things the way you described: