many taxa and many seqs phylogeny
3 months ago
sapuizait ▴ 10

Hi all

I am trying to build a phylogeny using 1500 bacterial genomes with a concatenated alignment of 1100 genes (amino acid sequences).

Finding the orthologs and building the alignment was not (too) difficult, but building a tree out of this monster has turned out to be a bit complicated... I have always used RAxML and have now switched to RAxML-NG almost exclusively. Usually with these concatenated alignments, I first run modeltest to find the appropriate model for each gene/protein alignment, then concatenate the alignments, and then feed the result to RAxML to run bootstraps and find the best-fitting tree.

I am using a cluster with 64 threads and 500GB RAM. However, for a whole day (23h) RAxML has been stuck at the very first step: "Starting ML tree search with 20 distinct starting trees".

I understand that trying to build a tree with that amount of data may be wishful thinking, but what is the alternative? How do people with even bigger alignments manage? Should I switch to FastTree or IQ-TREE? Does anyone have good experience with those?

For the record, here is the command I used:

raxml-ng --all --data-type AA --threads 64 --msa concatenated.phy --model partitions3 --bs-trees 100
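For readers following along: the contents of `partitions3` are not shown in the post. As a minimal sketch, assuming it is a RAxML-NG style partition file with one `MODEL, name = start-end` line per gene, it could be generated from the per-gene models and alignment lengths roughly like this. The gene names, models and lengths below are made up for illustration.

```python
def build_partition_lines(genes):
    """Build partition lines for a concatenated alignment.

    genes: list of (name, model, alignment_length) tuples, in the same
    order in which the gene alignments were concatenated.
    """
    lines = []
    start = 1  # alignment columns are 1-based
    for name, model, length in genes:
        end = start + length - 1
        lines.append(f"{model}, {name} = {start}-{end}")
        start = end + 1
    return lines


# Hypothetical genes; in practice the model would come from the
# per-gene model selection step and the length from each alignment.
genes = [
    ("geneA", "LG+G4", 150),
    ("geneB", "WAG+G4", 210),
    ("geneC", "JTT+I+G4", 95),
]

partition_lines = build_partition_lines(genes)
print("\n".join(partition_lines))
```

The column ranges have to match the concatenation order exactly, so it is worth checking that the last `end` value equals the total alignment length.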

Thank you in advance for the advice

3 months ago
Mensur Dlakic ★ 27k

We don't know how many columns you have in the alignment, but let's say 150 per protein. With 1100 proteins, that would make a tree that is based on a 1500 x 165,000 matrix. To give you a sense of your challenge, on a fast computer with 20 CPUs it takes about half a day for a 150 x 15,000 matrix. I'm not even sure that this scales linearly, or that you have enough memory for it. I don't mean this in an insulting way, but what you are trying to do is crazy. And unnecessary.
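To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It takes the 150 columns per protein as given (a guess, as stated above) and only compares matrix sizes; it says nothing about how the tree search itself scales with the number of taxa.

```python
# Figures from the thread; cols_per_gene is the assumed average.
genomes = 1500
genes = 1100
cols_per_gene = 150

columns = genes * cols_per_gene   # alignment width
cells = genomes * columns         # taxa x columns

# Reference run mentioned above: 150 x 15,000 matrix, about half a day.
ref_cells = 150 * 15_000
ratio = cells / ref_cells

print(f"alignment columns: {columns:,}")
print(f"matrix cells:      {cells:,}")
print(f"size vs reference: {ratio:.0f}x")
```

So the matrix is about 110 times larger than the reference case. Even if runtime grew only linearly with matrix size, that would already be on the order of weeks on comparable hardware, and the number of possible topologies grows much faster than linearly with the number of taxa.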

Have you ever looked at a tree with more than 200-300 branches? It is almost impossible to fit even 1/4 of that tree on the screen at once, let alone a tree that has 1500 branches.

The question I'd be asking myself: do I really need 1500 bacterial genomes? It is a safe bet that many of them are closely related, or even strains of the same species. Why not use only a single representative for every 10 or so related strains/species, knowing that all the others would be right next to it in a tree?
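One simple way to pick such representatives is a greedy pass over pairwise genome distances: keep a genome only if it is not too close to anything already kept. This is just a sketch under assumptions. The distances and the 0.05 cutoff below are invented, and in practice they would come from whatever genome distance measure you trust.

```python
def pick_representatives(names, dist, threshold):
    """Greedy dereplication.

    Keep a genome only if its distance to every representative kept
    so far is greater than `threshold`. `dist` maps a frozenset of two
    genome names to their pairwise distance.
    """
    reps = []
    for name in names:
        if all(dist[frozenset((name, r))] > threshold for r in reps):
            reps.append(name)
    return reps


# Made-up example: three near-identical strains and two distinct species.
names = ["strainA1", "strainA2", "strainA3", "speciesB", "speciesC"]
pairs = {
    ("strainA1", "strainA2"): 0.001,
    ("strainA1", "strainA3"): 0.002,
    ("strainA2", "strainA3"): 0.001,
    ("strainA1", "speciesB"): 0.12,
    ("strainA2", "speciesB"): 0.12,
    ("strainA3", "speciesB"): 0.13,
    ("strainA1", "speciesC"): 0.20,
    ("strainA2", "speciesC"): 0.21,
    ("strainA3", "speciesC"): 0.20,
    ("speciesB", "speciesC"): 0.18,
}
dist = {frozenset(k): v for k, v in pairs.items()}

reps = pick_representatives(names, dist, threshold=0.05)
print(reps)  # ['strainA1', 'speciesB', 'speciesC']
```

The result depends on the input order, so it helps to sort the genomes first by whatever quality criterion matters (completeness, contamination, reference status) so that the best genome of each group is the one that gets kept.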

Next thing I would ask myself: is the tree placement going to be materially different when I concatenate 1100 proteins instead of 50-100 carefully chosen single-copy genes? Not sure what kind of resolution you are hoping to achieve, but most large-scale concatenated alignments are based on a small number of single-copy genes - see how they do it at GTDB:

https://gtdb.ecogenomic.org/
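If you already have ortholog assignments, narrowing 1100 genes down to single-copy candidates can be as simple as a filter on copy counts. The sketch below is not GTDB's procedure, only an illustration. The `copy_counts` table, gene names and `min_fraction` parameter are hypothetical.

```python
def single_copy_core(copy_counts, genomes, min_fraction=1.0):
    """Return genes that are never multi-copy and are present (exactly
    one copy) in at least `min_fraction` of the genomes.

    copy_counts: dict mapping gene -> {genome: number of copies};
    a genome missing from the inner dict counts as zero copies.
    """
    selected = []
    for gene, counts in copy_counts.items():
        # Disqualify the gene if any genome carries more than one copy.
        if any(counts.get(g, 0) > 1 for g in genomes):
            continue
        present = sum(1 for g in genomes if counts.get(g, 0) == 1)
        if present / len(genomes) >= min_fraction:
            selected.append(gene)
    return selected


# Toy example with three genomes.
genomes = ["g1", "g2", "g3"]
copy_counts = {
    "rpoB": {"g1": 1, "g2": 1, "g3": 1},
    "gyrB": {"g1": 1, "g2": 1, "g3": 1},
    "tuf":  {"g1": 2, "g2": 1, "g3": 1},   # duplicated in g1
    "recA": {"g1": 1, "g2": 1},            # missing from g3
}

core = single_copy_core(copy_counts, genomes)
print(core)  # ['rpoB', 'gyrB']
```

Lowering `min_fraction` (for example to 0.95) tolerates a few incomplete genomes, at the cost of some missing data in the concatenated alignment.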

If I didn't dissuade you from doing things the way you described:


Thank you - I admit I got sidetracked by someone who told me that it's feasible, so I thought, why not, but you make a great point. There is no need to use 1100 genes; I'll select some housekeeping genes plus other relevant ones and build the phylogeny from those.

I looked a bit into FastTree; the efficiency is impressive, but as far as I can see you have to apply a single model such as JTT or LG to the entire alignment, with no option for partitioned concatenated alignments.
