Question

Phylogenetic Tree from Massive Multifasta Alignment?

0

2.6 years ago

jdru ▴ 10

Hi all,

I have a very large multifasta alignment (~30,000 sequences, each ~17,000 bases) and I am wondering whether it is too large to construct a phylogenetic tree from. If not, which program would be most appropriate for this use case?

Thank you!

tree alignment fasta phylogeny • 1.8k views

ADD COMMENT • link 2.6 years ago by jdru ▴ 10

0

How was the multifasta generated? Generally I would be very skeptical of the quality of any MSA of that size. Most tools break down long before that.

ADD REPLY • link 2.6 years ago by Joe 21k

0

It was generated with MAFFT. I agree; constructing the tree is actually part of the post-processing/quality checking.

ADD REPLY • link 2.6 years ago by jdrubin • 0

0

I would suggest using RAxML-NG or iqtree. I believe that iqtree is faster than RAxML though.

ADD REPLY • link 2.6 years ago by Sej Modha 5.3k

1

Unless OP has thousands of cores, I think he would be better off with e.g. fasttree

ADD REPLY • link 2.6 years ago by 5heikki 11k

0

IIRC iqtree has a fast mode which performs comparably to fasttree
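
For anyone wanting to try both, here is a minimal sketch of the two invocations being compared, written as Python helpers that only build the command lines. The executable names (`FastTree`, `iqtree2`), the GTR-based models and the file name `aln.fasta` are my assumptions for a nucleotide alignment, and the flags are as I understand them from each tool's documentation, so please check them against your installed versions.

```python
import shlex
from typing import List


def fasttree_cmd(alignment: str) -> List[str]:
    """Build a FastTree command for a nucleotide alignment.

    -nt selects nucleotide mode and -gtr the GTR model.
    FastTree writes the Newick tree to stdout.
    """
    return ["FastTree", "-nt", "-gtr", alignment]


def iqtree_fast_cmd(alignment: str, threads: str = "AUTO") -> List[str]:
    """Build an IQ-TREE 2 command using its fast search mode.

    -s is the alignment, -m the substitution model (assumed here),
    -fast the fast tree search, -T the number of threads.
    """
    return ["iqtree2", "-s", alignment, "-m", "GTR+G", "-fast", "-T", threads]


if __name__ == "__main__":
    # Print the commands rather than running them.
    for cmd in (fasttree_cmd("aln.fasta"), iqtree_fast_cmd("aln.fasta")):
        print(shlex.join(cmd))
```

To actually run one, something like `subprocess.run(fasttree_cmd("aln.fasta"), stdout=open("tree.nwk", "w"), check=True)` should do, since FastTree prints the tree to stdout.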

ADD REPLY • link 2.6 years ago by Joe 21k

0

Just curious: any reason you have and use two accounts?

ADD REPLY • link 2.6 years ago by Mensur Dlakic ★ 27k

0

Oh sorry, I forgot I had already made an account this summer to ask a question (before getting my DTU email). I will go delete the old one.

ADD REPLY • link 2.6 years ago by jdru ▴ 10

score 1 · Answer 1 · 2021-09-28

Unless you are starting a new classification (new tree of life?) or building some sort of public database, 30K sequences is completely unnecessary. For just about any other purpose I can think of, that many sequences is overkill. For publications or grants, it is not practical to inspect trees that have more than a few hundred branches, and even those would have to be collapsed into groups.

Your purpose for doing this aside, it will be difficult to get this tree to converge. With IQ-TREE in ultrafast bootstrap mode (a minimum of 1000 replicates, which may not be enough for you) and 20-40 CPUs, it takes half a day for a protein alignment of ~150 sequences that are ~15,000 residues each. This may give you some idea of the time needed when you scale it up to what you have, and I don't think it scales up linearly.
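
One rough way to see why the scaling is unlikely to be linear: the number of possible unrooted binary topologies on n labelled tips is (2n-5)!!, which grows far faster than n itself. The sketch below compares that count (on a log10 scale) for 150 and 30,000 tips. It only illustrates the size of the search space; it is not a runtime prediction, since heuristic searches explore a tiny fraction of these trees, and the function name is just for illustration.

```python
import math


def log10_unrooted_topologies(n_taxa: int) -> float:
    """Return log10 of (2n-5)!!, the number of unrooted binary
    tree topologies on n_taxa labelled tips (n_taxa >= 3)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    # (2n-5)!! = 1 * 3 * 5 * ... * (2n-5); sum logs to avoid huge integers.
    return sum(math.log10(k) for k in range(1, 2 * n_taxa - 4, 2))


if __name__ == "__main__":
    for n in (150, 30_000):
        print(f"{n} tips: about 10^{log10_unrooted_topologies(n):.0f} topologies")
```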

If you still want to do it, you may want to give this a look:

https://cme.h-its.org/exelixis/web/software/examl/index.html