Question

Fasta File Alignment with >2M sequences

5.2 years ago

FionaK78 • 0

Hi, I'd appreciate any advice or links if this has been discussed before. I merged multiple FASTA files into one big FASTA file (>2M sequences). When I try to align it with MAFFT, it takes too long and then produces an output file of size 0. I also tried MUSCLE, but it appears it cannot handle FASTA files with this many sequences. What are my options for aligning so many sequences?

Many thanks, Fiona

alignment • 1.0k views

updated 5.2 years ago by Mensur Dlakic ★ 29k • written 5.2 years ago by FionaK78 • 0

score 0 · Answer 1 · 2020-04-16

Can't imagine what you are expecting to gain by aligning 2M sequences that you would not get from 50K or 100K sequences. It would be like getting accurate GPS coordinates for every cricket on Earth. Maybe it can be done, but I can't imagine that anyone would look at all the individual data points when global understanding of the cricket population is probably more useful. I have done my share of looking at large protein families, and can't think of a single one where even 200K sequences are required to properly represent the diversity within a group, let alone 2M. Unless you are trying to break a record, I suggest you reconsider.

MUSCLE actually handles a large number of sequences just fine, but not 2M large. The same is true for MAFFT and ClustalO, but again not 2M large. My suggestion is to first remove the redundancy in your sequences, say by clustering them down to 40-50% sequence identity and keeping one representative per cluster. That can be done using MMseqs2 or CD-HIT. If you knock down the number of sequences by 10-15x, the alignment should be doable with both programs you have tried already.
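As a rough illustration of the redundancy-removal step, here is a minimal Python sketch that only assembles the command lines for CD-HIT or MMseqs2. It assumes protein sequences (for nucleotides you would use `cd-hit-est` with different word sizes), and the file names (`all.fasta`, `nr50.fasta`, `clu`), the 0.5 identity threshold, and the thread count are placeholders of mine, not values from the question. The word sizes follow the ranges recommended in the CD-HIT user guide.

```python
import shutil
import subprocess


def cdhit_word_size(identity: float) -> int:
    """Pick the -n word size the CD-HIT guide recommends for a protein identity cutoff."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not cluster proteins below 0.4 identity")


def build_cdhit_command(in_fasta, out_fasta, identity=0.5, threads=8):
    """Command line for clustering proteins with cd-hit (file names are placeholders)."""
    return [
        "cd-hit",
        "-i", in_fasta,                          # merged input FASTA
        "-o", out_fasta,                         # representative sequences
        "-c", str(identity),                     # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),    # word size matching the threshold
        "-M", "0",                               # no memory limit
        "-T", str(threads),                      # number of threads
    ]


def build_mmseqs_command(in_fasta, out_prefix, tmp_dir="tmp", identity=0.5):
    """Command line for MMseqs2 linear-time clustering, which suits very large inputs."""
    return [
        "mmseqs", "easy-linclust",
        in_fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(identity),
    ]


def run_if_installed(cmd):
    """Run the command only if the executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        print("Not installed:", cmd[0])
        return False
    subprocess.run(cmd, check=True)
    return True
```

Calling `run_if_installed(build_cdhit_command("all.fasta", "nr50.fasta"))` would then write the representatives to `nr50.fasta`; with MMseqs2 the representatives should end up in a file named after the output prefix (`clu_rep_seq.fasta` here). Check the resulting sequence count before deciding whether the threshold needs to be tightened or relaxed.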

If you still have a desire to align >2M sequences, it can be done once you build an alignment of a smaller, representative group. Your small alignment can be used as a seed to create a hidden Markov model, and this model in turn can align the remaining sequences. (Did I already mention that nobody would look at an alignment that has 2M sequences?) This is how Pfam makes its large alignments (though I am pretty sure not 2M large) using HMMER.
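The seed-then-HMM route described above could look roughly like the following Python sketch. It is only an outline under assumptions: the file names (`reps.fasta`, `seed.aln.fasta`, `seed.hmm`, `all.fasta`, `all.sto`) are placeholders, MAFFT and HMMER are assumed to be installed, and I am assuming `hmmbuild` auto-detects the aligned FASTA that MAFFT writes (if it does not, convert the seed to Stockholm first).

```python
import subprocess


def seed_alignment_command(reps_fasta):
    """MAFFT command for the small representative set; the alignment goes to stdout."""
    return ["mafft", "--auto", reps_fasta]


def hmmbuild_command(hmm_out, seed_alignment):
    """Build a profile HMM from the seed alignment."""
    return ["hmmbuild", hmm_out, seed_alignment]


def hmmalign_command(hmm, all_fasta, out_sto):
    """Align every sequence to the profile; output is Stockholm format by default."""
    return ["hmmalign", "-o", out_sto, hmm, all_fasta]


def run_pipeline(reps_fasta, all_fasta,
                 seed_aln="seed.aln.fasta", hmm="seed.hmm", out_sto="all.sto"):
    """Run the three steps in order, stopping at the first failure."""
    # Step 1: align the representatives and capture MAFFT's stdout in a file.
    with open(seed_aln, "w") as handle:
        subprocess.run(seed_alignment_command(reps_fasta), stdout=handle, check=True)
    # Step 2: turn the seed alignment into a profile HMM.
    subprocess.run(hmmbuild_command(hmm, seed_aln), check=True)
    # Step 3: align the full set against the profile.
    subprocess.run(hmmalign_command(hmm, all_fasta, out_sto), check=True)
    return out_sto
```

Something like `run_pipeline("reps.fasta", "all.fasta")` would run the whole chain. Note that a profile-based alignment only places residues that match the model's columns reliably, so the quality of the final alignment depends heavily on how representative the seed is.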