Question: Fasta File Alignment with >2M sequences
FionaK78 wrote 9 months ago:

Hi, I would appreciate any advice, or links if this has been discussed before. I merged multiple FASTA files into one big FASTA file (>2M sequences). When I try to align it with MAFFT, it takes too long and then produces an empty (0-byte) output file. I also tried MUSCLE, but it appears it cannot handle FASTA files with this many sequences. What are my options for aligning so many sequences?

Many thanks, Fiona

Mensur Dlakic wrote 9 months ago:

Can't imagine what you are expecting to gain by aligning 2M sequences that you would not get from 50K or 100K sequences. It would be like getting accurate GPS coordinates for every cricket on Earth. Maybe it can be done, but I can't imagine that anyone would look at all the individual data points when global understanding of the cricket population is probably more useful. I have done my share of looking at large protein families, and can't think of a single one where even 200K sequences are required to properly represent the diversity within a group, let alone 2M. Unless you are trying to break a record, I suggest you reconsider.

MUSCLE actually handles a large number of sequences just fine, but not 2M. The same is true for MAFFT and ClustalO, but again not 2M. My suggestion is to first reduce the redundancy of your sequences, say down to 40-50% sequence identity. That can be done using MMseqs2 or CD-HIT. If you knock down the number of sequences by 10-15x, the alignment should be doable with both programs you have already tried.
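A minimal sketch of the redundancy-reduction step, assuming protein sequences and a CD-HIT installation on the PATH. The file names (merged.fasta, merged_nr50.fasta) and the helper names are hypothetical. The word-size choice follows the ranges recommended in the CD-HIT user guide for protein clustering; the script only builds and prints the command, so you can check it before running it.

```python
import shlex


def cdhit_word_size(identity):
    """Word size (-n) the CD-HIT user guide recommends for a protein identity threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not cluster proteins below 40% identity")


def cdhit_command(in_fasta, out_fasta, identity=0.5):
    """Build (but do not run) a cd-hit command line as a list of arguments."""
    return [
        "cd-hit",
        "-i", in_fasta,                 # input FASTA (hypothetical file name)
        "-o", out_fasta,                # representative sequences are written here
        "-c", f"{identity:.2f}",        # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),
        "-M", "0",                      # 0 = no memory limit
        "-T", "0",                      # 0 = use all available CPUs
    ]


# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(cdhit_command("merged.fasta", "merged_nr50.fasta", 0.5)))
```

To actually run it, pass the list to subprocess.run(..., check=True). With MMseqs2, a roughly equivalent call would be "mmseqs easy-cluster merged.fasta clusterRes tmp --min-seq-id 0.5", which writes the representatives to clusterRes_rep_seq.fasta.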

If you still want to align >2M sequences, it can be done once you build an alignment of a smaller, representative group. Your small alignment can be used as a seed to build a hidden Markov model, and this model in turn can align the remaining sequences. (Did I already mention that nobody would look at an alignment that has 2M sequences?) This is how Pfam makes its large alignments (though I am pretty sure not 2M large) using HMMER.
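The seed-and-align route could look roughly like the sketch below, assuming HMMER 3 is installed and that the representative sequences have already been aligned (for example with MAFFT) into a file that hmmbuild can read. All file names and the helper name are hypothetical. As above, the script only prints the two commands rather than running them.

```python
import shlex


def hmmer_seed_commands(seed_alignment, all_sequences,
                        hmm_path="seed.hmm", out_alignment="all_aligned.sto"):
    """Build (but do not run) the hmmbuild and hmmalign command lines."""
    # Step 1: build a profile HMM from the small seed alignment.
    build = ["hmmbuild", hmm_path, seed_alignment]
    # Step 2: align the full sequence set to that profile; -o names the output file.
    align = ["hmmalign", "-o", out_alignment, hmm_path, all_sequences]
    return [build, align]


# Print both commands so they can be inspected or pasted into a shell.
for cmd in hmmer_seed_commands("representatives_aligned.fasta", "merged.fasta"):
    print(shlex.join(cmd))
```

hmmalign writes Stockholm format by default, and it aligns every input sequence to the model whether or not it really belongs to the family, so it is worth filtering out unrelated sequences first.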
