Multiple Sequence Alignment Of Thousands Of Proteins
10.3 years ago
Dror ▴ 280

I want to track the evolution of several domains, and to do so I need to align and cluster thousands of sequences. Is that possible, and what is the best software to use for it? Eventually I want to understand which is the most "basal" sequence, which might lead me to the most ancient protein containing this sequence.

alignment clustering evolution domain • 6.3k views
10.3 years ago

"mafft --auto" is stable for up to hundreds of thousands of proteins and produces reasonable alignments: http://mafft.cbrc.jp/alignment/software/
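For anyone scripting this, here is a minimal Python sketch of such a run. It assumes `mafft` is installed and on the PATH. The helper names and the file names in the usage comment are hypothetical, not part of MAFFT. MAFFT prints the alignment to standard output, so the wrapper redirects that to a file.

```python
import subprocess
from pathlib import Path


def build_mafft_command(fasta_path):
    """Return the argument list for `mafft --auto <input>`."""
    return ["mafft", "--auto", str(fasta_path)]


def run_mafft(fasta_path, out_path):
    """Align fasta_path with MAFFT and write the alignment to out_path.

    Assumes the `mafft` executable is available on the PATH.
    """
    # MAFFT writes the alignment to stdout and progress messages to stderr.
    with Path(out_path).open("w") as out:
        subprocess.run(build_mafft_command(fasta_path), stdout=out, check=True)


# Hypothetical usage (file names are placeholders):
# run_mafft("domains.fasta", "domains.aln.fasta")
```

With `check=True` the call raises `subprocess.CalledProcessError` if MAFFT exits with an error, so a failed run does not pass silently with an empty alignment file.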


Just as an example, the 10 biggest alignments in the Ensembl Families contain ~50000, 20000, 14000, 12000, 10000, 9200, 9100, 7800, 7500 and 6800 sequences, all aligned with mafft --auto.


Just as an example, the biggest Ensembl Families are aligned with mafft auto and they are big:

+----------+-----------+
| count(*) | family_id |
+----------+-----------+
|    54909 |         1 |
|    19735 |         2 |
|    14461 |         4 |
|    12625 |         5 |
|    10452 |         3 |
|     9223 |         6 |
|     9178 |        57 |
|     7842 |         9 |
|     7568 |         7 |
|     6810 |         8 |
+----------+-----------+

10.3 years ago
Liam Thompson ▴ 140

Have you tried MUSCLE? I've only used it for hundreds of sequences, and it produced a good alignment in good time. I think with a cluster or a beefy desktop it would probably work nicely.

10.3 years ago

Hi Dror,

I am not aware of any application that accepts thousands of sequences and still aligns them with good accuracy. Fast Statistical Alignment (FSA, http://fsa.sourceforge.net/) seems to accept a few hundred sequences; I am not sure exactly how many, or whether it will produce an accurate alignment. But if you really want to align that many sequences, why not partition the dataset, align the partitions separately and then combine the alignments? I guess that will give you better alignments and be less time-consuming. I will let you know if I find any application that meets your requirements.
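The partitioning step could be sketched in Python as below. This is only an illustration: the function names are hypothetical, and the split is by input order purely as a placeholder. In practice you would probably want to group similar sequences together first. Combining the sub-alignments afterwards would need a profile or merge step in whichever aligner is used, which is not shown here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition(records, size):
    """Split a list of records into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [records[i:i + size] for i in range(0, len(records), size)]
```

Each batch can then be written back out as its own FASTA file and aligned separately.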

Cheers, Kartik

10.3 years ago
Andreas ★ 2.5k

Even though this might be considered shameless advertisement:

The new version of Clustal (Clustal Omega) is able to cope with this amount of (and many more) sequences when using the --mbed flag. See the announcement on the Clustal Homepage. It's currently a protein-only, command-line only, Unix-only, pre-publication beta version :)

Andreas
