Large Scale Protein Alignment
9.1 years ago
jzabilansky ▴ 60

I am currently analysing a data set containing over 25,000 sequences and wish to align them. Is there a way to do this efficiently that won't cause the alignment program to crash because of the size of the data?

protein multiple alignment • 2.8k views

Which programs have you used? For example, have you tried Clustal Omega (http://www.clustal.org/omega/)?


Can you tell us a bit more about your 25,000 sequences? Are they all for the same gene, or for a gene family? Do you want to do global alignments or assemble them?


They are all for the same gene and I wish to do global alignments.


25,000 sequences for the same gene sounds like an awful lot. Have you considered trimming the set a bit, perhaps by extracting just the N most informative sequences? I know this can be done using t_coffee, but I'm not sure whether it is suitable for such a big data set.

9.1 years ago
Hamish ★ 3.2k

Assuming that these are protein sequences, then as Niallhaslam suggests, Clustal Omega sounds like the best option. As noted in "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", Clustal Omega has been tested with alignments of up to 200,000 sequences.
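As a rough illustration, a Clustal Omega run could be driven from Python along these lines. This is a minimal sketch, assuming the `clustalo` binary is installed and on your PATH. The file names and the thread count are placeholders, and `clustalo_command` and `run_clustalo` are hypothetical helper names, not part of Clustal Omega itself.

```python
import shutil
import subprocess


def clustalo_command(in_fasta, out_fasta, threads=4):
    """Build the argument list for a Clustal Omega run (paths are placeholders)."""
    return [
        "clustalo",
        "-i", in_fasta,           # unaligned input sequences (FASTA)
        "-o", out_fasta,          # where the alignment is written
        "--outfmt=fasta",         # aligned FASTA output
        f"--threads={threads}",   # number of threads to use
        "--force",                # overwrite an existing output file
    ]


def run_clustalo(in_fasta, out_fasta, threads=4):
    """Run Clustal Omega, failing early if the binary cannot be found."""
    if shutil.which("clustalo") is None:
        raise RuntimeError("clustalo was not found on PATH")
    subprocess.run(clustalo_command(in_fasta, out_fasta, threads), check=True)
```

Calling `run_clustalo("sequences.fasta", "aligned.fasta", threads=8)` would then start the alignment. Memory use and run time still depend on sequence length and diversity, so it is worth trying a subset first.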

However, if your sequences are DNA or RNA, I would suggest you look at MAFFT or Kalign instead, since the method used in Clustal Omega does not perform as well with nucleotide alignments (this is being worked on).
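For the nucleotide case, a MAFFT run might look something like the sketch below. It assumes the `mafft` binary is on your PATH and relies on MAFFT writing the alignment to standard output. The file names are placeholders, and `mafft_command` and `run_mafft` are hypothetical helper names.

```python
import shutil
import subprocess


def mafft_command(in_fasta, threads=4):
    """Build the argument list for a MAFFT run that picks its strategy automatically."""
    return ["mafft", "--auto", "--thread", str(threads), in_fasta]


def run_mafft(in_fasta, out_fasta, threads=4):
    """Run MAFFT and save the alignment it prints on stdout to out_fasta."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    with open(out_fasta, "w") as out_handle:
        subprocess.run(mafft_command(in_fasta, threads), stdout=out_handle, check=True)
```

With `--auto`, MAFFT chooses an alignment strategy based on the size of the input, which is a reasonable starting point for a large set.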

If your sequences are short and very similar, then other multiple sequence alignment programs, such as MUSCLE and T-Coffee, might work, although the alignment may still require a lot of memory to complete successfully.


I second Clustal Omega if these are amino acid sequences. That said, I would try some quick pruning beforehand: you probably have a lot of identical or nearly identical sequences that you can ditch.
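The pruning step for exact duplicates can be sketched in a few lines of Python. This is only an illustration: `read_fasta` and `drop_exact_duplicates` are hypothetical helper names, comparison is case-insensitive, and the first record seen for each sequence is the one kept. Nearly identical sequences are not handled here and would need a dedicated clustering tool (CD-HIT is one commonly used option).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep only the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique
```

For a real file, something like `with open("sequences.fasta") as fh: unique = drop_exact_duplicates(read_fasta(fh))` would give the reduced set, which can then be written back out and aligned.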

9.1 years ago
jomaco ▴ 200

If you wish to align those proteins to a reference assembly, you could use the exonerate (http://www.ebi.ac.uk/~guy/exonerate/) protein2genome model, which models introns. I used this when I wanted to align proteins from the TAIR10 database to our reference genome. You would also probably want to split the file into considerably smaller chunks so that many faster individual alignments can be carried out before the results are merged; this way the alignment as a whole will be much quicker.

Edit: I assumed the proteins were being aligned to a reference sequence rather than to each other (in which case this solution would not be appropriate).
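For the protein-to-genome case, the chunking idea could be sketched roughly as follows. This is an assumption-laden illustration rather than the exact procedure I used: the chunk size, file names, and the helper names `chunk_records`, `write_chunks` and `exonerate_command` are all hypothetical, and records are taken to be (header, sequence) pairs that have already been parsed from the protein FASTA file.

```python
from pathlib import Path


def chunk_records(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def write_chunks(records, out_dir, chunk_size=500, prefix="chunk"):
    """Write (header, sequence) records into numbered FASTA files; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunk_records(records, chunk_size)):
        path = out_dir / f"{prefix}_{i:04d}.fasta"
        with open(path, "w") as fh:
            for header, seq in chunk:
                fh.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths


def exonerate_command(query_fasta, genome_fasta):
    """Build an exonerate call using the protein2genome model for one chunk."""
    return [
        "exonerate",
        "--model", "protein2genome",
        "--query", str(query_fasta),
        "--target", str(genome_fasta),
    ]
```

Each chunk file can then be passed to `exonerate_command` and run as a separate job (for example on a cluster), with the per-chunk outputs concatenated at the end.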


I'm glad you made the wrong assumption, as this is exactly what I wanted! In the spirit of Stack Exchange, perhaps I should write a specific question for you to answer? Hey! I just did: How to align a protein set to a genome?

9.1 years ago
Abhiman ▴ 130

Kalign is a fast alignment program that I have used to align large numbers of sequences (~50,000). It is available here: http://msa.sbc.su.se/downloads/kalign/current.tar.gz


Can you add Kalign to SEQwiki?
