I am currently analysing a data set containing over 25,000 sequences and wish to align them. Is there a way I can do this efficiently that won't cause the alignment program to crash because of the size of the data?
Assuming that these are protein sequences you want to align, then as Niallhaslam suggests, Clustal Omega sounds like the best option. As noted in "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega", Clustal Omega has been tested with alignments of up to 200,000 sequences.
However, if your sequences are DNA or RNA, I would suggest you look at MAFFT or Kalign instead, since the method used in Clustal Omega does not perform as well with nucleotide alignments (this is being worked on).
If your sequences are short and very similar then other multiple sequence alignment programs, such as MUSCLE and T-Coffee, might work, although the alignment may still require a lot of memory to complete successfully.
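To make the choice above concrete, here is a minimal Python sketch that picks an aligner by sequence type and runs it. It assumes the `clustalo` and `mafft` executables are installed and on your PATH; the function names, the file names, and the thread count are placeholders of mine, not anything from the tools' documentation. MAFFT writes its alignment to standard output, so the sketch redirects that to the output file.

```python
import shutil
import subprocess


def build_alignment_command(in_fasta, out_fasta, seq_type, threads=4):
    """Return the command line (as a list) for the suggested aligner.

    seq_type is "protein", "dna" or "rna" (labels chosen for this sketch).
    """
    if seq_type == "protein":
        # Clustal Omega: -i input, -o output, --force overwrites an existing file.
        return ["clustalo", "-i", in_fasta, "-o", out_fasta,
                f"--threads={threads}", "--force"]
    if seq_type in ("dna", "rna"):
        # MAFFT: --auto lets it pick a strategy based on the data size.
        # The alignment goes to stdout, so out_fasta is handled by the caller.
        return ["mafft", "--auto", "--thread", str(threads), in_fasta]
    raise ValueError(f"unknown sequence type: {seq_type!r}")


def run_alignment(in_fasta, out_fasta, seq_type, threads=4):
    """Run the aligner, raising if the program is missing or exits with an error."""
    cmd = build_alignment_command(in_fasta, out_fasta, seq_type, threads)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    if cmd[0] == "mafft":
        with open(out_fasta, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)
    else:
        subprocess.run(cmd, check=True)
```

For example, `run_alignment("sequences.fasta", "aligned.fasta", "dna")` would hand the file to MAFFT. Memory use still depends on sequence length and number, so it is worth trying a subset first.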
If you wish to align those proteins to a reference assembly, you could use the exonerate (http://www.ebi.ac.uk/~guy/exonerate/) protein2genome model, which models introns. I used this when I wanted to align proteins from the TAIR10 database to our reference genome. You would also probably want to split the file into considerably smaller chunks, so that many faster individual alignments can be run before the results are merged. This way the alignment as a whole will be much quicker.
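As a rough illustration of the chunking step, here is a small Python sketch that splits a FASTA file into files of a fixed number of sequences. The function names, the chunk size of 500, and the `chunk_0000.fasta` naming scheme are my own choices for the example. Splitting like this only makes sense when each query is aligned independently to the reference, as with protein2genome; it does not apply to a multiple sequence alignment.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line and header is not None:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, out_dir, chunk_size=500):
    """Write consecutive groups of chunk_size records to separate files.

    Returns the list of chunk file paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            if i % chunk_size == 0:
                # Start a new chunk file every chunk_size records.
                if out is not None:
                    out.close()
                chunk_path = out_dir / f"chunk_{i // chunk_size:04d}.fasta"
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
            out.write(f">{header}\n{seq}\n")
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Each chunk can then be passed to a separate exonerate run (for instance one per core or cluster job) and the outputs concatenated afterwards.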
Edit: I assumed the proteins were being aligned to a reference sequence rather than to each other (in which case this solution would not be appropriate).
I'm glad you made the wrong assumption, as this is exactly what I wanted! In the spirit of stack exchange, perhaps I should write a specific question for you to answer? Hey! I just did: How to align a protein set to a genome?
Kalign is a fast alignment program, which I have used to align a large number of sequences (~50,000). It is available here: http://msa.sbc.su.se/downloads/kalign/current.tar.gz
Which programs have you used? For example, have you tried Clustal Omega (http://www.clustal.org/omega/)?
Can you tell us a bit more about your 25,000 sequences? Are they all for the same gene? A gene family? Do you want to do global alignments or assemble them?
They are all for the same gene and I wish to do global alignments.
25,000 sequences for the same gene sounds like an awful lot. Have you considered trimming the set a bit, and maybe just extracting the N most informative sequences? I know this can be done using t_coffee, but I'm not sure if that is suitable for such a big data set.
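A much cruder first pass than picking the N most informative sequences is simply to collapse exact duplicates, which are common when many sequences cover the same gene. The Python sketch below does only that; it is not what t_coffee does, and the function name and the case-insensitive comparison are my own assumptions.

```python
def collapse_identical(records):
    """Collapse records with identical sequences (ignoring case).

    records is an iterable of (header, sequence) pairs. Returns a list of
    (header, sequence, count) tuples, keeping the first header seen for each
    distinct sequence and counting how many records shared it.
    """
    seen = {}
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            seen[key][2] += 1
        else:
            seen[key] = [header, seq, 1]
    # Dicts preserve insertion order, so the output follows first appearance.
    return [tuple(entry) for entry in seen.values()]


if __name__ == "__main__":
    example = [("seq1", "ACGT"), ("seq2", "acgt"), ("seq3", "ACGA")]
    for header, seq, count in collapse_identical(example):
        print(header, seq, count)
```

Aligning only the distinct sequences, and keeping the counts so the duplicates can be added back afterwards, may shrink the problem considerably before any smarter selection is needed.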