Help With Multiple Whole Genome Alignment. Aligning Over 400 Whole Genomes
9.8 years ago
agatorano ▴ 50

ClustalW is extremely limited when it comes to multiple whole-genome alignment. I recently looked at Mugsy, which claims to be able to align a little over 30 whole genomes.

Is there software that can align 400 whole genomes? This would be over a Gb of data.

Any help would be enormously appreciated.

If your purpose is to find orthologs between the genomes, standalone BLAST from NCBI will help. Install via: ftp://ftp.ncbi.nih.gov/blast/ . No doubt, it'll take a huge amount of time.
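
One common way to call orthologs from BLAST output is reciprocal best hits. This is an assumption on my part about how the results would be used, not something stated above. A minimal sketch, assuming two all-vs-all searches saved in BLAST tabular format (`-outfmt 6`, where columns 1, 2 and 12 are query ID, subject ID and bit score); the function names are made up for illustration:

```python
import csv


def best_hits(blast_tab):
    """Map each query ID to its highest-bit-score subject ID.

    blast_tab is an open file (or any iterable of lines) in BLAST
    tabular format: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

You would call it with two open result files, e.g. `reciprocal_best_hits(open("A_vs_B.tsv"), open("B_vs_A.tsv"))` (file names hypothetical). For 400 genomes this still means one search per genome pair, so the run-time warning above applies.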

I'm a little confused why you would want to do this. Assuming already assembled and annotated genomes, if whole genome alignment programs take sequences (fasta) and annotations (gff) and use both for alignment, why would you need to blast orthologs?

progressiveMauve is a good choice. Other helpful tools include MUMmer and PanOCT, a new ortholog-finding tool that uses synteny: sourceforge.net/p/panoct/

Sybil and OrthoMCL use 10 times less memory than PanOCT. http://sourceforge.net/projects/panoct/screenshots/PanOCT_memory_usage.jpg/2000/2000

9.8 years ago
Josh Herr 5.7k

You haven't told us anything about your organisms or how large these genomes are. You must be talking about bacterial-sized genomes, because even aligning a handful of genomes from small eukaryotes is not a trivial task.

I think you would be crazy to use a multiple aligner like Clustal for whole genomes (I think it's too slow and too error-prone for this, but that's my opinion). There are tons of options for sequence alignment; you would probably be interested in the "genomics analysis" section of this list.

On this list I really like Mauve, but my hands-down favorite is SyMAP.

Perhaps there is a better way than this suggestion, but I really like the quality of whole genome alignments I get with SyMAP. You'll need a Unix flavor (Mac OS or Linux) to run SyMAP and you'll have to set up your own MySQL database, but once you have put all your genomic data and GFF files in the correct format (not a trivial thing, GREPing 400 genomes), you should be good to go. You will set up a data analysis matrix, and with SyMAP you'll run a pairwise genome alignment on all the combinations of your genomes to map synteny. This is going to take a long while (some 80,000 pairwise alignments for 400 genomes, or about 160,000 if each pair is run in both directions), so I suggest doing this on a cluster; you won't be able to run an analysis like this on your laptop or desktop.
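
To size the job before asking for cluster time, the number of pairwise comparisons can be counted directly. A minimal sketch, assuming placeholder genome IDs; whether each pair has to be run once or in both directions depends on the aligner, and I have not checked which applies to SyMAP:

```python
from itertools import combinations
from math import comb


def pairwise_jobs(genome_ids):
    """Return an iterator over every unordered pair of genomes, each once."""
    return combinations(sorted(genome_ids), 2)


n_genomes = 400
unordered = comb(n_genomes, 2)          # each pair aligned once
ordered = n_genomes * (n_genomes - 1)   # each pair aligned in both directions
print(unordered, ordered)  # 79800 159600

# Placeholder IDs; in practice these would be your genome names or file paths.
genome_ids = [f"genome_{i:03d}" for i in range(n_genomes)]
first_pair = next(pairwise_jobs(genome_ids))
print(first_pair)  # ('genome_000', 'genome_001')
```

Each pair from `pairwise_jobs` can be handed to the scheduler as its own job, which is what makes this kind of run practical on a cluster.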

UPDATE: I just checked out Mugsy (I hadn't before). I have yet to install it, but it looks promising. I guess you'd be looking at strains or genomes with little phylogenetic breadth with Mugsy?

Sorry, we are indeed working with bacterial genomes. I am excited to hear that there may be many options. Will they be able to handle the number of genomes we are providing?

I don't have an answer to the quantity issue, but you'll have to give it a try. I've aligned 36 whole eukaryotic genomes in SyMAP and haven't had a problem yet. I'm not sure about some of the other aligners. You'll probably want to select an alignment program that only saves alignment data as the program progresses (SAM, BAM, BED, etc.) and not whole nucleotide alignment data (like a Clustal output).

Oh wow, if you aligned 36 eukaryotes it shouldn't be too terrible. Does SyMAP save data as the program progresses, as you suggested?

Data is saved in a pairwise matrix, so you align each bacterial genome to every other bacterial genome in your dataset. You should be able to determine clade-specific differences such as indel events and rearrangements. I'm interested in taking this data and then mapping these traits across phylogenies of your genomes.

This is exactly what we are trying to do. Phylogenetic analysis of these prokaryotes is ideal. So you haven't used SyMAP to get to the completed phylogeny yet? If you have, any advice?

You'll have so many characters for phylogenetic analysis it will be crazy. I think the indels, inversions, and genome rearrangements will make a really interesting story. I haven't done anything quite like this, but I think you could build your phylogeny from a few hundred genes and then code other characters from your alignment files (SAM, BAM, etc.) and map them across the phylogeny. It would be a really cool paper and one to be cited a lot.
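
To make the "code other characters" idea concrete, here is a minimal sketch of turning per-genome structural events into a presence/absence matrix that could later be mapped onto a tree. Everything in it is hypothetical: the strain names and event IDs are invented, and pulling the events out of the alignment files is not shown.

```python
def character_matrix(events_by_genome):
    """Turn {genome: set of event IDs} into binary character strings.

    Returns the sorted list of characters (one per column) and a dict
    mapping each genome to a string of '1' (present) / '0' (absent).
    """
    characters = sorted(set().union(*events_by_genome.values()))
    matrix = {
        genome: "".join("1" if c in events else "0" for c in characters)
        for genome, events in sorted(events_by_genome.items())
    }
    return characters, matrix


# Hypothetical events; real ones would come from the pairwise alignments.
events = {
    "strainA": {"inversion_1", "indel_7"},
    "strainB": {"indel_7"},
    "strainC": set(),
}
characters, matrix = character_matrix(events)
print(characters)  # ['indel_7', 'inversion_1']
for genome, row in matrix.items():
    print(genome, row)
```

A matrix like this keeps the structural characters separate from the sequence data used to build the tree, which fits the suggestion above of building the phylogeny from genes and mapping the other characters afterwards.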

I just saw this paper, which was provisionally published yesterday: Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. It may be of some use to you. There are probably lots of others...

9.8 years ago
Aaron H ▴ 170

I've gotten very good results aligning 12 Drosophila genomes with FSA (http://fsa.sourceforge.net/FAQ.html).

Hi Aaron,

How much RAM, CPU and time did it take to run the alignment of your 12 fly genomes with FSA?

Sorry for the delay, I should set up alerts. I followed Colin Dewey's pipeline, using Mercator to find orthologous regions, and then ran FSA on an 8-core, 64 GB server. It took about three days from beginning to end.

Hi Aaron,

I am trying to align 13 insect genomes, so I tried the Mercator and FSA approach you suggested. Unfortunately, I get an error when I try to compile Mercator on Ubuntu machines (make: *** [apps/mercator/util/maskRepetitive.o] Error 1). Did you run into anything similar, and if so, how did you solve it?