Question: Help With Multiple Whole Genome Alignment. Aligning Over 400 Whole Genomes
6.1 years ago by
agatorano40
agatorano40 wrote:

ClustalW is extremely limited when it comes to multiple whole genome alignment. I recently looked at Mugsy, which claims to be able to align a little over 30 whole genomes.

Is there a software that can align 400 whole genomes? This would be over a Gb of data.

Any help would be enormously appreciated.

ADD COMMENTlink modified 6.1 years ago by Aaron H170 • written 6.1 years ago by agatorano40

If your purpose is to find orthologs between the genomes, standalone BLAST from NCBI will help. Install via: ftp://ftp.ncbi.nih.gov/blast/ . No doubt it'll take a huge amount of time.
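One common way to turn BLAST results into candidate ortholog pairs is reciprocal best hits. The sketch below is only an illustration of that idea, not necessarily what was meant above. It assumes BLAST was run in both directions with tabular output (`-outfmt 6`, where column 1 is the query ID, column 2 the subject ID, and column 12 the bit score). The function names are made up for this example.

```python
def best_hits(blast_lines):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_lines, b_vs_a_lines):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b_lines)
    reverse = best_hits(b_vs_a_lines)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Each argument can be an open file handle or a list of lines. With 400 genomes this would have to be repeated for every genome pair, which is where the run time goes.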

ADD REPLYlink written 6.1 years ago by Nari860

I'm a little confused why you would want to do this. Assuming already assembled and annotated genomes, if whole genome alignment programs take sequences (fasta) and annotations (gff) and use both for alignment, why would you need to blast orthologs?

ADD REPLYlink written 6.1 years ago by Josh Herr5.6k

progressiveMauve is a good choice. Other helpful tools include MUMmer and a new ortholog-finding tool that uses synteny, PanOCT: sourceforge.net/p/panoct/

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by bckirkup0

Sybil and OrthoMCL use about 10 times less memory than PanOCT. http://sourceforge.net/projects/panoct/screenshots/PanOCT_memory_usage.jpg/2000/2000

ADD REPLYlink written 5.9 years ago by Nari860
6.1 years ago by
Josh Herr5.6k
University of Nebraska
Josh Herr5.6k wrote:

You haven't told us anything about your organisms or how large these genomes are. You must be talking about bacterial-sized genomes, because even aligning a handful of genomes from small eukaryotes is not a trivial task.

I think you would be crazy to use a multiple aligner like Clustal for whole genomes (I think it's too slow and too error prone for this, but that's my opinion). There are tons of options for sequence alignment; you would probably be interested in the "genomics analysis" section of this list.

On this list I really like Mauve, but my hands-down favorite is SyMAP.

Perhaps there is a better way than this suggestion, but I really like the quality of whole genome alignments I get with SyMAP. You'll need a Unix flavor (Mac OS or Linux) to run SyMAP and you'll have to set up your own MySQL database, but once you have put all your genomic data and GFF files in the correct format (not a trivial thing, GREPing 400 genomes), you should be good to go. You will set up a data analysis matrix, and SyMAP will run a pairwise genome alignment on every combination of your genomes to map synteny. This is going to take a long while (400 genomes give 79,800 unique pairs, or about 160,000 if every ordered pair is run), so I suggest doing this on a cluster. You won't be able to run an analysis like this on your laptop or desktop.
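To get a feel for the size of the job, here is a small sketch that enumerates the genome pairs and deals them into batches for separate cluster tasks. The genome names and the batch count of 100 are placeholders, and nothing here calls SyMAP itself. Counting each pair once gives 400 × 399 / 2 = 79,800 alignments; 160,000 is the size of the full 400 × 400 matrix, which includes both orders and self comparisons.

```python
from itertools import combinations
from math import comb


def pairwise_jobs(genomes):
    """Return every unordered pair of genomes (A vs B, never B vs A or A vs A)."""
    return list(combinations(genomes, 2))


def split_into_batches(jobs, n_batches):
    """Deal jobs round-robin into n_batches lists, one list per cluster task."""
    return [jobs[i::n_batches] for i in range(n_batches)]


genomes = [f"genome_{i:03d}" for i in range(1, 401)]  # hypothetical names
jobs = pairwise_jobs(genomes)
batches = split_into_batches(jobs, 100)  # 100 is an arbitrary task count

print(len(jobs), comb(len(genomes), 2))  # 79800 79800
print(len(genomes) ** 2)                 # 160000 (full matrix, incl. self pairs)
print(len(batches[0]))                   # 798 pairs per task
```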

UPDATE: I just checked out MUGSY (hadn't so far) and I have yet to install it but it looks promising. I guess you'd be looking at strains or genomes with little phylogenetic breadth with MUGSY?

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Josh Herr5.6k

Sorry we are indeed doing bacterial genomes. I am excited to hear that there may be many options. Will they be able to handle the quantity of genomes we are providing?

ADD REPLYlink written 6.1 years ago by agatorano40

I don't have an answer to the quantity issue, but you'll have to give it a try. I've aligned 36 whole eukaryotic genomes in SyMAP and haven't had a problem yet. I'm not sure about some of the other aligners. You'll probably want to select an alignment program that saves only compact alignment data as it runs (SAM, BAM, BED, etc.) rather than whole nucleotide alignments (like Clustal output).

ADD REPLYlink written 6.1 years ago by Josh Herr5.6k

Oh wow, if you aligned 36 eukaryotes it shouldn't be too terrible. Does SyMAP save data as the program progresses, as you suggested?

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by agatorano40

Data is saved in a pairwise matrix, so you align each bacterial genome to every other bacterial genome in your dataset. You should be able to determine clade-specific differences such as indel events and rearrangements. I'd be interested in taking this data and mapping these traits across phylogenies of your genomes.

ADD REPLYlink written 6.1 years ago by Josh Herr5.6k

This is exactly what we are trying to do. Phylogenetic analysis of these prokaryotes is ideal. So you haven't used SyMAP to get to the completed phylogeny yet? If so any advice?

ADD REPLYlink written 6.1 years ago by agatorano40

You'll have so many characters for phylogenetic analysis it will be crazy. I think the indels, inversions, and genome rearrangements will make a really interesting story. I haven't done anything quite like this, but I think you could use a few hundred genes in your phylogeny and then map other characters, which you could code from your alignment files (SAM, BAM, etc.) to map across the phylogeny. It would be a really cool paper and one to be cited a lot.
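As a rough sketch of what "coding characters" could look like, the snippet below turns a per-genome set of events into a presence/absence (0/1) matrix, one row per genome and one column per event. The strain names and event IDs are invented for illustration, and extracting the events from alignment output is not shown.

```python
def character_matrix(events_by_genome):
    """Build a 0/1 presence/absence string per genome over all observed events."""
    # Column order: every event seen in any genome, sorted for reproducibility.
    characters = sorted(set().union(*events_by_genome.values()))
    matrix = {
        genome: "".join("1" if c in events else "0" for c in characters)
        for genome, events in sorted(events_by_genome.items())
    }
    return characters, matrix


# Hypothetical events; real ones would come from the pairwise alignments.
events = {
    "strainA": {"indel_0001", "inversion_0003"},
    "strainB": {"indel_0001"},
    "strainC": set(),
}
characters, matrix = character_matrix(events)
for genome, row in matrix.items():
    print(genome, row)
```

Rows like these could then be scored on a tree built from a few hundred genes, as suggested above.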

ADD REPLYlink written 6.1 years ago by Josh Herr5.6k

I just saw this paper, which was provisionally published yesterday: Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. It may be of some use to you. There are probably lots of others...

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Josh Herr5.6k
6.1 years ago by
Aaron H170
United States/San Francisco/UCSF
Aaron H170 wrote:

I've gotten very good results when aligning 12 Drosophila genomes with FSA: http://fsa.sourceforge.net/FAQ.html

ADD COMMENTlink written 6.1 years ago by Aaron H170

Hi Aaron,

How much/long (ram, time, cpu) did it take to run the alignment of your 12 fly genomes with fsa?

ADD REPLYlink written 4.3 years ago by atisou0

Sorry for the delay, I should set up alerts. I followed Colin Dewey's pipeline of using Mercator to find orthologous regions, then ran FSA on an 8-core, 64 GB server. It took about three days from beginning to end.

ADD REPLYlink written 4.1 years ago by Aaron H170

Hi Aaron,

I am trying to align 13 insect genomes, so I am trying to use Mercator and FSA as you suggested. Unfortunately, I run into problems when I try to compile Mercator on Ubuntu machines (make: *** [apps/mercator/util/maskRepetitive.o] Error 1), so I would like to ask whether you had any similar problems and, if so, how you solved them.

ADD REPLYlink written 3.5 years ago by mitsias0