Question: What Is The Best Method For Aligning Two Genome Assemblies?
Aaronquinlan (7.4k, United States) wrote, 3.2 years ago:

I would like to align the contigs from the recent [1] assembly of NA12878 to the latest human genome reference sequence (hg19). I have considered using BWA-SW, BLAT and LASTZ. I would greatly prefer to use the SAM/BAM format because it will facilitate my downstream analysis. However, BWA-SW prefers query sequences in the 1-2Mb range, while this assembly has contigs in the tens of megabases. LASTZ, on the other hand, is not well-suited for aligning to many chromosomes at once. BLAT is difficult because the PSL to BAM conversion is imperfect.

Has anyone done this?

If you were to do this, what tool would you use or how would you go about it?

[1] Gnerre et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA (2011) vol. 108 (4) pp. 1513-8

modified 3.2 years ago by lh3 (20k) • written 3.2 years ago by Aaronquinlan (7.4k)

lh3 (20k) wrote, 3.2 years ago:

Probably you want to try this:

http://www.citeulike.org/group/10570/article/8403903

I would probably split long contigs into 1 Mbp chunks and use BWA-SW (I actually wanted to do this but have not had time). By the way, do they really get contigs of tens of Mbp? How long are the scaffolds/supercontigs?
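A minimal sketch of the chunking step in Python, assuming plain FASTA input. The function names and the `name__offset` chunk-naming scheme are illustrative assumptions, not part of BWA-SW; the 0-based offset is kept in the chunk name so that alignment coordinates can later be shifted back onto the original contig.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the sequence name.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_into_chunks(name, seq, chunk_size=1_000_000):
    """Yield (chunk_name, chunk_seq) for consecutive, non-overlapping chunks.

    chunk_name is "<name>__<offset>" (a hypothetical convention), where
    offset is the 0-based start of the chunk within the original sequence.
    """
    for start in range(0, len(seq), chunk_size):
        yield f"{name}__{start}", seq[start:start + chunk_size]


if __name__ == "__main__":
    demo = [">contig1 demo", "ACGTACGT", "AC"]
    for name, seq in read_fasta(demo):
        for chunk_name, chunk_seq in split_into_chunks(name, seq, chunk_size=4):
            print(f">{chunk_name}\n{chunk_seq}")
```

Each chunk can then be written as its own FASTA record and given to `bwa bwasw`. Non-overlapping chunks will cut any alignment that spans a chunk boundary, so overlapping windows may be worth considering.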

EDIT:

Perhaps also try this:

http://www.cs.utoronto.ca/~brudno/721.full.pdf

Just read the NA12878 paper. The contig N50 is 24kb. I would certainly map contigs rather than supercontigs.
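For reference, N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. A small Python sketch of that definition (the function name is illustrative):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

Computing this separately for contigs and for scaffolds shows at once which of the two a reported N50 refers to.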

EDIT2:

Aaron, have you tried Mugsy (the one described by the link above)? From my reading of the paper just now, it may need tens of CPU days to align two human assemblies. For a 1000 Genomes (1000g) request, I have mapped the NA12878 contigs using BWA-SW.

modified 2.9 years ago • written 3.2 years ago by lh3 (20k)

Ah, thanks Heng. A colleague recently mentioned Salzberg's new aligner, but I had forgotten all about it. Yes, there are 80 contigs > 10Mb and 357 > 1Mb.

written 3.2 years ago by Aaronquinlan (7.4k)

11.5Mb is the N50 of the scaffolds. The contig N50 is only 24kb. BWA-SW will not align through the holes (gaps) between contigs, so aligning contigs is preferred. Nonetheless, the whole-genome aligner may be a better choice. I do not know.

written 3.2 years ago by lh3 (20k)
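If only scaffolds are at hand, the contigs can be recovered by cutting each scaffold at its gaps. The sketch below assumes gaps are encoded as runs of `N`, which is the usual convention but should be checked for the assembly in question; the function name, the `min_gap` parameter and the `name__offset` naming are illustrative.

```python
import re


def split_scaffold(name, seq, min_gap=1):
    """Yield (contig_name, start, contig_seq) for each gap-free stretch.

    A gap is a run of at least min_gap 'N' (or 'n') characters; start is
    the 0-based offset of the contig within the scaffold.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0
    for match in gap.finditer(seq):
        if match.start() > start:
            yield f"{name}__{start}", start, seq[start:match.start()]
        start = match.end()
    # Emit the stretch after the last gap, if any sequence remains.
    if start < len(seq):
        yield f"{name}__{start}", start, seq[start:]


if __name__ == "__main__":
    for contig in split_scaffold("scaffold1", "ACGTNNNNGGNCC", min_gap=2):
        print(contig)
```

Raising `min_gap` keeps isolated ambiguous bases inside a contig instead of treating them as gaps.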

Yes, you're right. Sorry for the confused nomenclature.

written 3.2 years ago by Aaronquinlan (7.4k)

@lh3: I've just tried a simple human-mouse example: I built mouse PAX2, PAX5 and PAX8 contigs from a simulated 300x Illumina sequencing run assembled with ABySS, then tried to align the mouse contigs to human using bwa bwasw. The result is not good, even with high -Z values: "samtools view ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/bwasw/mouse.pax5.x300.contigs.fa.fasta.human.bwasw.100000.bam"

written 2.8 years ago by 2184687-1231-83-4.5k
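One rough way to put a number on how well a contig aligned is the fraction of its bases that fall in aligned columns of the CIGAR string (column 6 of `samtools view` output). This is only a sketch of one possible metric, not something used in the thread; the function name is illustrative, and the operation codes follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=XH")   # operations that account for query bases (incl. clipping)
ALIGNED_OPS = set("M=X")    # query bases placed against a reference base


def aligned_fraction(cigar):
    """Return the fraction of query bases in M/=/X operations (0.0 if unmapped)."""
    if cigar == "*":
        return 0.0
    query_len = aligned = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in QUERY_OPS:
            query_len += n
        if op in ALIGNED_OPS:
            aligned += n
    return aligned / query_len if query_len else 0.0


if __name__ == "__main__":
    print(aligned_fraction("10S80M10S"))
```

A contig whose best hit is mostly soft-clipped will score low here, which makes poor cross-species alignments easy to flag.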

If you have RNA-seq contigs, gmap and blat may be a better choice. I was mostly talking about mapping genomic sequences.

written 2.8 years ago by lh3 (20k)

These are the whole PAX genomic regions, 60-100 kb each; for example, http://www.ensembl.org/Mus_musculus/Location/View?g=ENSMUSG00000004231;r=19:44831882-44910520

written 2.8 years ago by 2184687-1231-83-4.5k