Question

10 to 1000X speedup of BWA read alignment through compressive mapping: MIT CORA

4

9.2 years ago

William ★ 5.4k

Has anyone already tried MIT CORA for read alignment? Or does anyone have thoughts on compressive mapping in general?

For example, what are the potential drawbacks, and how much impact is this likely to have on read alignment?

MIT CORA should offer a significant speedup for read alignment. It does this by first efficiently compressing the input FASTQ files, so that only the non-redundant sequence data is aligned by existing read aligners such as BWA. The output is still standard BAM.

Because only the non-redundant sequence data is aligned, the method should scale sub-linearly with increasing input data size. The more data, the higher the speedup, because the size of the non-redundant data should stay similar.
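To make the idea concrete, here is a minimal toy sketch of the round trip described above: collapse the reads, align each distinct sequence once, then expand the results back to every original read. It assumes the simplest possible form of redundancy (exact duplicate reads), which is not how CORA itself works, and `toy_align` is a hypothetical stand-in for a real aligner such as BWA.

```python
from collections import defaultdict


def collapse_reads(reads):
    """Group read names by identical sequence (exact duplicates only)."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq].append(name)
    return groups


def toy_align(seq, reference):
    """Hypothetical stand-in for a real aligner: exact substring search.

    Returns the 0-based position of the first match, or -1 if unmapped.
    """
    return reference.find(seq)


def compressive_map(reads, reference, align=toy_align):
    """Align each distinct sequence once, then expand to all reads.

    Returns (per-read results in input order, number of aligner calls).
    """
    groups = collapse_reads(reads)
    # The expensive step runs once per distinct sequence, not once per read.
    positions = {seq: align(seq, reference) for seq in groups}
    # Expansion back to the redundant, per-read output is a cheap lookup.
    expanded = [(name, positions[seq]) for name, seq in reads]
    return expanded, len(groups)


if __name__ == "__main__":
    reference = "ACGTACGGTTCA"
    reads = [("r1", "ACGG"), ("r2", "GTTC"), ("r3", "ACGG"), ("r4", "TTTT")]
    results, n_calls = compressive_map(reads, reference)
    print(results, n_calls)
```

In this toy version the number of aligner calls grows with the number of distinct sequences rather than the number of reads, which is the intuition behind the sub-linear scaling claim. The collapse and expand steps still touch every read, so how much they cost in a real implementation is a fair question.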

The paper also states that the sensitivity and specificity of read alignment stay almost the same.

Compressive mapping for next-generation sequencing

Paper: http://www.nature.com/nbt/journal/v34/n4/full/nbt.3511.html

Software: http://groups.csail.mit.edu/cb/cora/

The software is developed by the Bonnie Berger group at MIT. There is also this quote related to the software and the work they are doing:

As biotech enters the age of massive data analytics, we gain the ability to reveal biological phenomena and personalize medicine. But large, noisy data provide new challenges of scale and precision. To enable efficient and effective analyses, we need to develop technologies that allow direct operation on compressed data by taking advantage of evolutionary constraints on their topological footprint.

Questions that I have are:

Has anyone else already validated the results (e.g. against Genome in a Bottle NA12878)?
Would you expect this to work just as well on germline and somatic (cancer) sequencing data? There are of course many more novel mutations / haplotypes / DNA reads in cancer.
Doesn't the computational bottleneck just shift to converting to and from the non-redundant data, i.e. redundant FASTQ -> non-redundant data / alignment -> redundant BAM format?

bwa fastq alignment compression • 3.0k views

ADD COMMENT • link updated 9.1 years ago by Biostar 20 • written 9.2 years ago by William ★ 5.4k

2

I have tried CORA. If you need a fast mapper, I would highly recommend SNAP instead, though you will need to use GATK-HC for INDEL calling.

ADD REPLY • link 9.2 years ago by lh3 33k

0

Curious to know whether CORA works "as advertised" in terms of speedup. Sometimes these claims can be hard to verify or replicate due to differences in infrastructure.

ADD REPLY • link 9.2 years ago by GenoMax 152k

0

For my current volume of samples, BWA-MEM plus a cluster is just fine.
I was wondering more generally whether compressive mapping is the direction that read alignment is heading.

ADD REPLY • link 9.2 years ago by William ★ 5.4k

5

Many mappers compress the genome in one way or another. A few also compress reads. "Compressive mapping" may be a new name, but it is not a new concept. CORA is just a different way to achieve that, and it is not as good as many other mappers in my view. At least for best mapping, their evaluation is actually biased. As CORA can only do edit-distance-based mapping allowing up to 4 diffs, the authors ask other mappers to perform the same type of mapping, a type that is not of much use in practice.
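For readers unfamiliar with the term, "edit-distance-based mapping allowing up to 4 diffs" can be sketched as follows. This is a minimal illustration using a textbook unit-cost Levenshtein distance, not CORA's code; the function names and the `max_diffs` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or mismatch
        prev = curr
    return prev[-1]


def within_diffs(read, ref_window, max_diffs=4):
    """True if the read matches the reference window within max_diffs edits."""
    return edit_distance(read, ref_window) <= max_diffs
```

Under this model every mismatch, inserted base, and deleted base costs one edit, so a gap of length n counts as n differences.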

ADD REPLY • link 9.2 years ago by lh3 33k