Has anyone already tried MIT CORA for read alignment, or does anyone have thoughts on compressive mapping in general?
For example, what are the potential drawbacks, and how much impact is this likely to have on read alignment?
MIT CORA should offer a significant speedup for read alignment. It does this by first efficiently compressing the input FASTQ files and then having only the non-redundant sequence data aligned by existing read aligners such as BWA. The output is still standard BAM.
Because only the non-redundant sequence data is compressed and aligned, the method should scale sub-linearly with increasing input size. The more data you have, the higher the speedup, because the amount of non-redundant data stays roughly the same.
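To make the idea concrete, here is a minimal sketch of how I picture it. This is not CORA's actual algorithm: it only collapses exactly identical reads, and `toy_align` is a hypothetical stand-in for a real aligner such as BWA. All function names are my own.

```python
from collections import defaultdict


def collapse_reads(reads):
    """Group read names by identical sequence (the 'compression' step)."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq].append(name)
    return groups


def toy_align(seq, reference):
    """Hypothetical stand-in for a real aligner: exact-match position, or -1."""
    return reference.find(seq)


def compressive_map(reads, reference, align=toy_align):
    """Align each unique sequence once, then copy the result to every read."""
    groups = collapse_reads(reads)
    # One aligner call per unique sequence, not per read.
    positions = {seq: align(seq, reference) for seq in groups}
    # Expand back to one result per original read.
    return {name: positions[seq]
            for seq, names in groups.items()
            for name in names}


if __name__ == "__main__":
    reference = "ACGTACGGTTCA"
    reads = [("r1", "ACGG"), ("r2", "GTTC"), ("r3", "ACGG")]
    print(compressive_map(reads, reference))
```

With three reads but only two distinct sequences, the aligner is called twice. As coverage goes up, the number of reads grows while the number of distinct sequences grows much more slowly, which is where the sub-linear scaling would come from.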
The paper also states that the sensitivity and specificity of read alignment stay almost the same.
Compressive mapping for next-generation sequencing
The software is developed by the Bonnie Berger group at MIT. There is also this quote related to the software and the work they are doing:
As biotech enters the age of massive data analytics, we gain the ability to reveal biological phenomena and personalize medicine. But large, noisy data provide new challenges of scale and precision. To enable efficient and effective analyses, we need to develop technologies that allow direct operation on compressed data by taking advantage of evolutionary constraints on their topological footprint.
Questions that I have:
- Has anyone else already validated the results (e.g. against Genome in a Bottle NA12878)?
- Would you expect this to work just as well on germline and somatic (cancer) sequencing data? There are of course many more novel mutations / haplotypes / DNA reads in cancer.
- Doesn't the computational bottleneck just shift to converting to and from the non-redundant data, i.e. redundant FASTQ -> non-redundant data / alignment -> redundant BAM?
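To put the last question in numbers, here is a rough back-of-the-envelope cost model. The per-read costs are made-up placeholders, not measurements, and the function name is my own. It assumes collapsing and expanding each touch every read once, while alignment only touches the unique sequences.

```python
def estimated_speedup(n_reads, n_unique,
                      align_cost=1.0, collapse_cost=0.01, expand_cost=0.01):
    """Ratio of plain alignment time to compressive alignment time.

    All costs are hypothetical per-read units, chosen only for illustration.
    """
    baseline = n_reads * align_cost
    compressive = (n_reads * (collapse_cost + expand_cost)
                   + n_unique * align_cost)
    return baseline / compressive


if __name__ == "__main__":
    for n_reads in (1_000, 10_000, 100_000):
        # Assume the unique set stays fixed at 500 sequences.
        print(n_reads, round(estimated_speedup(n_reads, 500), 2))
```

Under this model the speedup can never exceed `align_cost / (collapse_cost + expand_cost)`, so the conversion steps do become the bottleneck once the unique set is small relative to the input. Whether that matters in practice depends on how cheap those steps really are compared to alignment.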