Has anyone already tried MIT CORA for read alignment? Or does anyone have ideas about compressive mapping in general,
i.e. potential drawbacks, or how much impact it will probably have on read alignment?
MIT CORA should offer a significant speed-up for read alignment. It does this by first efficiently compressing the input FASTQ files, so that only the non-redundant sequence data is aligned by existing read aligners such as BWA. The output is still standard BAM.
Because only the non-redundant sequence data is compressed and aligned, the method should scale sublinearly with increasing input data size. The more data, the higher the speed-up, because the amount of non-redundant data stays roughly the same.
The paper also states that the sensitivity and specificity of read alignment stay almost the same.
Compressive mapping for next-generation sequencing
The software is developed by the Bonnie Berger group at MIT. There is also this quote related to the software and the work they are doing:
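To make the idea described above concrete, here is a toy sketch in Python. It is not CORA's actual implementation: it only collapses exact duplicate reads, and `toy_align` is a hypothetical stand-in for a real aligner such as BWA. All function names, read names, and sequences are made up for illustration. It shows the three steps: collapse redundant reads, align each distinct sequence once, then expand back to one record per original read.

```python
from collections import defaultdict


def collapse_reads(reads):
    """Group read names by identical sequence (toy notion of redundancy)."""
    groups = defaultdict(list)
    for name, seq in reads:
        groups[seq].append(name)
    return groups


def toy_align(seq, reference):
    """Hypothetical stand-in for a real aligner: exact-match position, or -1."""
    return reference.find(seq)


def compressive_map(reads, reference, align=toy_align):
    groups = collapse_reads(reads)
    # Align every distinct sequence exactly once.
    hits = {seq: align(seq, reference) for seq in groups}
    # Expand back to one result per original read.
    return {name: hits[seq] for seq, names in groups.items() for name in names}


reference = "ACGTACGTTAGCCGATAGGCTA"
reads = [
    ("r1", "TAGCCG"),
    ("r2", "TAGCCG"),
    ("r3", "GATAGG"),
    ("r4", "TAGCCG"),
    ("r5", "TTTTTT"),
]
result = compressive_map(reads, reference)
n_alignments = len(collapse_reads(reads))  # 3 aligner calls for 5 reads
```

In this toy case the aligner is called 3 times for 5 reads. The claimed sublinear scaling rests on that ratio improving as more reads are added, while the collapse and expand steps still have to touch every read.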
As biotech enters the age of massive data analytics, we gain the ability to reveal biological phenomena and personalize medicine. But large, noisy data provide new challenges of scale and precision. To enable efficient and effective analyses, we need to develop technologies that allow direct operation on compressed data by taking advantage of evolutionary constraints on their topological footprint.
Questions that I have are:
- Did someone else already validate the results (i.e. vs. Genome in a Bottle NA12878)?
- Would you expect this to work just as well on germline and somatic (cancer) sequencing data? There are of course many more novel mutations / haplotypes / DNA reads in cancer.
- Doesn't the computational bottleneck just shift to converting to and from the non-redundant data, i.e. redundant FASTQ -> non-redundant data / alignment -> redundant BAM format?
I have tried CORA. If you need a fast mapper, I would highly recommend SNAP instead, though you will need to use GATK-HC for INDEL calling.
Curious to know whether CORA works "as advertised" in terms of speed-up? Sometimes these claims can be hard to verify or replicate due to differences in infrastructure.
For my current volume of samples, BWA-MEM plus a cluster is just fine.
I was more wondering whether compressive mapping is the general direction that read alignment is heading.
Many mappers compress the genome in one way or another. A few also compress reads. "Compressive mapping" may be a new name, but it is not a new concept. CORA is just a different way to achieve that, and in my view it is not as good as many other mappers. At least for best mapping, their evaluation is actually biased. As CORA can only do edit-distance-based mapping allowing up to 4 diffs, the authors made the other mappers perform the same type of mapping, a type that is not of much use in practice.
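For readers unfamiliar with what "edit-distance-based mapping allowing up to 4 diffs" means, here is a minimal Python sketch of the acceptance criterion. It is a generic textbook Levenshtein check with an early exit, not code from CORA or any other mapper; the function name and the `max_diffs` parameter are hypothetical. A candidate placement is accepted only if the read can be turned into the reference window with at most 4 substitutions, insertions, or deletions.

```python
def within_edit_distance(a, b, max_diffs=4):
    """Return the edit distance between a and b if it is <= max_diffs, else None."""
    if abs(len(a) - len(b)) > max_diffs:
        return None  # the length difference alone already exceeds the budget
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        if min(cur) > max_diffs:
            return None  # row minimum never decreases, so give up early
        prev = cur
    return prev[-1] if prev[-1] <= max_diffs else None


same = within_edit_distance("ACGTACGT", "ACGTACGT")
one_sub = within_edit_distance("ACGTACGT", "ACGAACGT")
one_del = within_edit_distance("ACGTACGT", "ACGACGT")
too_far = within_edit_distance("AAAAAAAA", "TTTTTTTT")
```

Under this criterion, reads with long indels or soft-clipped ends are simply rejected, whereas mappers such as BWA-MEM would normally report a local alignment for them, which is why forcing every mapper into this mode can skew a comparison.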