Twice in just as many days, we were left scratching our heads as mainstream tools broke unexpectedly and spectacularly on what seemed to be normal tasks.
For example take the SAM file that we now fittingly named
samtools view -b http://data.biostarhandbook.com/strange/cigarbreak.sam > output.bam
thanks it's close, but no cigar,
E::bam_write1] too many CIGAR operations (95243 >= 64K for QNAME "placed") samtools view: writing to standard output failed: Value too large to be stored in data type
Turns out this particular SAM file cannot be turned into a BAM file. Suprised? Rightfully so. Why does the BAM format limit how many operations may be present in the alignment?
Summarizing the comments/answers:
- This is limitation is caused by the way the BAM format stores the value internally
16 bits short = 2^16 = 65536 max operations
- The CRAM format does not have this limitation.
Align it like it's still 1999
Here is another unexpected behavior. The hisat2 aligner will, occasionally, refuse to align a read that has an insertion in it.
wget http://data.biostarhandbook.com/strange/reference.fa wget http://data.biostarhandbook.com/strange/query.fa bwa index reference.fa hisat2-build reference.fa reference.fa
The two sequences are identical except the query has an insertion after position 49. Let's do the alignments:
bwa mem reference.fa query.fa | cut -f 1,2,6
among other things it prints:
sequence 0 49M1I134M
This is the correct answer. The query sequence has
1 insertion followed by
134 matches. There are no mismatches by the way. Now let's run the same sequence via hisat2:
hisat2 -f -x reference.fa query.fa | cut -f 1,2,6
the output is:
sequence 4 *
the query sequence is not aligned at all. Make no mistake
hisat2 is supposed to be an indel aware aligner. It fails for this particular sequence. Amusingly we detected this when we used
hisat2 to "check and validate" results produced via
Post your own "stranger things" observations as followup answers.
Turns out the cigar string limitation was already mentioned here: Splitting SAM/BAM alignments
The hisat2 example seems like a bug to me. Did you fill in a bug report?
edit: turns out the hisat2 indel issue has been reported before (Q: HISAT2 (graph index) alignment over InDels), and vivekbhr reported the issue to the authors.
edit 2: which version of hisat2? There is one fresh out of the oven - HISAT2 2.1.0 release 6/8/2017 - which maybe fixes the issue?
Fascinating looks like there is nothing new under the Biostar. Both issues have already been reported
In my observation, in general, insertions and deletions are handled correctly - the failure only occurs in some cases.
I've checked out the latest source of
hisat2, the problem still persists.
Filed an issue here: https://github.com/infphilo/hisat2/issues/129
Figuring out what to put effort into allocating memory for can be a hard problem, at the start, especially for tools written when memory and disk resources were considerably more constrained. Programming is hard work! Especially when feature creep sets in and people expect more from a format than it initially allows.