Twice in just as many days, we were left scratching our heads as mainstream tools broke unexpectedly and spectacularly on what seemed to be normal tasks.
For example take the SAM file that we now fittingly named
samtools view -b http://data.biostarhandbook.com/strange/cigarbreak.sam > output.bam
thanks it's close, but no cigar,
E::bam_write1] too many CIGAR operations (95243 >= 64K for QNAME "placed") samtools view: writing to standard output failed: Value too large to be stored in data type
Turns out this particular SAM file cannot be turned into a BAM file. Suprised? Rightfully so. Why does the BAM format limit how many operations may be present in the alignment?
Summarizing the comments/answers:
- This is limitation is caused by the way the BAM format stores the value internally
16 bits short = 2^16 = 65536 max operations
- The CRAM format does not have this limitation.
Align it like it's still 1999
Here is another unexpected behavior. The hisat2 aligner will, occasionally, refuse to align a read that has an insertion in it.
wget http://data.biostarhandbook.com/strange/reference.fa wget http://data.biostarhandbook.com/strange/query.fa bwa index reference.fa hisat2-build reference.fa reference.fa
The two sequences are identical except the query has an insertion after position 49. Let's do the alignments:
bwa mem reference.fa query.fa | cut -f 1,2,6
among other things it prints:
sequence 0 49M1I134M
This is the correct answer. The query sequence has
1 insertion followed by
134 matches. There are no mismatches by the way. Now let's run the same sequence via hisat2:
hisat2 -f -x reference.fa query.fa | cut -f 1,2,6
the output is:
sequence 4 *
the query sequence is not aligned at all. Make no mistake
hisat2 is supposed to be an indel aware aligner. It fails for this particular sequence. Amusingly we detected this when we used
hisat2 to "check and validate" results produced via
Post your own "stranger things" observations as followup answers.