I'm currently evaluating the use of samtools (and other tools) for SNV detection from RNA-seq data. A particular pipeline setup includes a preprocessing step to split reads containing long skipped regions into several reads (see GATK Best practices). This preprocessing is expected to reduce the number of false positive SNVs and actually improves the accuracy of samtools. However, in some cases, SNVs are lost which I would like to have retained.
I narrowed it down do the BAQ calculation (paper, additional notes) but do not understand how the "removal" of a small part of the read has such a heavy influence on the BAQ. For example, consider the alignment of the attached screenshot which makes a very good example - even though we probably do not want this variant to be retained. A total of 16 reads is aligned with three reads displaying a variant while being mapped with a long skipped region of 1394 bases. This clearly is kind of a toy example because it only has limited amounts of reads. But, the same results can be reproduced at other locations of my mapping which have a higher coverage and higher allelic frequency.
Samtools correctly piles up this position as can be seen from the following execution on the original bam file (upper track in the screenshot):
$ samtools mpileup -f $GENOME -r chr12:85277730-85277730 original.bam chr12 85277730 C 15 ,......,...,TTT FHEJJIIDJHGCDFF
After splitting the three reads containing a skipped region in the mapping (middle track in the screenshot), samtools will not display the mismatches in the pileup:
$ samtools mpileup -f $GENOME -r chr12:85277730-85277730 splitted.bam chr12 85277730 C 12 ,......,..., FHEJJIIDJHGC
except, if the Base Alignment Quality (BAQ) threshold is lowered to zero (default: 13). Now, the mismatches are reported and one can identify the very low quality of the bases (exclamation marks in the quality string):
$ samtools mpileup -f $GENOME -r chr12:85277730-85277730 -Q0 splitted.bam chr12 85277730 C 15 ,......,...,TTT FHEJJIIDJHGC!!!
Obviously, if the BAQ calculation is omitted, the mismatches are piled up correctly, too:
$ samtools mpileup -f $GENOME -r chr12:85277730-85277730 -B splitted.bam chr12 85277730 C 15 ,......,...,TTT FHEJJIIDJHGCDFF
The same observations can be made if the "overhang" of the reads is hard- or soft-clipped (see lower track in the screenshot). Furthermore, the pileup results are reproducible with samtools v0.1.19 and v1.2.
I read the writings of Li on the BAQ (paper, additional notes) and believe it to be a good idea. However, I am wondering why this small change of the read alignment ("removal" of 7 matches) has such a huge influence on the BAQ values for the mismatches. In particular, as the trimming of the 7 bases is "at the other end" of the read. Probably someone has an idea and protects me from (re-)calculating the BAQ manually to reconstruct the behavior... :-(
Thanks in advance, Manuel
Note on read splitting: a read with a skipped region is splitted by duplicating the read and hard-clipping either the part before or after the skipped region. The additional read is flagged in the BAM file as 'supplementary' to make samtools aware that the mapping of the read can not be represented in a linear alignment.