5.7 years ago by
Boston, United States
As best I can tell, this is a new feature arising from bwa mem's ability to generate chimeric alignments. This is where one read aligns jointly to multiple positions in the reference genome, for example the first half of the read to somewhere on chr1 and the second half to somewhere on chr2. Note that this is different from a multi-mapping read, where the entire read may be mapped to multiple places.
To handle the split-read case, bwa mem will generate a separate SAM record (line in the SAM file) for each aligning segment of a read. So if, for example, the first read of a pair gets split into two mapping segments, you could have three lines in the SAM file from that read pair (say two from the first read, one from the second). I believe it is possible for this to happen and still have all records marked as properly paired (flag 0x2 set in all SAM records), if orientation and insert size constraints are fulfilled. If this happens, you could get the odd numbers you observe.
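To make the arithmetic concrete, here is a minimal sketch of that three-record case. The flag bit values come from the SAM specification; the read name and the particular flag combinations are made up for illustration and are not taken from real bwa mem output.

```python
# SAM flag bits used here, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
READ1 = 0x40
READ2 = 0x80
SUPPLEMENTARY = 0x800

# Hypothetical flags for one read pair whose first read was split in two.
records = [
    ("pairA", PAIRED | PROPER_PAIR | READ1),                  # first read, representative segment
    ("pairA", PAIRED | PROPER_PAIR | READ1 | SUPPLEMENTARY),  # first read, second segment
    ("pairA", PAIRED | PROPER_PAIR | READ2),                  # second read
]

# Count records with 0x2 set, the way a simple per-record tally would.
properly_paired = sum(1 for _, flag in records if flag & PROPER_PAIR)
print(properly_paired)  # 3
```

One pair contributes three properly-paired records here, so a total over many pairs can come out odd.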
The SAM spec has evolved to include a new flag, 0x800, that denotes the supplementary records (all but the first, defined arbitrarily I think) in a multi-part (chimeric) alignment. I predict that if you first remove records with the 0x800 flag set and then run flagstat, you will get an even number for the properly-paired count.
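The prediction above can be checked with a rough sketch like the following, which counts 0x2 records in SAM text with and without the 0x800 records. The function name `count_properly_paired` and the example records are hypothetical; it assumes plain tab-separated SAM lines with FLAG in the second column.

```python
PROPER_PAIR = 0x2
SUPPLEMENTARY = 0x800

def count_properly_paired(sam_lines, skip_supplementary=False):
    """Count SAM records with 0x2 set, optionally ignoring 0x800 records."""
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if skip_supplementary and flag & SUPPLEMENTARY:
            continue
        if flag & PROPER_PAIR:
            count += 1
    return count

# Made-up records for one pair whose first read is split across chr1 and chr2.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "pairA\t99\tchr1\t100\t60\t50M50S\t=\t300\t250\t*\t*",
    "pairA\t2147\tchr2\t500\t60\t50H50M\tchr1\t300\t0\t*\t*",  # 2147 = 99 + 0x800
    "pairA\t147\tchr1\t300\t60\t100M\t=\t100\t-250\t*\t*",
]

print(count_properly_paired(example))                           # 3
print(count_properly_paired(example, skip_supplementary=True))  # 2
```

On a real file, the same filtering can probably be done with something like `samtools view -b -F 0x800 in.bam` (`in.bam` is a placeholder name) before running flagstat on the result.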
A note for completeness: flagstat just does very simple counts of how many SAM records have various flag fields set. The flag values depend completely on what the originating aligner decided to do. Records "properly paired" are just those with flag 0x2, and records "with itself and mate mapped" are just those records where neither flag 0x4 nor flag 0x8 is set. Furthermore, I believe
samtools currently ignores the new