My last question has led me to the assumption that the CIGAR string in SAM/BAM files is possibly not very well defined. In short: you cannot calculate a string distance (e.g. Levenshtein distance) from a CIGAR string alone, and therefore the sequence similarity within the aligned region cannot be computed from it.
The reason for this is quite trivial: CIGAR doesn't differentiate between matches and mismatches.
According to the SAM format specification, the M character in a CIGAR string is defined as

    M    alignment match (can be a sequence match or mismatch)

so it refers to an aligned position, not to a match (identical base). For example, 10M could mean 10 matches, 9 matches + 1 mismatch, 8 matches + 2 mismatches, etc.
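To make the ambiguity concrete, here is a minimal Python sketch. The helper names `gapless_cigar` and `count_mismatches` and the toy sequences are made up for illustration; the sketch only covers gapless alignments of equal length.

```python
def gapless_cigar(read: str, ref: str) -> str:
    """Return the classic CIGAR for a gapless, equal-length alignment.

    Every aligned column is reported as M, whether the bases agree or not.
    """
    if len(read) != len(ref):
        raise ValueError("this sketch only handles gapless, equal-length alignments")
    return f"{len(read)}M"


def count_mismatches(read: str, ref: str) -> int:
    """Count the columns where read and reference bases differ."""
    return sum(1 for a, b in zip(read, ref) if a != b)


ref = "ACGTACGTAC"
perfect = "ACGTACGTAC"  # identical to the reference
two_off = "ACGAACGTTC"  # differs from the reference at two positions

for read in (perfect, two_off):
    print(gapless_cigar(read, ref), count_mismatches(read, ref))
```

Both reads get the same CIGAR, although one has no mismatches and the other has two, so the number of mismatches cannot be recovered from the CIGAR string by itself.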
In my humble opinion, this renders the CIGAR pretty much useless for representing an alignment. To address this, it seems that MD tags have been introduced, but they just make the whole thing more complex and cumbersome.
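For reference, this is roughly what recovering the mismatch count from an MD tag looks like. It is a sketch, assuming the MD value is a sequence of match-run lengths, single reference bases for mismatches, and `^`-prefixed reference bases for deletions; the function name `md_counts` and the example values are my own. Insertions and clipped bases do not appear in MD, so a full reconstruction still needs the CIGAR as well.

```python
import re

# One token is either a run of matches, a deletion (^ plus bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def md_counts(md: str) -> tuple[int, int, int]:
    """Return (matches, mismatches, deleted_bases) from an MD tag value."""
    matches = mismatches = deleted = 0
    for run, deletion, mismatch in MD_TOKEN.findall(md):
        if run:
            matches += int(run)
        elif deletion:
            deleted += len(deletion)
        else:
            mismatches += 1
    return matches, mismatches, deleted


print(md_counts("10"))     # 10 aligned bases, all identical
print(md_counts("3T4A1"))  # same 10M CIGAR, but two mismatches
```

So two reads that share the CIGAR 10M are only told apart by their MD values, which is the extra parsing step I find cumbersome.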
I don't know how this could have been overlooked in the design, or if it was done on purpose to keep the string compact. Anyway, I see this as a design flaw that should be corrected. Doing that in the definition is easy: let M denote matched positions only, while X (which is already in the definition) must be used to denote mismatches, such that 10M = 10 matches, 9M1X = 9 matches followed by 1 mismatch, 5M1X4M = 5 matches, 1 mismatch, 4 matches, and so on.
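Under that proposed convention, the similarity could be read straight off the CIGAR. A minimal sketch, assuming M means "identical base" (which is my proposal, not what the current specification says); the function name `identity_from_strict_cigar` is made up, and insertions and deletions are simply skipped here:

```python
import re

# Length followed by a single CIGAR operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")


def identity_from_strict_cigar(cigar: str) -> tuple[int, int, float]:
    """Return (matches, mismatches, identity) for a match-only-M CIGAR.

    Assumes the proposed convention: M = identical base, X = mismatch.
    All other operations are ignored in this sketch.
    """
    matches = mismatches = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "M":
            matches += int(length)
        elif op == "X":
            mismatches += int(length)
    aligned = matches + mismatches
    identity = matches / aligned if aligned else 0.0
    return matches, mismatches, identity


print(identity_from_strict_cigar("10M"))     # (10, 0, 1.0)
print(identity_from_strict_cigar("5M1X4M"))  # (9, 1, 0.9)
```

No second tag would be needed to get the number of mismatches or the identity over the aligned columns.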
Are you with me on this?