Question: Should We Attempt To Redefine The Cigar String In Sam/Bam Format?
gravatar for Michael Dondrup
5.5 years ago by
Bergen, Norway
Michael Dondrup42k wrote:

My last question has led me to assumption that the CIGAR string in SAM/BAM files is possibly not very well-defined. Summarized: you cannot calculate a string-difference (e.g. Levenshtein distance) from a CIGAR string and therefore, the sequence-similarity within the aligned region cannot be computed.

The reason for this is quite trivial, CIGAR doesn't differentiate matches and mismatches: According to the SAM format specification the M character in a CIGAR

M alignment match (can be a sequence match or mismatch)

refers to the aligned region not to a match (identical base), such that for example 10M could mean 10 matches, 9 matches + 1 mismatch, 8 matches+2mismatches, etc.

In my humble opinion, this renders the CIGAR pretty much useless to represent an alignment. To address this, it seems that the MD=tags have been introduced, but they just make the whole thing more complex and cumbersome.

I don't know how this could have been overlooked in the design, or if it was done on purpose to keep the string compact. Anyway, I see this as a design flaw, that should be corrected. To do that in the definition is easy, let M denote matched positions only, while X (which is already in the definition) must be used to denote mismatches, such that 10M = 10 matches, 9M1X = 9M followed by 1 mismatch, 5M1X4M, 5 matches, 1 mismatch, 4 matches, and so on.

Are you with me in this?

cigar format samtools sam • 7.3k views
ADD COMMENTlink modified 5.5 years ago by lh330k • written 5.5 years ago by Michael Dondrup42k

If M stood for gender it would mean "non-hermaphrodite". You would have to look up male/female in the MD field.

ADD REPLYlink written 5.5 years ago by Jeremy Leipzig17k

This has certainly been a pet peeve of mine!

ADD REPLYlink written 5.5 years ago by Mikael Huss4.5k

Hear hear. The fact that you need to look in two optional fields in order to recreate an alignment is really evil.

ADD REPLYlink written 5.5 years ago by Martin A Hansen3.0k

Yes, it is evil maasha (not the root of all maybe), I wanted to provoke a discussion :) Let's say, it makes things more complex and error prone, and these fields don't always exist. As you saw in my related question it didn't work for me.

ADD REPLYlink written 5.5 years ago by Michael Dondrup42k

do it for science!

ADD REPLYlink written 5.5 years ago by Damian Kao14k

Ok then, now the real work starts ;) This of course would have large implications on all software packages using this format.

ADD REPLYlink written 5.5 years ago by Michael Dondrup42k

Jeremy, can you explain this a bit ???

ADD REPLYlink written 5.5 years ago by Michael Dondrup42k

@Michael I think that's just a joke on the definition of M meaning match or mismatch

ADD REPLYlink written 5.5 years ago by Istvan Albert ♦♦ 73k
gravatar for lh3
5.5 years ago by
United States
lh330k wrote:

The CIGAR format was invented in exonerate. The original CIGAR only has three operators: M, I and D. Conceptually it is not inconsistent to mix a sequence match/mismatch. M means an alignment match, of the same position of I and D (alignment insertion and deletion). Back to those days, psl, lav, exonerate and ssaha-cigar do not distinguish sequence match/mismatch, either. To an alignment, how to place gaps is essential and where are the mismatches is inferred information; even if you know the mismatching positions, you are yet to know the reference base.

About two years ago, someone (sorry, I have changed my laptop and cannot get the name of this developer from the email archive now) proposed to distinguish sequence match/mismatch in SAM. Most of others in samtools-devel thought this is a convenient idea, which finally led to the addition of =/X. "M" is not changed and this is important for three reasons. The first has been explained well by Jared: CIGAR is the minimal representation of an alignment. Myself will also strongly object to forcing everyone to use a verbose representation =/X. That is against elegancy. Secondly, reusing "M" breaks backward compatibility, while backward compatibility is essential to a format. Thirdly, requiring to distinguish a sequence match and a mismatch will make it harder to convert traditional formats such as psl/lav/exonerate/... In all, anyway we should keep "M".

As we have "=/X" now, what is the use of them? I can only think of two use cases: 1) eyeballing the mismatching positions and 2) computing the edit distance. For the former, MD works and an alignment viewer serves the goal even better when the alignment is sorted. For the latter, the NM tag is the way to go. Nearly all the mainstream mappers output this tag and both Picard and SAMtools compute it as well. For variant calling and read counting, =/X are mostly irrelevant.

The =/X operators have been added for about two years, but they are rarely seen in practice. This already implies that this is not an indispensable feature. SAM/BAM has flaws and a few are serious, but no, not forcing "=/X" is not one of them. If you feel unhappy, write a "M->=/X" converter and keep your alignment that way.

ADD COMMENTlink written 5.5 years ago by lh330k
gravatar for Jts
5.5 years ago by
Jts1.1k wrote:

I don't think this is a good idea. BAM is designed to be a compact representation with minimal data redundancy. If you require the mismatching bases are encoded in the CIGAR string, you will pay a cost of 4 bytes for every base calling error and SNP in your data set. This is a huge cost to pay for information that is easily recovered given the (current) CIGAR strings and the reference.

The current spec allows you to write CIGAR strings with X/= operators if you wish. I think it would be a mistake to require their use within the spec.

ADD COMMENTlink written 5.5 years ago by Jts1.1k

While there might be some benefit I don't think the the 4 bytes/SNP is the main reason. Most of the time you will end up with a more verbose MD tag anyhow so that makes most alignment files larger than what they should be. This is akin to saying why even store the read in the first place, now that would save some space - after all if you have a CIGAR string you can always get the alignment.

ADD REPLYlink written 5.5 years ago by Istvan Albert ♦♦ 73k

@Istvan my argument is not that it is too expensive to store that information but that you shouldn't be forced to store it.

ADD REPLYlink written 5.5 years ago by Jts1.1k

@Michael I know samtools will handle X/= but I do not know about other tools.

ADD REPLYlink written 5.5 years ago by Jts1.1k

That is a pretty good argument, I assume this is the reason for the curent design. But is there any software that supports this output?

ADD REPLYlink written 5.5 years ago by Michael Dondrup42k

ok but the main point of the post was that the CIGAR string was not a "compact representation" instead it is a more of a compact partial representation

ADD REPLYlink written 5.5 years ago by Istvan Albert ♦♦ 73k

If compactness was a design goal of BAM and the availability of the reference was assumed, we would store CIGAR and the inverse of the MD in a combined field and not store the sequence at all. Besides, it's 4 bytes per SNP and read before compression, but a lot less afterwards.

ADD REPLYlink written 5.5 years ago by Marvin770
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1448 users visited in the last hour