Should We Attempt To Redefine The Cigar String In Sam/Bam Format?
2
14
Entering edit mode
10.5 years ago

My last question has led me to assumption that the CIGAR string in SAM/BAM files is possibly not very well-defined. Summarized: you cannot calculate a string-difference (e.g. Levenshtein distance) from a CIGAR string and therefore, the sequence-similarity within the aligned region cannot be computed.

The reason for this is quite trivial, CIGAR doesn't differentiate matches and mismatches: According to the SAM format specification the M character in a CIGAR

M alignment match (can be a sequence match or mismatch)

refers to the aligned region not to a match (identical base), such that for example 10M could mean 10 matches, 9 matches + 1 mismatch, 8 matches+2mismatches, etc.

In my humble opinion, this renders the CIGAR pretty much useless to represent an alignment. To address this, it seems that the MD=tags have been introduced, but they just make the whole thing more complex and cumbersome.

I don't know how this could have been overlooked in the design, or if it was done on purpose to keep the string compact. Anyway, I see this as a design flaw, that should be corrected. To do that in the definition is easy, let M denote matched positions only, while X (which is already in the definition) must be used to denote mismatches, such that 10M = 10 matches, 9M1X = 9M followed by 1 mismatch, 5M1X4M, 5 matches, 1 mismatch, 4 matches, and so on.

Are you with me in this?

cigar sam samtools format • 11k views
9
Entering edit mode

If M stood for gender it would mean "non-hermaphrodite". You would have to look up male/female in the MD field.

3
Entering edit mode

This has certainly been a pet peeve of mine!

0
Entering edit mode

Hear hear. The fact that you need to look in two optional fields in order to recreate an alignment is really evil.

0
Entering edit mode

Yes, it is evil maasha (not the root of all maybe), I wanted to provoke a discussion :) Let's say, it makes things more complex and error prone, and these fields don't always exist. As you saw in my related question it didn't work for me.

0
Entering edit mode

do it for science!

0
Entering edit mode

Ok then, now the real work starts ;) This of course would have large implications on all software packages using this format.

0
Entering edit mode

Jeremy, can you explain this a bit ???

0
Entering edit mode

@Michael I think that's just a joke on the definition of M meaning match or mismatch

12
Entering edit mode
10.5 years ago
lh3 33k

The CIGAR format was invented in exonerate. The original CIGAR only has three operators: M, I and D. Conceptually it is not inconsistent to mix a sequence match/mismatch. M means an alignment match, of the same position of I and D (alignment insertion and deletion). Back to those days, psl, lav, exonerate and ssaha-cigar do not distinguish sequence match/mismatch, either. To an alignment, how to place gaps is essential and where are the mismatches is inferred information; even if you know the mismatching positions, you are yet to know the reference base.

About two years ago, someone (sorry, I have changed my laptop and cannot get the name of this developer from the email archive now) proposed to distinguish sequence match/mismatch in SAM. Most of others in samtools-devel thought this is a convenient idea, which finally led to the addition of =/X. "M" is not changed and this is important for three reasons. The first has been explained well by Jared: CIGAR is the minimal representation of an alignment. Myself will also strongly object to forcing everyone to use a verbose representation =/X. That is against elegancy. Secondly, reusing "M" breaks backward compatibility, while backward compatibility is essential to a format. Thirdly, requiring to distinguish a sequence match and a mismatch will make it harder to convert traditional formats such as psl/lav/exonerate/... In all, anyway we should keep "M".

As we have "=/X" now, what is the use of them? I can only think of two use cases: 1) eyeballing the mismatching positions and 2) computing the edit distance. For the former, MD works and an alignment viewer serves the goal even better when the alignment is sorted. For the latter, the NM tag is the way to go. Nearly all the mainstream mappers output this tag and both Picard and SAMtools compute it as well. For variant calling and read counting, =/X are mostly irrelevant.

The =/X operators have been added for about two years, but they are rarely seen in practice. This already implies that this is not an indispensable feature. SAM/BAM has flaws and a few are serious, but no, not forcing "=/X" is not one of them. If you feel unhappy, write a "M->=/X" converter and keep your alignment that way.

6
Entering edit mode
10.5 years ago
Jts ★ 1.4k

I don't think this is a good idea. BAM is designed to be a compact representation with minimal data redundancy. If you require the mismatching bases are encoded in the CIGAR string, you will pay a cost of 4 bytes for every base calling error and SNP in your data set. This is a huge cost to pay for information that is easily recovered given the (current) CIGAR strings and the reference.

The current spec allows you to write CIGAR strings with X/= operators if you wish. I think it would be a mistake to require their use within the spec.

4
Entering edit mode

While there might be some benefit I don't think the the 4 bytes/SNP is the main reason. Most of the time you will end up with a more verbose MD tag anyhow so that makes most alignment files larger than what they should be. This is akin to saying why even store the read in the first place, now that would save some space - after all if you have a CIGAR string you can always get the alignment.

2
Entering edit mode

@Istvan my argument is not that it is too expensive to store that information but that you shouldn't be forced to store it.

2
Entering edit mode

@Michael I know samtools will handle X/= but I do not know about other tools.

0
Entering edit mode

That is a pretty good argument, I assume this is the reason for the curent design. But is there any software that supports this output?

0
Entering edit mode

ok but the main point of the post was that the CIGAR string was not a "compact representation" instead it is a more of a compact partial representation

0
Entering edit mode

If compactness was a design goal of BAM and the availability of the reference was assumed, we would store CIGAR and the inverse of the MD in a combined field and not store the sequence at all. Besides, it's 4 bytes per SNP and read before compression, but a lot less afterwards.

Traffic: 1724 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.