How to check if a read in a SAM file contains substitutions
2
0
Entering edit mode
3.6 years ago

How I do check whether a read contains substitutions or not. I'm not sure if I should check the flag, the cigar or the sequence.

sam bam read c++ • 891 views
1
Entering edit mode
3.6 years ago

Typically this is indicated in the NM auxiliary tag. Since you have C++ in the tags, I'll note that you can retrieve this in htslib via uint8_t *v = bam_aux_get(b, "NM") and then int64_t NM = bam_aux2i(v);.

0
Entering edit mode

That's great, what exactly do I have to check in the NM tag?

0
Entering edit mode

I gave you two C commands that use htslib in my original reply.

0
Entering edit mode

NM is defined as the edit distance to the reference. Insertions and deletions are included . Most caller do not included soft or hard clipped bases in their calculation (the specs do not define whether they should or not).

0
Entering edit mode

True, one would also need to parse the CIGAR string and subtract appropriately.

1
Entering edit mode
3.6 years ago
d-cameron ★ 2.3k

If you are looking exclusively for base substitution for alignments that use M CIGAR operators (as opposed to X and =), then you either have to a) check each base against the reference and calculate mismatches yourself, or b) use the NM tag and adjust for any indels in the alignment.

3
Entering edit mode

To add to this answer, A while back a wrote a program, ExpandCigar, that expands the M operator with 'X' or '='. After that, it's easy to extract reads where the CIGAR string contains X.