I have a SAM file with alignments, and for each alignment record I want to reconstruct the alignment between the reference and the read from the CIGAR and MD strings. It seems like this should be possible, but this one example bothers me:
SRR037452.3355 0 ENSG00000266658|ENST00000607521 3523 255 16M1I18M * 0 0 CGGGCCGGTCCCCCCCCGCCGGGTCCGCCCCCGGC IIIIIIIIIIIIIII:III/IIII=+IGC,I"/I. NH:i:1 HI:i:1 NM:i:2 MD:Z:33G0
Here, the CIGAR string contains an insertion relative to the reference, which seems to throw off the MD string indexing. According to the MD string, the read should be only 34 bases long, but it is actually 35 bases long. My guess is that the alignment is actually this: 16 matches, 1 base that is present in the read but not in the reference, 17 matches, and one mismatch where the read has a "G" and the reference has something else. Is that correct? Are there CIGAR/MD string combinations for which reconstructing the alignment would be impossible (i.e. either of the strings is ambiguous)?
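
To make the question concrete, here is a minimal Python sketch of the kind of reconstruction meant above. It rests on two assumptions taken from my reading of the SAM optional-fields specification: the MD string only covers positions that consume the reference (so inserted and soft-clipped read bases are skipped), and a letter in MD is the reference base at a mismatch. The function name `reconstruct` is made up for illustration, and `N` (skipped region) operations are deliberately not handled.

```python
import re


def reconstruct(seq, cigar, md):
    """Return (ref_row, read_row) as gapped strings of equal length.

    Assumption: MD numbers count reference-consuming matches only, and
    MD letters are reference bases (mismatches, or deletions after '^').
    """
    columns = []  # each column is [ref_base_or_gap, read_base_or_gap]
    i = 0         # current position in the read sequence
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "M=X":        # aligned column; ref base filled in from MD
            for _ in range(n):
                columns.append(["?", seq[i]])
                i += 1
        elif op == "I":        # read base with no reference counterpart
            for _ in range(n):
                columns.append(["-", seq[i]])
                i += 1
        elif op == "D":        # reference base missing from the read
            for _ in range(n):
                columns.append(["?", "-"])
        elif op == "S":        # soft clip: consume read bases, no column
            i += n
        elif op in "HP":       # hard clip / padding: nothing to do here
            continue
        else:                  # 'N' is outside the scope of this sketch
            raise ValueError("unsupported CIGAR operation: " + op)

    # MD walks only over the columns that consume the reference.
    ref_cols = [c for c in columns if c[0] != "-"]
    k = 0
    for num, deleted, mismatch in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        if num:                # run of matches: reference equals read
            for _ in range(int(num)):
                if ref_cols[k][1] == "-":
                    raise ValueError("MD match run overlaps a deletion")
                ref_cols[k][0] = ref_cols[k][1]
                k += 1
        elif deleted:          # deleted reference bases
            for base in deleted:
                ref_cols[k][0] = base
                k += 1
        else:                  # mismatch: MD letter is the reference base
            ref_cols[k][0] = mismatch
            k += 1
    if k != len(ref_cols):
        raise ValueError("MD and CIGAR cover different reference lengths")

    ref_row = "".join(c[0] for c in columns)
    read_row = "".join(c[1] for c in columns)
    return ref_row, read_row


# The record from the question.
seq = "CGGGCCGGTCCCCCCCCGCCGGGTCCGCCCCCGGC"
ref_row, read_row = reconstruct(seq, "16M1I18M", "33G0")
print(ref_row)
print(read_row)
```

Under these assumptions the record above yields a 35-column alignment with a single gap in the reference row at the inserted base, and the MD string accounts for exactly the 34 reference positions covered by `16M` and `18M`.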