Is it possible to reconstruct alignment from CIGAR and MD strings alone?
2
7
Entering edit mode
7.4 years ago
Lynxoid ▴ 220

I have a SAM file with alignments and for each entry alignment, I want to reconstruct the alignment between the reference and the read based on the CIGAR and MD strings. It seems like this should be possible, but this one example bothers me:

SRR037452.3355  0       ENSG00000266658|ENST00000607521 3523    255     16M1I18M        *       0       0       CGGGCCGGTCCCCCCCCGCCGGGTCCGCCCCCGGC     IIIIIIIIIIIIIII:III/IIII=+IGC,I"/I. NH:i:1  HI:i:1  NM:i:2  MD:Z:33G0


Here, the CIGAR string has an insertion to the reference which messes up the MD string indexing. According to MD string, the read should be only 34 bases long (r=35). My guess is that the alignment is actually this: 16 matches, 1 base that is present in the read, not present in the reference, 17 matches, one mismatch where the read has a "G" and the reference has something else. Is that correct? Are there CIGAR/MD string combinations where reconstructing the alignment would be impossible (i.e. either of the strings is ambiguous)?

rna-seq alignment cigar • 6.6k views
0
Entering edit mode

For this alignment edit distance is 2. There is a "I" character in CIGAR string and "G" in the end of the MD string. This means there is an insertion in the middle as well a mismatch in the end of the alignment. The insertion showed in the CIGAR string should also be shown in the MD string . So the correct MD string should be something like: 16^C17G0. Which aligner did you use ? It may be possible that some aligner really care about the MD string as they are not widely used by the downstream tools (Just guessing).

1
Entering edit mode

According to the sam format specification, the MD tag specifies deletions with '^' but has no operator to specify insertions. Both the CIGAR and the MD tag are correct.

"The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string '10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string." -The sam format specification.

3
Entering edit mode

I never fully recognized this until today:

So neither the CIGAR nor the MD string is an unambiguous representation of the alignment, moreover one can only fully reconstruct the alignment by writing code of nontrivial complexity that needs to parse and combine information from the CIGAR string, the MD string and the sequence ... well that's just ridiculous.

1
Entering edit mode

"^" in MD specifies deletion in reference which is same as insertion in the aligned read. So here a 1 bp insertion in the read (non-reference) corresponds to "^C" in the MD string. PS: I haven't had my after lunch coffee yet and I may be mixing up things :-)

3
Entering edit mode

The wording is somewhat ambiguous, but I think you're backwards here. Here are some example reads aligned with BWA MEM.

With insertions:

HS2000-222_187:3:61:15153:22133 99      1       532113  22      3S65M2I30M      =       532312  299     TTGTACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTTCACACTCATGCTCACTCATACACACCCAGATC    HFHFDHHHHHHFHGHHFGHGEFFEBHFFFHHHFHFGHHHHHFHHHFHFHHHHFHFHHHHFHEGHEFEHHHEEHEBHGGHHHGHGHHHHGHHHHAHFF?FF    NM:i:2  MD:Z:95 AS:i:87 XS:i:95
HS2000-222_187:4:64:4235:36106  99      1       532113  16      5S65M2I28M      =       532311  298     GATTGTACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTTCACACTCATGCTCACTCATACACACCCAGA    EEEEE<FFFBEEECEDCCDB>DDDDFEDFFF@EFFFEFFFFFFFBCFFFEFFD<<F@?FFFFF?BE<EC8FDCDDEFBFFFFFFFCFFFBFFBFFDFEA6    NM:i:2  MD:Z:93 AS:i:85 XS:i:95


With deletions:

HS2000-222_187:2:48:3173:90818  163     1       167779  22      29M1D71M        =       167967  288     AAATAAATAAATAATTCAATGAAATTCCTAGATCCAAGGCTTTGCAATAAATATGTAAATAAATTTCCAATCTCCATACTGAAAGGTTAAAAGGAATGCT    HGHHBHHHHEHHGBHHHEGH;:4::><::>)**-;+)>>+-9;/49=6,7<9=3@,036'3<6*<D?DDD<736>9;<<>14:(339?6;CB7:B#####    NM:i:4  MD:Z:29^A7G48T7A6       AS:i:78 XS:i:78
HS2000-222_187:3:26:4645:185777 147     1       168468  0       3M1D97M =       168256  -313    AATAGACTTTACAGCAGCCGGGTGCAGTGGTGCAGGCCTGTAATCCCAGCACTTTGGCAGCAGAGGCAGGCGGATCACTTTGAGCTCAGGGCAACATAGC    =6=)<:7=7+DEDFFABGBG<GGFECHH?HHEHHGHHHHHHEHGFFFEFFD4FGFBHHHHH@HDGHGFEHDEGGDGEHHHGHHHGHHEGHHHHHHGHHHH    NM:i:1  MD:Z:3^A97      AS:i:97 XS:i:97

1
Entering edit mode

You are correct. My answer was purely based on the documentation in the SAM specification format. This line is kind of confusing "the next 5 reference bases are matches followed by a 2bp deletion from the reference". Anyways, thanks for correcting me and taking out time to post those alignments as an example.

0
Entering edit mode

So could the CIGAR and MD string be used to get percent identity of a reference region from the sam/bam file?

0
Entering edit mode

Ashutosh, I used the STAR aligner (it's super fast).

3
Entering edit mode
7.4 years ago
donfreed ★ 1.6k

Your guess is correct, the alignment has 16 matches, 1 unspecified insertion, followed by 17 matches and one mismatch ('G' on the reference which is substituted with a 'C' in the read sequence).

If the MD tag is present, you should always be able to reconstruct the alignment from the MD tag and the CIGAR, as long as you don't care about the sequence of any mismatched or inserted bases (which you can only get from the read sequence).

0
Entering edit mode

According to your understanding of CIGAR and MD string, where is the mismatch? Is it at position 33 of the original read, or position 34 since there was an insertion in the reference, but not in the read?

After playing with CIGAR and MD, I tend to agree with Istvan above: there are cases where you can not unambiguously reconstruct the alignment.

2
Entering edit mode

The mismatched base is the last base of your read 'C'. The reference you aligned to has a 'G' at that position. I was confused about this as well and I updated my answer for clarity.

The CIGAR should have information on every insertion or deletion (and any clipped sequence) in the alignment, but does not contain information on mismatches. The MD tag should have information on every deletion and mismatch in the alignment, but does not contain information on insertions. If you combine the two, you should have a complete picture of the entire alignment, as long as you don't care about the sequence of mismatches or insertions.

Combining the CIGAR and MD is not easy, but it should be possible in every case.

0
Entering edit mode

This is a good point: "(which you can only get from the read sequence)"

I believe that incorporating BAM read+MD tag+CIGAR string should be able to reconstruct the alignment.

4
Entering edit mode
7.3 years ago

Looks like there is a tool for doing just that: sam2pairwise: a tool to visualize SAM entries as pairwise alignments

3
Entering edit mode

Yup - sam2pairwise reconstructs the following pairwise alignment:

SRR037452.3355    0    ENSG00000266658|ENST00000607521    3523    255    16M1I18M    *    0    0
CGGGCCGGTCCCCCCCCGCCGGGTCCGCCCCCGGC
|||||||||||||||| |||||||||||||||||
CGGGCCGGTCCCCCCC-GCCGGGTCCGCCCCCGGG

2
Entering edit mode

This is so cool! Thanks for the tool. It makes things very obvious and clear. I am going to add to my lectures.

0
Entering edit mode

I just tried the sam2pairwise on some of my alignments, and I get the following error:

shift_cigar failed. Exiting.


thrA-schigella-dysenteriae 2210-2310    0       thrA-schigella-dysenteriae      2211    30      9=2D34=1D1X18=1X20=1X13=        *       0       0 TGATGAAGGAAGTTTTGCGCTATGTTGGCAATATTGATGAAGACGTGTCTGCCGCGTGAAGACTGCCGAAGTGGATGGTAATGTTCCGCTGTTCAAA       *  NM:i:6  AM:i:30


What could be causing the error?

2
Entering edit mode

sam2pairwise should be able to handle all the CIGAR characters mentioned on page 5 of the SAM spec. From what I can tell, there are two things preventing your entry from working:

1. There's a space in the QNAME field, and spaces aren't allowed in the name in the SAM format (see page 4 of the SAM spec). Since sam2pairwise splits on whitespace, it thinks that the sixth column (the CIGAR) is "30", rather than the actual CIGAR. That's causing the error to be thrown.
2. The entry lacks an MD tag. sam2pairwise doesn't ALWAYS need an MD tag to work, but it does if the CIGAR indicates there has been a deletion. That's because the CIGAR only indicates that there was a deletion at all; the MD tag tells you what the identity of the deleted bases were.
0
Entering edit mode

Your alignment is in the so called extended CIGAR format (see the X) values. The tools works on the classic CIGAR string.