Question: SAM MD tag for consecutive insertions/deletions
gravatar for marongiu.luigi
17 days ago by
Germany, Mannheim, UMM
marongiu.luigi130 wrote:


I am struggling to understand the MD optional tag of the SAM format. Let's say I have the following alignment:



I understrand the CIGAR for the seq entry should be (I believe): 3M1D1I3M2D3M2I2M but what about the relative MD tag? Deletions in the query sequence are indicated with the caret ^ but the deletions? and would consecutive insertions be treated independently with a series of carets? so that the MD field would be: 1T1^C{unknown}3^C^C3{unknown}{unknown}2?

Thank you

snp sequence alignment genome • 113 views
ADD COMMENTlink modified 16 days ago • written 17 days ago by marongiu.luigi130
gravatar for dariober
17 days ago by
Glasgow - UK
dariober8.9k wrote:

I think your cigar string should end with 1M, not 2M and the resulting MD tag should be 1T1^C3^CC2C1. I haven't done it by hand as I haven't figured out the details of the MD syntax myself. Instead, I used Java/htsjdk with the following code, provided that I haven't made any mistake:

public void test(){
    SAMRecord rec= new SAMRecord(null);
    byte[] ref= "ATGCTTTCCGGCCC".getBytes();
    SequenceUtil.calculateMdAndNmTags(rec, ref, true, true);
ADD COMMENTlink written 17 days ago by dariober8.9k

interesting script! anyhow, I made a typo in the seq sequence, it should have ended with a double C (corrected), sorry. If I understand correctly, MD ignores the deletions since they do not affect the reference sequence, whereas insertions get the caret; from what you worte, consecutive insertions are indicated by a single caret, right? Tx

ADD REPLYlink written 16 days ago by marongiu.luigi130

Mmm, not quite. The purpose of the MD tag is to reconstruct the reference sequence given the read sequence, the cigar string and, of course, the MD tag. Insertions in the reference are not part of the reference (by definition!) so they don't need to be encoded in the MD. The caret indicates that there is a deletion in the reference, i.e. there are reference bases missing in the read sequence, so we put these missing bases in the MD. I think you've got your deletion/insertion terminology the other way round. "Insertion" means insertion in the reference and "deletion" means deletions in the reference (not in the read). Hope it makes sense...! Have a look also at this nice explanation

ADD REPLYlink written 16 days ago by dariober8.9k

it is very confusing, i have also read the link you sent me but since there are not the original sequences it is difficult for me to follow. since ^ indicates a deletion in the reference, shouldn't the MD of this example have a ^T^G since TG are present in seq but missing in the reference? (the first substitution I indicated can simply be reported as a mutation). So the MD should be 1T1C3 for the first 7 bases. But afterwards? should i simply ignore the following CC that represent a deletion in seq? then there is a 2C then a ^T^G (or simply ^TG) and a final 2?

ADD REPLYlink modified 16 days ago • written 16 days ago by marongiu.luigi130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 537 users visited in the last hour