Question: SAM MD tag for consecutive insertions/deletions
1
gravatar for marongiu.luigi
6 months ago by
Germany, Mannheim, UMM
marongiu.luigi290 wrote:

Hello,

I am struggling to understand the MD optional tag of the SAM format. Let's say I have the following alignment:

ref ATGC-TTTCCGGC--CC

seq ACG-ATTT--GGATGCC

I understrand the CIGAR for the seq entry should be (I believe): 3M1D1I3M2D3M2I2M but what about the relative MD tag? Deletions in the query sequence are indicated with the caret ^ but the deletions? and would consecutive insertions be treated independently with a series of carets? so that the MD field would be: 1T1^C{unknown}3^C^C3{unknown}{unknown}2?

Thank you

snp sequence alignment genome • 352 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by marongiu.luigi290
0
gravatar for dariober
6 months ago by
dariober9.4k
Glasgow - UK
dariober9.4k wrote:

I think your cigar string should end with 1M, not 2M and the resulting MD tag should be 1T1^C3^CC2C1. I haven't done it by hand as I haven't figured out the details of the MD syntax myself. Instead, I used Java/htsjdk with the following code, provided that I haven't made any mistake:

@Test
public void test(){
    SAMRecord rec= new SAMRecord(null);
    rec.setCigarString("3M1D1I3M2D3M2I1M");
    rec.setReadBases("ACGATTTGGATGC".getBytes());
    rec.setAlignmentStart(1);
    byte[] ref= "ATGCTTTCCGGCCC".getBytes();
    SequenceUtil.calculateMdAndNmTags(rec, ref, true, true);
    System.err.println(rec.getSAMString());
}
ADD COMMENTlink written 6 months ago by dariober9.4k

interesting script! anyhow, I made a typo in the seq sequence, it should have ended with a double C (corrected), sorry. If I understand correctly, MD ignores the deletions since they do not affect the reference sequence, whereas insertions get the caret; from what you worte, consecutive insertions are indicated by a single caret, right? Tx

ADD REPLYlink written 6 months ago by marongiu.luigi290

Mmm, not quite. The purpose of the MD tag is to reconstruct the reference sequence given the read sequence, the cigar string and, of course, the MD tag. Insertions in the reference are not part of the reference (by definition!) so they don't need to be encoded in the MD. The caret indicates that there is a deletion in the reference, i.e. there are reference bases missing in the read sequence, so we put these missing bases in the MD. I think you've got your deletion/insertion terminology the other way round. "Insertion" means insertion in the reference and "deletion" means deletions in the reference (not in the read). Hope it makes sense...! Have a look also at this nice explanation https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files

ADD REPLYlink written 6 months ago by dariober9.4k

it is very confusing, i have also read the link you sent me but since there are not the original sequences it is difficult for me to follow. since ^ indicates a deletion in the reference, shouldn't the MD of this example have a ^T^G since TG are present in seq but missing in the reference? (the first substitution I indicated can simply be reported as a mutation). So the MD should be 1T1C3 for the first 7 bases. But afterwards? should i simply ignore the following CC that represent a deletion in seq? then there is a 2C then a ^T^G (or simply ^TG) and a final 2?

ADD REPLYlink modified 6 months ago • written 6 months ago by marongiu.luigi290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1679 users visited in the last hour