Question: Position Of Mismatches Per Read From A Sam/Bam File
gravatar for Arun
6.7 years ago by
Arun2.3k wrote:

Hello all,

From a bam (or sam) file, by looking at the MD:Z field, we could identify the position of mismatches. For example, if the cigar string is 45M30N35M with the mismatches occuring at positions 5(A->T) and 65(C->G), then the MD flag would be MD:Z:4A59C15 to reflect the mismatched bases at 5th and 65th position. Of course it gets a bit complicated if there are consecutive mismatches and/or deletions followed by mismatches. If you're interested, this post explains it very well.

What I am interested in is, given the cigar string and MD:Z tag, to obtain the position of mismatches in a vector. In this case it would be 5 and 65. I could implement it myself, but I am half-minded about it (due to time restrictions) and was wondering if any of the already existing R-packages (like GenomicRanges and such) have a way of obtaining this info directly. Are there any packages anyone is aware of?

Thank you very much in advance. And I wish the biostars forum a Merry Christmas and a very happy new year!

bam sam rna-seq • 15k views
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Arun2.3k
gravatar for Pierre Lindenbaum
6.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

UPDATED: moved to

The following code should do the job:


 javac -cp path/to/sam.jar:/path/to/picard.jar


 java -cp path/to/sam.jar:/path/to/picard.jar:. Biostar59647  \
           ~/samtools-0.1.18/examples/toy.fa \
           ~/samtools-0.1.18/examples/toy.bam \
            ref:9-20 | xmllint --format -


<?xml version="1.0" encoding="UTF-8"?>
<biostar59647 ref="/home/lindenb/samtools-0.1.18/examples/toy.fa" bam="/home/lindenb/samtools-0.1.18/examples/toy.bam">
  <interval chrom="ref" start="9" end="20">
        <M read-index="1" read-base="T" ref-index="7"/>
        <M read-index="2" read-base="T" ref-index="8"/>
        <M read-index="3" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="4" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="5" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="6" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="7" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="8" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="9" read-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="A"/>
        <I read-index="12" read-base="G"/>
        <M read-index="13" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="14" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="15" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="16" read-base="A" ref-index="18" ref-base="A"/>
        <D ref-index="19" ref-base="G"/>
        <M read-index="17" read-base="C" ref-index="20" ref-base="C"/>
        <M read-index="18" read-base="T" ref-index="21"/>
        <M read-index="19" read-base="G" ref-index="22"/>
        <S read-index="1" read-base="A"/>
        <I read-index="2" read-base="A"/>
        <I read-index="3" read-base="A"/>
        <M read-index="4" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="5" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="6" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="7" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="8" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="9" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="G"/>
        <M read-index="12" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="13" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="14" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="15" read-base="A" ref-index="18" ref-base="A"/>
        <I read-index="16" read-base="A"/>
        <I read-index="17" read-base="A"/>
        <M read-index="1" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="2" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="3" read-base="C" ref-index="11" ref-base="A" mismatch="true"/>
        <M read-index="4" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="5" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="6" read-base="A" ref-index="14" ref-base="A"/>
ADD COMMENTlink modified 5.7 years ago • written 6.7 years ago by Pierre Lindenbaum122k

Hi Pierre, Thanks for the code. I've trouble running it (the execute step). Error message: javac: invalid flag: reference.fa. Also I managed to do this in R as its a part of my pipeline. It'd be great to get this running however.

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Arun2.3k

I've added a line: with "setDefaultValidationStringency"

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Pierre Lindenbaum122k
  • I think you typed "javac" instead of "java"
ADD REPLYlink written 6.7 years ago by Pierre Lindenbaum122k

you're right. It works like a charm. Thank you.

ADD REPLYlink written 6.7 years ago by Arun2.3k

Hello Pierre, where can I get sam.jar?

ADD REPLYlink written 21 months ago by vaslanzadeh0

see my "update' above . there is no more sam.jar, this is now a standalone software:

ADD REPLYlink written 21 months ago by Pierre Lindenbaum122k
gravatar for Arun
6.7 years ago by
Arun2.3k wrote:

Thanks to Pierre for the code. However, I wanted the implementation in R as it will be part of a pipeline. Assuming that you extracted the bam file with NM and MD:Z flags, let's assume for a read, NM = 4 and MD flag = 0G15^GAC0T60T4^AA0C0 (this MD flag has all the combinations I could think of), the code that accomplishes the task of finding the position of mismatches is,

nm      <- 4
md      <- "0G15^GAC0T60T4^AA0C0"
# First crucial step is to split the md string at each substitution/deletion operation
md.gsub <- gsub("([\\^]*[ACGT]+)[0]*", " \\1 ", md)
# split each operation using strsplit
md.spl  <- strsplit(md.gsub, "[ ]+")[[1]]
this    <- as.integer()

# since its a relatively time-consuming operation
# use NM flag to calculate position of mismatches
# only if there are mismatches
if (nm != 0) {
    this <- lapply(md.spl, function(y) { 
        if (! {
            o <- rep("M", as.numeric(y))
        } else if( length(grep("\\^", y)) > 0) {
            o <- rep("D", nchar(y) - 1)
        } else if (nchar(y) == 1) {
            o <- rep("MM", 1)
    this <-, this)
    # after this step, we have a vector of length = 
    # read length and each position telling if its a 
    # match(M), mismatch(MM) or deletion (D)
    this <- which(this == "MM")
# [1]  1 20 81 88

In case you want to use this on a set of reads for a bam file loaded in R, then you can wrap this around with sapply on each MD tag.

Note: Since we are coercing characters as well to numeric data type with as.numeric, there will be warnings when you run this code. Its not an issue. That's what we expect.

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Arun2.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1871 users visited in the last hour