Position Of Mismatches Per Read From A Sam/Bam File
2
7
Entering edit mode
8.3 years ago
Arun 2.4k

Hello all,

From a bam (or sam) file, by looking at the MD:Z field, we could identify the position of mismatches. For example, if the cigar string is 45M30N35M with the mismatches occuring at positions 5(A->T) and 65(C->G), then the MD flag would be MD:Z:4A59C15 to reflect the mismatched bases at 5th and 65th position. Of course it gets a bit complicated if there are consecutive mismatches and/or deletions followed by mismatches. If you're interested, this post explains it very well.

What I am interested in is, given the cigar string and MD:Z tag, to obtain the position of mismatches in a vector. In this case it would be 5 and 65. I could implement it myself, but I am half-minded about it (due to time restrictions) and was wondering if any of the already existing R-packages (like GenomicRanges and such) have a way of obtaining this info directly. Are there any packages anyone is aware of?

Thank you very much in advance. And I wish the biostars forum a Merry Christmas and a very happy new year!

rna-seq sam bam • 18k views
ADD COMMENT
2
Entering edit mode
8.3 years ago

UPDATED: moved to https://github.com/lindenb/jvarkit/wiki/Biostar59647

The following code should do the job:

https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar59647.java

Compilation:

 javac -cp path/to/sam.jar:/path/to/picard.jar Biostar59647.java

Execute:

 java -cp path/to/sam.jar:/path/to/picard.jar:. Biostar59647  \
           ~/samtools-0.1.18/examples/toy.fa \
           ~/samtools-0.1.18/examples/toy.bam \
            ref:9-20 | xmllint --format -

Result:

<?xml version="1.0" encoding="UTF-8"?>
<biostar59647 ref="/home/lindenb/samtools-0.1.18/examples/toy.fa" bam="/home/lindenb/samtools-0.1.18/examples/toy.bam">
  <interval chrom="ref" start="9" end="20">
    <read>
      <name>r001</name>
      <sequence>TTAGATAAAGAGGATACTG</sequence>
      <flags>163</flags>
      <chrom>ref</chrom>
      <pos>7</pos>
      <cigar>8M4I4M1D3M</cigar>
      <align>
        <M read-index="1" read-base="T" ref-index="7"/>
        <M read-index="2" read-base="T" ref-index="8"/>
        <M read-index="3" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="4" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="5" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="6" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="7" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="8" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="9" read-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="A"/>
        <I read-index="12" read-base="G"/>
        <M read-index="13" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="14" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="15" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="16" read-base="A" ref-index="18" ref-base="A"/>
        <D ref-index="19" ref-base="G"/>
        <M read-index="17" read-base="C" ref-index="20" ref-base="C"/>
        <M read-index="18" read-base="T" ref-index="21"/>
        <M read-index="19" read-base="G" ref-index="22"/>
      </align>
    </read>
    <read>
      <name>r002</name>
      <sequence>AAAAGATAAGGGATAAA</sequence>
      <flags>0</flags>
      <chrom>ref</chrom>
      <pos>9</pos>
      <cigar>1S2I6M1P1I1P1I4M2I</cigar>
      <align>
        <S read-index="1" read-base="A"/>
        <I read-index="2" read-base="A"/>
        <I read-index="3" read-base="A"/>
        <M read-index="4" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="5" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="6" read-base="A" ref-index="11" ref-base="A"/>
        <M read-index="7" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="8" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="9" read-base="A" ref-index="14" ref-base="A"/>
        <I read-index="10" read-base="G"/>
        <I read-index="11" read-base="G"/>
        <M read-index="12" read-base="G" ref-index="15" ref-base="G"/>
        <M read-index="13" read-base="A" ref-index="16" ref-base="A"/>
        <M read-index="14" read-base="T" ref-index="17" ref-base="T"/>
        <M read-index="15" read-base="A" ref-index="18" ref-base="A"/>
        <I read-index="16" read-base="A"/>
        <I read-index="17" read-base="A"/>
      </align>
    </read>
    <read>
      <name>r003</name>
      <sequence>AGCTAA</sequence>
      <flags>0</flags>
      <chrom>ref</chrom>
      <pos>9</pos>
      <cigar>5H6M</cigar>
      <align>
        <M read-index="1" read-base="A" ref-index="9" ref-base="A"/>
        <M read-index="2" read-base="G" ref-index="10" ref-base="G"/>
        <M read-index="3" read-base="C" ref-index="11" ref-base="A" mismatch="true"/>
        <M read-index="4" read-base="T" ref-index="12" ref-base="T"/>
        <M read-index="5" read-base="A" ref-index="13" ref-base="A"/>
        <M read-index="6" read-base="A" ref-index="14" ref-base="A"/>
      </align>
    </read>
    <read>
      <name>r004</name>
      <sequence>ATAGCTCTCAGC</sequence>
      <fl
ADD COMMENT
0
Entering edit mode

Hi Pierre, Thanks for the code. I've trouble running it (the execute step). Error message: javac: invalid flag: reference.fa. Also I managed to do this in R as its a part of my pipeline. It'd be great to get this running however.

ADD REPLY
0
Entering edit mode

I've added a line: with "setDefaultValidationStringency"

ADD REPLY
1
Entering edit mode
  • I think you typed "javac" instead of "java"
ADD REPLY
1
Entering edit mode

you're right. It works like a charm. Thank you.

ADD REPLY
0
Entering edit mode

Hello Pierre, where can I get sam.jar?

ADD REPLY
0
Entering edit mode

see my "update' above . there is no more sam.jar, this is now a standalone software: http://lindenb.github.io/jvarkit/Biostar59647.html

ADD REPLY
0
Entering edit mode
8.3 years ago
Arun 2.4k

Thanks to Pierre for the code. However, I wanted the implementation in R as it will be part of a pipeline. Assuming that you extracted the bam file with NM and MD:Z flags, let's assume for a read, NM = 4 and MD flag = 0G15^GAC0T60T4^AA0C0 (this MD flag has all the combinations I could think of), the code that accomplishes the task of finding the position of mismatches is,

nm      <- 4
md      <- "0G15^GAC0T60T4^AA0C0"
# First crucial step is to split the md string at each substitution/deletion operation
md.gsub <- gsub("([\\^]*[ACGT]+)[0]*", " \\1 ", md)
# split each operation using strsplit
md.spl  <- strsplit(md.gsub, "[ ]+")[[1]]
this    <- as.integer()

# since its a relatively time-consuming operation
# use NM flag to calculate position of mismatches
# only if there are mismatches
if (nm != 0) {
    this <- lapply(md.spl, function(y) { 
        if (!is.na(as.numeric(y))) {
            o <- rep("M", as.numeric(y))
        } else if( length(grep("\\^", y)) > 0) {
            o <- rep("D", nchar(y) - 1)
        } else if (nchar(y) == 1) {
            o <- rep("MM", 1)
        }
    })
    this <- do.call(c, this)
    # after this step, we have a vector of length = 
    # read length and each position telling if its a 
    # match(M), mismatch(MM) or deletion (D)
    this <- which(this == "MM")
}
this
# [1]  1 20 81 88

In case you want to use this on a set of reads for a bam file loaded in R, then you can wrap this around with sapply on each MD tag.

Note: Since we are coercing characters as well to numeric data type with as.numeric, there will be warnings when you run this code. Its not an issue. That's what we expect.

ADD COMMENT

Login before adding your answer.

Traffic: 2536 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6