Question

Is There Any R Package To Parse Cigar-Element Of Sam Files?

3

Entering edit mode

12.8 years ago

Eva ▴ 30

Hey,

i need to parse the Cigar-Element or rather the MD ﬁeld of SAM files to get the reference sequence, is there any script/tool or package (especially for R)?

For example:

Cigar: 4M

MD: 0C0T2

Sequence: ACGT

result what i want is: CTGT

so that i can just count the mismatches by comparing strings, or is there any other way?

cigar samtools r parsing • 9.1k views

ADD COMMENT • link updated 12.7 years ago by Michael 54k • written 12.8 years ago by Eva ▴ 30

1

Entering edit mode

Not an answer really, but CIGAR parsing is best tackled with regular expressions. Thus, the existing Perl and Python BAM APIs are a better solution. IMHO, this sort of work is not in R's wheelhouse.

ADD REPLY • link 12.8 years ago by Aaronquinlan 12k

1

Entering edit mode

HTseq: has parse_cigar function to work with CIGAR strings:

http://www-huber.embl.de/users/anders/HTSeq/doc/alignments.html#cigar-strings

ADD REPLY • link 9.6 years ago by Rm 8.3k

0

Entering edit mode

Why not retrieving the reference sequence just by matching the exact coordinate of the read in the reference file and then make your comparison ? (instead of trying to regenerate ref with CIGAR)

ADD REPLY • link 12.8 years ago by toni ★ 2.2k

0

Entering edit mode

thank you all for your answers i will try it

ADD REPLY • link 12.8 years ago by Eva ▴ 30

score 4 · Answer 1 · 2012-02-03

I have made some very simple R-functions that can parse the CIGAR string. I put it here for your reference.

matcher <- function(pattern, x) {

  ind = gregexpr(pattern, x)[[1]]
  start = as.numeric(ind)
  end = start + attr(ind, "match.length")- 2
  apply(cbind(start,end), 1, function(y) substr(x, start=y[1], stop=y[2]));
}
doone <- function(c, cigar) {
  pat <- paste("\\d+", c , sep="")
  sum(as.numeric(matcher(pat, cigar)), na.rm=T)
}


## takes a cigar string and parses it, not very fast but...
cigarsums <- function(cigar, chars=c("M","N","D","I","S","H", "P", "X", "=")) {
   sapply (chars, doone, cigar)
}

Now this can be used in the following way:

library(Rsamtools)
example(readBamGappedAlignments)
 sapply (aln1[140:150,]@cigar, cigarsums)
  40M 35M 36M 18M5I12M 40M 14M5I17M 11M5I19M 35M 35M 35M 35M
M  40  35  36       30  40       31       30  35  35  35  35
N   0   0   0        0   0        0        0   0   0   0   0
D   0   0   0        0   0        0        0   0   0   0   0
I   0   0   0        5   0        5        5   0   0   0   0
S   0   0   0        0   0        0        0   0   0   0   0
H   0   0   0        0   0        0        0   0   0   0   0

score 1 · Answer 2 · 2011-11-11

I was compelled to zombify this thread, because I hacked my way through an analogous problem a few days ago. It's an ugly solution. There are probably more elegant ones in R, and surely more elegant ones using another tool.

## create example data
vcfInput=structure(list(V1 = c("chr37", "chr37"), V2 = c(27100083L, 27100196L
), V3 = c("G", "C"), V4 = c("C", "G"), V5 = c(999, 43.5), V6 = c("DP=70;VDB=0.0256;AF1=0.9549;AC1=25;DP4=1,3,37,24;MQ=48;FQ=72.8;PV4=0.3,2.1e-13,0.15,1", 
"DP=78;VDB=0.0305;AF1=0.04422;AC1=1;DP4=36,37,2,3;MQ=49;FQ=43.7;PV4=1,1e-06,0.45,0.4"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), row.names = 1:2, class = "data.frame")
colnames(vcfInput)=c("chr", "pos","ref","alt","qual","packed")

## define a trivial convenience function
getAllListSubItemsByIndex <-function (theList, theIndex){ paste(lapply(theList,function(x) x=x[[theIndex]]))}

## identify each kind of coded value packed in the string:
valNames=unique(unlist(lapply(1:nrow(vcfInput),function(i) getAllListSubItemsByIndex (strsplit(strsplit(vcfInput $packed[i], split=";")[[1]], split="="),1))))

## add on to the existing data frame
newColIndices=(1 + ncol(vcfInput)):(ncol(vcfInput)+length(valNames))
vcfInput[, newColIndices]=NA
colnames(vcfInput)[newColIndices]= valNames

## collect the values
for (i in 1:nrow(vcfInput)){
    pairs=strsplit(vcfInput$packed[i], split=";")
    vals=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),2)
    currentValNames=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),1)
    vcfInput [i, currentValNames]= vals
    }