Question: Is There Any R Package To Parse Cigar-Element Of Sam Files?
gravatar for Eva
7.7 years ago by
Eva30 wrote:


i need to parse the Cigar-Element or rather the MD field of SAM files to get the reference sequence, is there any script/tool or package (especially for R)?

For example:

Cigar: 4M

MD: 0C0T2

Sequence: ACGT

result what i want is: CTGT

so that i can just count the mismatches by comparing strings, or is there any other way?

cigar samtools R parsing • 4.9k views
ADD COMMENTlink modified 7.6 years ago by Michael Dondrup45k • written 7.7 years ago by Eva30

Not an answer really, but CIGAR parsing is best tackled with regular expressions. Thus, the existing Perl and Python BAM APIs are a better solution. IMHO, this sort of work is not in R's wheelhouse.

ADD REPLYlink written 7.7 years ago by Aaronquinlan10k

HTseq: has parse_cigar function to work with CIGAR strings:

ADD REPLYlink modified 4.4 years ago • written 7.7 years ago by Rm7.8k

Why not retrieving the reference sequence just by matching the exact coordinate of the read in the reference file and then make your comparison ? (instead of trying to regenerate ref with CIGAR)

ADD REPLYlink written 7.7 years ago by toni2.1k

thank you all for your answers i will try it

ADD REPLYlink written 7.6 years ago by Eva30
gravatar for Michael Dondrup
7.0 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

I have made some very simple R-functions that can parse the CIGAR string. I put it here for your reference.

matcher <- function(pattern, x) {

  ind = gregexpr(pattern, x)[[1]]
  start = as.numeric(ind)
  end = start + attr(ind, "match.length")- 2
  apply(cbind(start,end), 1, function(y) substr(x, start=y[1], stop=y[2]));
doone <- function(c, cigar) {
  pat <- paste("\\d+", c , sep="")
  sum(as.numeric(matcher(pat, cigar)), na.rm=T)

## takes a cigar string and parses it, not very fast but...
cigarsums <- function(cigar, chars=c("M","N","D","I","S","H", "P", "X", "=")) {
   sapply (chars, doone, cigar)

Now this can be used in the following way:

 sapply (aln1[140:150,]@cigar, cigarsums)
  40M 35M 36M 18M5I12M 40M 14M5I17M 11M5I19M 35M 35M 35M 35M
M  40  35  36       30  40       31       30  35  35  35  35
N   0   0   0        0   0        0        0   0   0   0   0
D   0   0   0        0   0        0        0   0   0   0   0
I   0   0   0        5   0        5        5   0   0   0   0
S   0   0   0        0   0        0        0   0   0   0   0
H   0   0   0        0   0        0        0   0   0   0   0
ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Michael Dondrup45k
gravatar for User 2383
7.3 years ago by
User 238310
User 238310 wrote:

I was compelled to zombify this thread, because I hacked my way through an analogous problem a few days ago. It's an ugly solution. There are probably more elegant ones in R, and surely more elegant ones using another tool.

## create example data
vcfInput=structure(list(V1 = c("chr37", "chr37"), V2 = c(27100083L, 27100196L
), V3 = c("G", "C"), V4 = c("C", "G"), V5 = c(999, 43.5), V6 = c("DP=70;VDB=0.0256;AF1=0.9549;AC1=25;DP4=1,3,37,24;MQ=48;FQ=72.8;PV4=0.3,2.1e-13,0.15,1", 
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), row.names = 1:2, class = "data.frame")
colnames(vcfInput)=c("chr", "pos","ref","alt","qual","packed")

## define a trivial convenience function
getAllListSubItemsByIndex <-function (theList, theIndex){ paste(lapply(theList,function(x) x=x[[theIndex]]))}

## identify each kind of coded value packed in the string:
valNames=unique(unlist(lapply(1:nrow(vcfInput),function(i) getAllListSubItemsByIndex (strsplit(strsplit(vcfInput $packed[i], split=";")[[1]], split="="),1))))

## add on to the existing data frame
newColIndices=(1 + ncol(vcfInput)):(ncol(vcfInput)+length(valNames))
vcfInput[, newColIndices]=NA
colnames(vcfInput)[newColIndices]= valNames

## collect the values
for (i in 1:nrow(vcfInput)){
    pairs=strsplit(vcfInput$packed[i], split=";")
    vals=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),2)
    currentValNames=getAllListSubItemsByIndex (strsplit(pairs[[1]], split="="),1)
    vcfInput [i, currentValNames]= vals
ADD COMMENTlink written 7.3 years ago by User 238310

What is the code trying to do? Looks like it's parsing VCF in R - which may be useful to someone - but not SAM or CIGAR formats.

ADD REPLYlink written 7.3 years ago by Chris Penkett470
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1149 users visited in the last hour