Question: How To Define Percent-Identity From A Cigar String/ A Bam/Sam File?
7
gravatar for Michael Dondrup
8.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Hi, I would like to calculate the percent-identity from a CIGAR string from a BAM/SAM file containing alignments. I want to calculate the PID only for the aligned region, ignoring clipped ends ("H","S"). I can parse the CIGAR in R and get the sums for each letter in the CIGAR, so it's just a conceptual question about the definition and whether or not it is correct: so given the CIGAR contains characters: "M" (match),"N" (skip),"D" (Deletion),"I"(Insertion), "S" (soft-clip),"H" (hard-clip), of which I ignore "S", "H", and "M" is the total sum of M, N the total of N, etc.:

e.g.: 10S 20M5I5D20M 10S

Is this a good way of defining the formula?

pid1 := 100 * M / (M+N+I)

or maybe

pid2 := 100* M / (M+N+I-D)

the example: pid1 = ~88% but then pid2 = 100% (which wouldn't make much sense).

Related: http://biostar.stackexchange.com/questions/9358/is-there-any-r-package-to-parse-cigar-element-of-sam-files/17031#17031

Thank you very much

cigar bam sam • 11k views
ADD COMMENTlink modified 8.1 years ago by Wen.Huang1.2k • written 8.1 years ago by Michael Dondrup47k

Computationally, is there really a difference between skipped and deletion?

ADD REPLYlink written 8.1 years ago by Damian Kao15k

I thought that N was indicating a mismatch, or not?

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

None.

%identity = match-(NM)/length

S=20 M=40 I=5 D=5, length=20+40+5+5=70, match=40 ... so 57% (unless there were NM's).

Soft clipping = mismatch in most situations. Inserts & deletions are also mismatch.... in that they don't exist in the reference. Only way to get 100% is if it's ##M with no NM.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by earonesty230
6
gravatar for Damian Kao
8.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

I would say pid1 would be correct. Computationally, I don't think there is any difference between N (skipped) and D (deletion).

Both skipped and deletion should produce something like this:

read       ACGTACGT--ACGTACGT
reference  ACGTACGTAAACGTACGT

So if you add N + D, you are adding the gap twice.

edit **

The SAM specs is a bit confusing on the matter:

M alignment match (can be a sequence match or mismatch)
I insertion to the reference
D deletion from the reference
N skipped region from the reference
S soft clipping (clipped sequences present in SEQ)
H hard clipping (clipped sequences NOT present in SEQ)
P padding (silent deletion from padded reference)
= sequence match
X sequence mismatch

M can be a match or mismatch??

Further down in the sam specs:

  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ
ADD COMMENTlink modified 8.1 years ago • written 8.1 years ago by Damian Kao15k
3

looks like M is both match and mismatch from the SAM specs. I think bowtie outputs M, H, and Ns. Looking at output from one of my runs using ABI LifeScope, I have CIGAR string of: 20H2S28M. Maybe the various mappers just don't conform to the standards?

ADD REPLYlink written 8.1 years ago by Damian Kao15k
3

The sam specification designates MD tag to inform of where the mismatches occur. Most aligners will fill this in correctly in the SAM file. Otherwise you can use calmd command of samtools to fill in this tag.

ADD REPLYlink written 8.1 years ago by Istvan Albert ♦♦ 82k
2

wait that's fishy, then there is no way of determining if it is a match or mismatch, like in 10M could mean anything? Then that would mean there is no way to calculate the identity, I can't believe it...

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

and the mismatches?

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

I guess it depends on whether you are doing a mRNA reads-to-genome alignment or genome reads-to-genome alignment? Ns are intronic in mRNA reads to genome. = and Xs are used in genome reads-to-genome?

ADD REPLYlink written 8.1 years ago by Damian Kao15k

Thank you DK, i will check that a bit further, but now believe I have to re-align the sequences using smith-waterman or so to get this data. If one can't differentiate a match from a mismatch (facepalm)

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

I have tried to run samtools calmd on my bam file but receive a segmentation fault. My bam file is the result of BWA-SW (long read algorithm). Will write to the samtools mailing list.

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

So I accepted this, because it led to the right path. From the CIGAR string alone it is impossible, and from a whole BAM file only when using the correct MD tags which must be set by the aligner or samtools calmd which crashes.

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k

Maybe I should try to run calmd again?

ADD REPLYlink written 5.5 years ago by Michael Dondrup47k
5
gravatar for Wen.Huang
8.1 years ago by
Wen.Huang1.2k
Wen.Huang1.2k wrote:

No one seems to have mentioned it, but how about the "NM" tag ("edit distance" in SAM format specification)? I think this is the closest to "mismatches" that can be obtained in a relatively fast way by parsing bam files. One caveat, indels are always going to be a problem when defining identity and I don't know how they are counted in the "NM" tag, perhaps different programs have different ways of counting them.

ADD COMMENTlink written 8.1 years ago by Wen.Huang1.2k

Actually that was what I was looking for. Unfortunately, these tags are optional, aren't they, at least my file doesn't have them, but if most tools support these, this is obviously the way to go. In any case, my guess is that Levenshtein distance makes most sense, because that handles indels.

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k
2
gravatar for Michael Dondrup
8.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Thanks to the answer from DK and Istvan's comment I conclude that it is impossible to compute percent identity from the CIGAR string alone. This is because the CIGAR string does not contain enough information to capture Levenshtein-distance. Imo this is actually a misguided design of the string.

It might be feasible by obtaining proper tag annotation, but this way is blocked for me at the moment due to a gefault in samtools calmd.

Therefore, I will describe a way of doing the computation by re-aligning the sequences in the bam-file using R functions pairwiseAlignment and stringDist. I will then define the PID as the quotient of Levenshtein-distance and the length of the maximum length of query, subject.

I will put my code here as a go along.

bam = scanBam("reads.bam")
genome = read.DNAStringSet("genome.fna")
pids = numeric(0)
# calculate coordinates to cut out reference sequences
# this might not work if there are large gaps in the alignment
# on the other hand query width should capture this
start = bam[[1]]$pos - bam[[1]]$qwidth
end = bam[[1]]$pos + 2 * bam[[1]]$qwidth
start[start < 1] = 1
end[end>length(genome[[1]])] = length(genome[[1]])

for (i in 1:length(bam[[1]]$seq))  { 
  # do pairwise alignment against the matched sequence +- 1 query length
  # according to Jeremy's comment
  aln = pairwiseAlignment(bam[[1]]$seq[i], subseq(genome[[1]], start= start[i], end=end[i]), type="local")
  d = stringDist(c(as.character(aln@pattern), as.character(aln@subject)), m="l", i=F) 
  l = max(c(width(aln@pattern),(width(aln@subject) )))
  aln = NULL
  cat (i);
  pid = as.numeric((l-d)/l) 
  print(pid)
  pids = c(pids,pid)
  gc()
}

Edit: Use only portions of the sequence as suggested by Jeremy, that speeds it up quite a bit.

This will take some time as pairwiseDistance using Smith-Waterman is quite slow and is therefore only feasible with small bam files. For larger files, other alignment tools might be more appropriate. But this demonstrates how it works. (You have to use a loop here, this is intentional, because using lapply with pairwiseAlignment opens a memory-leak, process killed at >20GB mem usage)

ADD COMMENTlink modified 8.1 years ago • written 8.1 years ago by Michael Dondrup47k
1

Can't you extract some reference sequence (a window) around the reported hit? That would be a lot faster than redoing s-w vs the genome.

ADD REPLYlink written 8.1 years ago by Jeremy Leipzig19k

Good idea, I'll try that

ADD REPLYlink written 8.1 years ago by Michael Dondrup47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1300 users visited in the last hour