Question: How To Define Percent-Identity From A Cigar String/ A Bam/Sam File?
7
8.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Hi, I would like to calculate the percent-identity from a CIGAR string from a BAM/SAM file containing alignments. I want to calculate the PID only for the aligned region, ignoring clipped ends ("H","S"). I can parse the CIGAR in R and get the sums for each letter in the CIGAR, so it's just a conceptual question about the definition and whether or not it is correct: so given the CIGAR contains characters: "M" (match),"N" (skip),"D" (Deletion),"I"(Insertion), "S" (soft-clip),"H" (hard-clip), of which I ignore "S", "H", and "M" is the total sum of M, N the total of N, etc.:

e.g.: `10S 20M5I5D20M 10S`

Is this a good way of defining the formula?

``````pid1 := 100 * M / (M+N+I)
``````

or maybe

``````pid2 := 100* M / (M+N+I-D)
``````

the example: pid1 = ~88% but then pid2 = 100% (which wouldn't make much sense).

Thank you very much

cigar bam sam • 11k views
modified 8.1 years ago by Wen.Huang1.2k • written 8.1 years ago by Michael Dondrup47k

Computationally, is there really a difference between skipped and deletion?

I thought that N was indicating a mismatch, or not?

None.

%identity = match-(NM)/length

S=20 M=40 I=5 D=5, length=20+40+5+5=70, match=40 ... so 57% (unless there were NM's).

Soft clipping = mismatch in most situations. Inserts & deletions are also mismatch.... in that they don't exist in the reference. Only way to get 100% is if it's ##M with no NM.

6
8.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

I would say pid1 would be correct. Computationally, I don't think there is any difference between N (skipped) and D (deletion).

Both skipped and deletion should produce something like this:

``````read       ACGTACGT--ACGTACGT
reference  ACGTACGTAAACGTACGT
``````

So if you add N + D, you are adding the gap twice.

edit **

The SAM specs is a bit confusing on the matter:

``````M alignment match (can be a sequence match or mismatch)
I insertion to the reference
D deletion from the reference
N skipped region from the reference
S soft clipping (clipped sequences present in SEQ)
H hard clipping (clipped sequences NOT present in SEQ)
= sequence match
X sequence mismatch
``````

M can be a match or mismatch??

Further down in the sam specs:

• For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
• Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ
3

looks like M is both match and mismatch from the SAM specs. I think bowtie outputs M, H, and Ns. Looking at output from one of my runs using ABI LifeScope, I have CIGAR string of: 20H2S28M. Maybe the various mappers just don't conform to the standards?

3

The sam specification designates MD tag to inform of where the mismatches occur. Most aligners will fill this in correctly in the SAM file. Otherwise you can use calmd command of samtools to fill in this tag.

2

wait that's fishy, then there is no way of determining if it is a match or mismatch, like in 10M could mean anything? Then that would mean there is no way to calculate the identity, I can't believe it...

and the mismatches?

I guess it depends on whether you are doing a mRNA reads-to-genome alignment or genome reads-to-genome alignment? Ns are intronic in mRNA reads to genome. = and Xs are used in genome reads-to-genome?

Thank you DK, i will check that a bit further, but now believe I have to re-align the sequences using smith-waterman or so to get this data. If one can't differentiate a match from a mismatch (facepalm)

I have tried to run samtools calmd on my bam file but receive a segmentation fault. My bam file is the result of BWA-SW (long read algorithm). Will write to the samtools mailing list.

So I accepted this, because it led to the right path. From the CIGAR string alone it is impossible, and from a whole BAM file only when using the correct MD tags which must be set by the aligner or samtools calmd which crashes.

Maybe I should try to run calmd again?

5
8.1 years ago by
Wen.Huang1.2k
Wen.Huang1.2k wrote:

No one seems to have mentioned it, but how about the "NM" tag ("edit distance" in SAM format specification)? I think this is the closest to "mismatches" that can be obtained in a relatively fast way by parsing bam files. One caveat, indels are always going to be a problem when defining identity and I don't know how they are counted in the "NM" tag, perhaps different programs have different ways of counting them.

Actually that was what I was looking for. Unfortunately, these tags are optional, aren't they, at least my file doesn't have them, but if most tools support these, this is obviously the way to go. In any case, my guess is that Levenshtein distance makes most sense, because that handles indels.

2
8.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Thanks to the answer from DK and Istvan's comment I conclude that it is impossible to compute percent identity from the CIGAR string alone. This is because the CIGAR string does not contain enough information to capture Levenshtein-distance. Imo this is actually a misguided design of the string.

It might be feasible by obtaining proper tag annotation, but this way is blocked for me at the moment due to a gefault in samtools calmd.

Therefore, I will describe a way of doing the computation by re-aligning the sequences in the bam-file using R functions `pairwiseAlignment` and `stringDist`. I will then define the PID as the quotient of Levenshtein-distance and the length of the maximum length of query, subject.

I will put my code here as a go along.

``````bam = scanBam("reads.bam")
pids = numeric(0)
# calculate coordinates to cut out reference sequences
# this might not work if there are large gaps in the alignment
# on the other hand query width should capture this
start = bam[[1]]\$pos - bam[[1]]\$qwidth
end = bam[[1]]\$pos + 2 * bam[[1]]\$qwidth
start[start < 1] = 1
end[end>length(genome[[1]])] = length(genome[[1]])

for (i in 1:length(bam[[1]]\$seq))  {
# do pairwise alignment against the matched sequence +- 1 query length
# according to Jeremy's comment
aln = pairwiseAlignment(bam[[1]]\$seq[i], subseq(genome[[1]], start= start[i], end=end[i]), type="local")
d = stringDist(c(as.character(aln@pattern), as.character(aln@subject)), m="l", i=F)
l = max(c(width(aln@pattern),(width(aln@subject) )))
aln = NULL
cat (i);
pid = as.numeric((l-d)/l)
print(pid)
pids = c(pids,pid)
gc()
}
``````

Edit: Use only portions of the sequence as suggested by Jeremy, that speeds it up quite a bit.

This will take some time as pairwiseDistance using Smith-Waterman is quite slow and is therefore only feasible with small bam files. For larger files, other alignment tools might be more appropriate. But this demonstrates how it works. (You have to use a loop here, this is intentional, because using lapply with pairwiseAlignment opens a memory-leak, process killed at >20GB mem usage)

1

Can't you extract some reference sequence (a window) around the reported hit? That would be a lot faster than redoing s-w vs the genome.