I'm having a hard time implementing a calculation of the edit distance from a sam record.
For a bwa mem alignment, I would simply take the NM tag (explicitly defined as the edit distance), but this tag is not set by all aligners, and that's where the trouble starts... Alternatively, I would parse the MD tag to get the number of matched bases and subtract that from the aligned length. But conceptually that doesn't seem equivalent to the NM tag. I scale both to the aligned read length.
I should clarify this is for analysis of Oxford Nanopore data, which is dominated by indel errors.
I'll add my python code for calculating both "definitions" of the edit distances. I use pysam, but it's more about the logic than the actual code.
Using the NM tag:
read.get_tag("NM")/read.query_alignment_length
So I:
- take the NM tag which should be the edit distance
- divide it by the length of the alignment. This excludes clipped sequences from the calculation.
Using the MD tag:
(read.query_alignment_length - sum([int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if not item == '']))/read.query_alignment_length
So I:
- take the MD tag and sum all sequence 'matches' by removing all non-matching characters (ACTG and ^) from the string (is this conceptually correct?)
- subtract the number of matches from the aligned read length
- divide this by the length of the alignment. This excludes clipped sequences from the calculation.
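As a standalone sketch (pysam parts omitted, MD value made up for illustration), the match-summing step looks like this; note that the character class mirrors my code above, although N can also appear in MD tags and could be added to it:

```python
import re

def md_matched_bases(md):
    """Sum the match-run lengths in an MD tag by splitting on
    everything that is not a run of digits (mismatch bases and '^')."""
    return sum(int(run) for run in re.split(r'[ACTG^]', md) if run)

# Hypothetical MD tag: 10 matches, substitution A, 5 matches,
# 2-base deletion AC, 6 matches
md_matched_bases("10A5^AC6")  # 10 + 5 + 6 = 21
```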
To me, both calculations sound logical, but as a sanity check I decided to plot them against each other for a sample dataset aligned with bwa mem to see how these metrics correlate. The result seems okay (correlated) for many reads, but deviates quite a lot for a subset. I think my mistake is conceptual, or a misunderstanding of those tags; I assume something about how indels are taken into account is tripping me up.
Below I add a (fancy) scatter plot:
(note the pearsonr is quite okay)
Since this plot is overcrowded by dots here is also the kernel density estimation of the same data:
I think ideally I would take query and reference sequence and calculate the Levenshtein distance, but I'm afraid that's not very efficient for processing thousands of reads. Do you have better solutions?
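For reference, a textbook two-row dynamic-programming Levenshtein implementation looks like the sketch below. It is O(|a|·|b|) in time, which is exactly why running it on thousands of multi-kilobase Nanopore reads in pure Python is impractical:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution / match
            ))
        previous = current
    return previous[-1]

levenshtein("kitten", "sitting")  # 3
```

In practice a C-backed library would be needed for speed; the pure-Python version is only here to make the definition concrete.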
Hi Santosh Anand, Thanks a lot for your reply.
That seems right, but I'm not sure where you got the 100 from. I assume that's the aligned fragment length.
So while my code first summed the matches, you suggest summing the mismatches. I adapted my code following your suggestion:
So now I:
I verified (with an assert) that this code, without step 4, yields 6 in your example.
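Since the adapted code isn't pasted here, a sketch of what "summing the mismatches" from the MD tag might look like (tag values below are made up for illustration): letters in the MD string are substitutions, except the runs introduced by '^', which are deleted reference bases.

```python
import re

def md_mismatches_and_deletions(md):
    """Count substituted and deleted bases encoded in an MD tag."""
    # Deleted reference bases appear as '^' followed by one or more letters.
    deletions = sum(len(run) - 1 for run in re.findall(r'\^[A-Z]+', md))
    # After stripping the deletion runs, any remaining letter is a substitution.
    substitutions = sum(c.isalpha() for c in re.sub(r'\^[A-Z]+', '', md))
    return substitutions + deletions

# Hypothetical MD tag: 9 matches, deletion of AC, 2 matches,
# substitution A, 2 matches -> 1 substitution + 2 deleted bases
md_mismatches_and_deletions("9^AC2A2")  # -> 3
```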
As is evident from the plots below this new calculation is definitely an improvement. The dots deviate less from the bisection, and the Pearson correlation coefficient is now 0.96. But it's not identical to the NM tag calculation, yet. I wonder if insertions are properly accounted for.
That's correct!
Your new calculation looks right (according to what I said). Could you pull out some examples of discrepant NM and MD tags? Only then can we see where the error is coming from.
Will do and get back to you soon!
A few examples with, in the first column, the NM tag; then the computed edit distance from the MD tag; followed by the MD tag and the CIGAR string. Anything else you would like me to add?
Actually, you are still missing the insertions, which can only be counted from the CIGAR string. The MD tag will not give you the insertions. See the image below; if you add the insertions as calculated, the numbers match.
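Counting the inserted bases from a CIGAR string could look like the sketch below. With pysam one would typically use read.cigartuples instead, where operation code 1 is an insertion; the plain-string version is just for illustration:

```python
import re

def cigar_insertions(cigar):
    """Sum the lengths of all 'I' (insertion) operations in a CIGAR string."""
    return sum(int(length) for length, op
               in re.findall(r'(\d+)([MIDNSHP=X])', cigar)
               if op == 'I')

cigar_insertions("10M2I5M1I8M")  # -> 3 inserted bases
```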
I still have no idea what is happening in the 2nd read though; the MD & CIGAR look very weird.
Ugh, seems the second example wasn't pasted completely; the full CIGAR should have been:
17465H10M3D3M1D2M3D4M1D3M2D4M1I5M1I9M2I13M1I1M2I12M2I4M2I9M2I14M9I3M5I3M2I5M1I14M7I4M2I4M1I3M1I26M2I9M1D9M4I1M1I11M4I16M2I1M2I12M2D9M1I2M3I10M1I9M1I8M4I10M2I8M2I6M19694H
Looks like I'll need to parse the CIGAR string as well then. Thanks for your insight! Will come back with the results of that.
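Putting the pieces together, a reconstruction of NM from the MD and CIGAR strings would then be substitutions (MD) + deleted bases (MD or CIGAR) + inserted bases (CIGAR). A sketch with made-up tag values; exact NM semantics can still vary slightly between aligners, so treat this as an approximation:

```python
import re

def nm_from_md_and_cigar(md, cigar):
    """Approximate the NM edit distance: substitutions and deleted bases
    come from the MD tag, inserted bases from the CIGAR string."""
    deletions = sum(len(run) - 1 for run in re.findall(r'\^[A-Z]+', md))
    substitutions = sum(c.isalpha() for c in re.sub(r'\^[A-Z]+', '', md))
    insertions = sum(int(n) for n, op
                     in re.findall(r'(\d+)([MIDNSHP=X])', cigar)
                     if op == 'I')
    return substitutions + deletions + insertions

# Hypothetical read: 1 insertion, a 2-base deletion and 1 substitution
nm_from_md_and_cigar("9^AC2A2", "5M1I4M2D5M")  # -> 4
```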
How's that for a correlation?!
Thanks a bunch!
Thanks to you too! Made my morning (day) :)
I might write a short blog post about this journey with some code and obviously have to refer to you for fixing this. Do I use your biostars username?
Sure, my pleasure! PS: This line is just to keep the Biostars character count check happy :)
And here we go, Friday afternoon blogging: Getting the edit distance from a bam alignment: a journey.
Thanks for joining me on the journey ;-)
Wow, cool!!! I glanced at it and it looks very interesting. Unfortunately I'll only be able to read it fully after the Easter holidays. Big thanks :)
Wondering, why this bimodal density distribution?
Yes, I've wondered the same. The average base quality scores show the same bimodal distribution, so presumably, some pores generate lower quality data.
Can you also add read length (both the clipped and original)? BTW, why are you normalizing wrt read length? Because your reads are trimmed, or what..?
Since this is Nanopore data, the read length varies widely, from ~1 kb to ~60 kb. I obviously don't want longer reads to get penalized more.