I'm having a hard time calculating the edit distance from a SAM record.
For a bwa mem alignment, I would resort to taking the NM tag (explicitly defined as the edit distance), but this tag is not set by all aligners, so that's where the trouble starts... Alternatively, I could parse the MD tag to get the number of matches and subtract that from the aligned length, but conceptually that doesn't seem equivalent to the NM tag. I scale both to the aligned read length.
I should clarify this is for analysis of Oxford Nanopore data, which is dominated by indel errors.
I'll add my Python code for calculating both "definitions" of the edit distance. I use pysam, but it's more about the logic than the actual code.
Using the NM tag:
- take the NM tag, which should be the edit distance
- divide it by the length of the alignment, which excludes clipped sequences from the calculation (see the sketch below)
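In code this amounts to something like the following (a minimal sketch, assuming `read` is a pysam AlignedSegment that actually has the NM tag set):

```python
def nm_divergence(read):
    # NM-based edit distance, scaled to the aligned length;
    # query_alignment_length ignores clipped bases by definition
    return read.get_tag("NM") / read.query_alignment_length
```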
Using the MD tag:
```python
import re

(read.query_alignment_length - sum(int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if item != '')) / read.query_alignment_length
```
- take the MD tag and sum all sequence 'matches' by removing all non-matching ACTG and ^ characters from the string (is this conceptually correct? see the worked example below)
- subtract the number of matches from the aligned read length
- divide this by the length of the alignment, which excludes clipped sequences from the calculation
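To make the parsing concrete, here is what the split does on a hypothetical MD tag (invented for illustration, not from my data):

```python
import re

# hypothetical MD tag: 10 matches, a mismatch (A), 5 matches,
# a 2-base deletion (^AC), and 6 more matches
md = "10A5^AC6"
match_runs = [int(item) for item in re.split('[ACTG^]', md) if item != '']
print(match_runs, sum(match_runs))  # [10, 5, 6] -> 21 matches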
To me, both calculations sound logical, but as a sanity check I plotted them against each other for a sample dataset aligned with bwa mem to see how the metrics correlate. The result looks okay (correlated) for many reads, but deviates quite a lot for a subset. I think my mistake is conceptual, or a misunderstanding of those tags; I assume something about how indels are taken into account is tripping me up.
Below I add a (fancy) scatter plot:
(note that the Pearson r is quite okay)
Since this plot is overcrowded with dots, here is also the kernel density estimate of the same data:
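For completeness, the comparison itself is roughly this (a simplified sketch, not my exact plotting code; `nm_values` and `md_values` stand for equal-length lists of the two per-read metrics):

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

def compare_metrics(nm_values, md_values):
    # correlation between the two per-read edit distance definitions
    r, _ = pearsonr(nm_values, md_values)
    plt.scatter(nm_values, md_values, s=2, alpha=0.3)
    plt.xlabel("NM-based edit distance / aligned length")
    plt.ylabel("MD-based edit distance / aligned length")
    plt.title(f"Pearson r = {r:.3f}")
    plt.show()
```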
Ideally, I think I would take the query and reference sequences and calculate the Levenshtein distance directly, but I'm afraid that's not very efficient for processing thousands of reads. A sketch of what I mean is below. Do you have better solutions?
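Something along these lines (a minimal sketch, assuming an indexed reference FASTA and the python-Levenshtein package; `reference.fa` and `alignments.bam` are placeholder paths):

```python
import pysam
import Levenshtein

def levenshtein_divergence(read, fasta):
    # compare the aligned part of the query against the reference span it maps to
    ref = fasta.fetch(read.reference_name, read.reference_start, read.reference_end)
    # .upper() in case the FASTA is soft-masked
    dist = Levenshtein.distance(read.query_alignment_sequence, ref.upper())
    return dist / read.query_alignment_length

fasta = pysam.FastaFile("reference.fa")  # placeholder path
with pysam.AlignmentFile("alignments.bam", "rb") as bam:  # placeholder path
    for read in bam:
        if not (read.is_unmapped or read.is_secondary or read.is_supplementary):
            print(levenshtein_divergence(read, fasta))
```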