Getting edit distance from sam tags MD & NM: conceptual misunderstanding?
1
0
Entering edit mode
4.2 years ago

I'm having a hard time implementing a calculation of the edit distance from a sam record.
For a bwa mem alignment, I would resort to taking the NM tag (explicitly defined as the edit distance), but this tag is not set by all aligners, and that's where the trouble starts... Alternatively, I would parse the MD tag to get the number of matches and subtract that from the aligned length. But conceptually that doesn't seem equivalent to the NM tag. I scale both to the aligned read length.

I should clarify this is for analysis of Oxford Nanopore data, which is dominated by indel errors.

I'll add my python code for calculating both "definitions" of the edit distances. I use pysam, but it's more about the logic than the actual code.

Using the NM tag:

read.get_tag("NM")/read.query_alignment_length


So I:

1. take the NM tag which should be the edit distance
2. divide it by the length of the alignment. This excludes clipped sequences from the calculation.

Using the MD tag:

(read.query_alignment_length - sum([int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if not item == '']))/read.query_alignment_length


So I:

1. Take the MD tag and sum all sequence 'matches' by removing all non-matching ACTG^ characters from the string (is this conceptually correct?)
2. I subtract the number of matches from the aligned read length
3. I divide this by the length of the alignment. This excludes clipped sequences from the calculation.

To me, both calculations sound logical, but as a sanity check I plotted them against each other for a sample dataset aligned with bwa mem to see how the metrics correlate. The result looks okay (correlated) for many reads but deviates quite a lot for a subset. I think my mistake is conceptual: a misunderstanding of those tags, presumably in how indels are taken into account.

Below I add a (fancy) scatter plot:
(note the pearsonr is quite okay)

Since this plot is overcrowded by dots here is also the kernel density estimation of the same data:

I think ideally I would take query and reference sequence and calculate the Levenshtein distance, but I'm afraid that's not very efficient for processing thousands of reads. Do you have better solutions?
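For reference, the textbook dynamic-programming Levenshtein distance looks like the minimal sketch below (pure Python; `levenshtein` is just an illustrative helper name, not code from this thread). Its O(n·m) cost is exactly why it scales poorly to thousands of multi-kilobase reads:

```python
def levenshtein(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming edit distance.

    Fine for short strings, but the quadratic cost makes it
    impractical for thousands of multi-kilobase Nanopore reads.
    """
    prev = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to empty b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (or match)
            ))
        prev = curr
    return prev[-1]

# e.g. levenshtein("kitten", "sitting") == 3
```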

sam edit distance • 5.6k views
5
Entering edit mode
4.2 years ago

Hi Wouter, first the definition of NM from the Bowtie manual:

NM:i:<n> The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.

Note that edit distances are one-nucleotide edits.

Now, let's take a little complicated case; one of my reads aligned by BWA

MD:Z:24^A24C0A0G10^AA39

If we parse it (numbers separated from symbols), it becomes

24 ^A 24 C 0 A 0 G 10 ^AA 39

If I understood your code correctly (I don't use Python, sorry!), then you will calculate the MD-based distance as

100 - (24 + 24 + 0 + 0 + 10 + 39) = 3

But this counts only the indel edits. The 3 SNPs (C, A, G) are missing from this calculation. This is also evident from your plots, where the MD-based value is lower than NM. In fact, the NM for this read is 6, as reported by BWA.

IMO, the correct way to calculate Edit Distance is to calculate the number of transitions while reading the MD from left to right. I'll explain it better:

24 -> ^A : First transition, from 24 matches to a one-base deletion. NM = 1

24 -> C : Second transition. A SNP here. NM = 2

0 -> A : Third transition. Another (consecutive) SNP. NM = 3

0 -> G : Fourth transition. Yet another (consecutive) SNP. NM = 4

10 -> ^AA : A deletion of two bases = two edits => NM = 6

You may also like to see this page regarding insertions and the MD tag:

https://github.com/vsbuffalo/devnotes/wiki/The-MD-Tag-in-BAM-Files
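The left-to-right reading above can be sketched as a small parser (a sketch only; `md_edit_distance` is a hypothetical helper name, not from BWA or pysam). Every mismatched base and every base following a `^` adds one edit, while the digit runs are skipped as matches:

```python
import re

# Tokens in an MD tag value: a run of matches (digits),
# a deletion (^ followed by the deleted reference bases),
# or a single mismatched reference base.
_MD_TOKEN = re.compile(r'(\d+)|(\^[ACGTN]+)|([ACGTN])')

def md_edit_distance(md):
    """Count substitutions and deletions encoded in an MD tag value.

    Note: insertions are absent from MD, so this undercounts NM
    whenever the alignment contains inserted bases.
    """
    edits = 0
    for _match, deletion, mismatch in _MD_TOKEN.findall(md):
        if deletion:
            edits += len(deletion) - 1  # deleted bases after the '^'
        elif mismatch:
            edits += 1                  # one substituted base
    return edits

# md_edit_distance("24^A24C0A0G10^AA39") -> 6 (3 deleted bases + 3 SNPs)
```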

0
Entering edit mode

Hi Santosh Anand, Thanks a lot for your reply.

If I understood your code correctly (I don't use Python, sorry!), then you will calculate the MD-based distance as 100 - (24 + 24 + 0 + 0 + 10 + 39) = 3

That seems right, but I'm not sure where you got the 100 from. I assume that's the aligned fragment length.

So while my code first summed the matches, you suggest summing the mismatches. I adapted my code following your suggestion:

sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))]) / read.query_alignment_length


So now I:

1. Take the MD string
2. Split on every number or ^ character, retaining only the nucleotides to get the mismatches
3. Sum the length of the nucleotides from the MD tag
4. Divide that by the aligned read length

I asserted this code (without step 4) yields 6 in your example

sum([len(item) for item in re.split('[0-9^]', "24^A24C0A0G10^AA39")])
6


As is evident from the plots below, this new calculation is definitely an improvement: the dots deviate less from the identity line, and the Pearson correlation coefficient is now 0.96. But it's not yet identical to the NM tag. I wonder if insertions are properly accounted for.

1
Entering edit mode

That seems right, but I'm not sure where you got the 100 from. I assume that's the aligned fragment length.

That's correct!

Your new calculation looks right (according to what I said). Could you pull out some examples of discrepant NM and MD tags? Only then can we see where the error is coming from.

0
Entering edit mode

Will do and get back to you soon!

0
Entering edit mode

A few examples: the first column is the NM tag, then the edit distance computed from the MD tag, followed by the MD tag itself and the CIGAR string. Anything else you'd like me to add?

73      34      21G1C3A1T1A2G4A8T2G4A2G0G0G19^G2G10T6C1A2A0G0G0A5T6^GA5A0G0C0C1G3A5A6C1C2       11684H20M3I7M3I10M1I5M2I1M1I1M3I6M1I4M1I13M2I5M4I3M7I6M1D6M2I1M2I26M1I3M1I2M1I3M2D16M1I6M3I10M10942H
125     55      2A3A3^AAG3^C2^GGA4^G3^GG1A2G1A11A0A3A4C3A3T7G10T8T13C7T0T39C0C9C0C2^C0C14C1T10C0C1C0C0C14T1^TT1C5G4C0C2C0C3C17T1A4A2G3A8        17465H10M3D3M1D2M3D4M1D3M2D4M1I5M1I9M2I13M1I1M2I12M2I4M2I9M2I
81      35      8T2A2T7T4A3A2T1T7T4T3A2T1T6A2A1T7A0A3A2T0T3A2T2A2G2T1C15T0T1T23A1G0A6A5A2       32846H17M1I7M1I1M1I13M1I5M1I12M3I7M1I2M3I11M2I2M1I5M1I7M1I14M1I6M2I4M4I3M2I2M10I2M5I7M2I27M2I1M1I12M1026H
32      10      32G0G7T5T19A0T1^T3^A3^G1A3      42365H21M2I3M3I2M1I2M1I7M6I15M1I3M3I6M3I3M2I8M1D3M1D3M1D5M4662H
50      17      4^G27C0A3A1T8A4G0A11G13A4^C3C0A0G0G0G10G8       225H4M1D4M1I2M2I1M1I23M1I1M2I8M6I2M3I7M1I7M4I1M2I5M3I9M3I4M2I6M1D12M2I15M11609H

2
Entering edit mode

Actually, you are still missing the insertions, which can only be counted from the CIGAR string; the MD tag does not encode insertions. See the image below: if you add the insertions as calculated, the numbers match.

I still have no idea what is happening in the 2nd read though; its MD & CIGAR look very weird.
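That combination can be sketched as follows, checked against the fourth read from the table above (`cigar_insertions` and `md_edits` are hypothetical helper names): the MD tag contributes substitutions and deleted bases, the CIGAR `I` operations contribute the inserted bases, and their sum should reproduce NM.

```python
import re

def cigar_insertions(cigar):
    """Sum the lengths of all insertion (I) operations in a CIGAR string."""
    return sum(int(n) for n, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar)
               if op == 'I')

def md_edits(md):
    """Substitutions plus deleted bases encoded in an MD tag value."""
    return sum(len(tok.lstrip('^')) for tok in re.findall(r'\^?[ACGTN]+', md))

# Fourth read from the table above: NM = 32, MD-only edits = 10
md = "32G0G7T5T19A0T1^T3^A3^G1A3"
cigar = "42365H21M2I3M3I2M1I2M1I7M6I15M1I3M3I6M3I3M2I8M1D3M1D3M1D5M4662H"

# 10 MD edits + 22 inserted bases = 32, matching the NM tag
print(md_edits(md) + cigar_insertions(cigar))
```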

0
Entering edit mode

Ugh, it seems the second example wasn't pasted completely; the full CIGAR should have been:
17465H10M3D3M1D2M3D4M1D3M2D4M1I5M1I9M2I13M1I1M2I12M2I4M2I9M2I14M9I3M5I3M2I5M1I14M7I4M2I4M1I3M1I26M2I9M1D9M4I1M1I11M4I16M2I1M2I12M2D9M1I2M3I10M1I9M1I8M4I10M2I8M2I6M19694H

Looks like I'll need to parse the CIGAR string as well then. Thanks for your insight! Will come back with the results of that.

1
Entering edit mode

How's that for a correlation?!

Thanks a bunch!

0
Entering edit mode

Thanks to you too! Made my morning (day) :)

0
Entering edit mode

I might write a short blog post about this journey with some code and obviously have to refer to you for fixing this. Do I use your biostars username?

1
Entering edit mode

Sure, my pleasure! PS: This line is just here to get past the Biostars minimum character count :)

2
Entering edit mode

And here we go, Friday afternoon blogging: Getting the edit distance from a bam alignment: a journey.

Thanks for joining me on the journey ;-)

0
Entering edit mode

Wow, cool!!! I glanced at it and it looks very interesting. Unfortunately I'll only be able to read it fully after the Easter holidays. Big thanks :)

0
Entering edit mode

Wondering, why this bimodal density distribution?

0
Entering edit mode

Yes, I've wondered the same. The average base quality scores show the same bimodal distribution, so presumably some pores generate lower-quality data.

0
Entering edit mode

Can you also add read length (both the clipped and original) ?

0
Entering edit mode

BTW, why are you normalizing with respect to read length? Because your reads are trimmed, or ...?

0
Entering edit mode

Since this is Nanopore data, the read length varies widely, from ~1 kb to ~60 kb. I don't want longer reads to be penalized more, obviously.