I seem to be getting a malformed bam from bwa mem
After bwa mem & using GRCh38DH, 1000 Genomes Project version as ref I run ValidateSamFile and get many errors as follows:
ERROR::INVALID_TAG_NM:Record 349909, Read name LH00400:21:22N52NLT4:7:1101:27171:1154, NM tag (nucleotide differences) in file [0] does not match reality [1]
I inspect one of the offending reads and see:
CIGAR: 101M48S
RNAME: chr2
POS: 16145020
NM tag: NM:i:0
I pulled the corresponding ref sequence to compare to the read sequence:
samtools faidx my_ref.fa chr2:16145020-16145168
The read and ref are identical for the first 100 bases, eg:
ref[:100] == read[:100]
True
but not after that:
ref[:101] == read[:101]
False
starting with base 101 ref is all 'N''s
ref[100:]
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'
what is going on here?
My main two questions are:
- Why does the CIGAR indicate 101M when it seems only the first 100 bases match between ref and read?
- where exactly is this INVALID_TAG_NM error coming from and how to fix (or can this be ignored)? I tried running
samtools calmd -br {input.bam} {input.ref}
and I can get ValidateSamFile to pass, but am concerned about why I would have to do this at all.
Thanks for the response
I think I am more wondering if this is a typical thing that people see and am wondering what is wrong with my data that might be causing it (which I am struggling a bit to grasp)
A google search of "ERROR::INVALID_TAG_NM" produces hardly any hits. I tried BWA MEM v0.7.18, 0.7.19 & BWA MEM and getting the same error with all of them.
The error is happening for ~0.5% of my reads, so it is kind of a lot.
Do you think running
samtools calmd -br {input.bam} {input.ref}
post-alignment is a reasonable solution to fixing the error?