Question

How to discard alignments from SAM/BAM file that have mismatches

0

Entering edit mode

10 months ago

Mo ▴ 40

Hi all,

I have sequenced the mRNA of a heterologous library expressed in human cells using nanopore. Then I mapped the reads from the fastq files to the reference database of the library using the minimap.

Then I filtered the SAM file to retain the primary reads with high mapping quality. After this, I visualised the BAM file in IGV:

enter image description here

As you can see there are some reads having mismatches to the reference database. I checked even the highest mapping quality reads are having mismatches. While I am guessing that these reads are mapping to more than one variant in the library dataset, I am looking for a command line that I could use to discard such alignments from SAM/BAM file. Unlike the BWA aligner, the minimap does not generate the MD (Mismatching positions/bases) & XM (Number of mismatches in the alignment) flags in the SAM file. It generates the NM:i flag that informs about the mismatches and INDELS together, so this is not really helpful as I am just looking to discard alignments with mismatches.

I would really appreciate the help to discard alignments with mismatches from SAM/BAM file (better if I can also give a mismatch threshold). Thanks alot.

BAM SAM minimap nanopore • 1.8k views

ADD COMMENT • link updated 6 months ago by Brian Bushnell 20k • written 10 months ago by Mo ▴ 40

score 1 · Answer 1 · 2023-06-24

1

Entering edit mode

10 months ago

GenoMax 141k

minimap2 will generate MD tags. You could also use the second option to identify and the filter reads with perfect matches (if that is what you want).

--MD    Output the MD tag (see the SAM spec).
--eqx   Output =/X CIGAR operators for sequence match/mismatch.

ADD COMMENT • link 10 months ago by GenoMax 141k

1

Entering edit mode

Also, this is cross posted on the minimap2 github repo. Please don't cross-post without reference.

ADD REPLY • link 10 months ago by Rob 6.5k

1

Entering edit mode

Hi Rob,

Thanks, I will take care of it next time.

ADD REPLY • link 10 months ago by Mo ▴ 40

0

Entering edit mode

Hi Geno,

could you please help me with a command line that I can use with MD tag to retain perfect matches? Sorry, I am new to data analysis. Many thanks.

ADD REPLY • link 10 months ago by Mo ▴ 40

score 1 · Answer 2 · 2023-06-26

While not a complete solution for your problem, this might help you further. The problem is that nanopore reads vary hugely in length, from tens of basepairs up to megabases, so mismatch filters must be dynamic relative to read length to make sense.

Samtools can help you out here if you have bam files. eg here to get those with less than 10% or 20% mismatches, just using the NM tag. You should be able to use this with other tags, eg MD.

The wc -l just gives you numbers - read counts, leave out to get the sequences in SAM format.

 # works with samtools 1.16.1 - must upgrade samtools from 1.11

samtools view -e '[NM]/length(seq) < 0.1' test.sorted.bam | wc -l
7
samtools view -e '[NM]/length(seq) < 0.2' test.sorted.bam | wc -l
12

score 0 · Answer 3 · 2023-10-17

BBTools has a tool for this purpose, "filtersam.sh". It works on bam too.

filtersam.sh in=mapped.sam out=filtered.sam ref=ref.fa

This will remove reads with multiple unsupported variations (the exact number and type is configurable). The goal was to remove junky Novaseq reads, where it's generally fine to discard all the reads with errors, since most reads don't have any... I'm not sure how good of an idea it is for Nanopore, though this data does look pretty clean.