Question

Get number of reads with a single, (almost) exact match to the full length of a reference sequence

0

Entering edit mode

21 months ago

Nat • 0

I can't find an answer to this in previous questions, so hoping someone can help me now.

We have nanopore sequenced a PCR product, and I've filtered our reads to +/- 10 bp of the expected product length (~4,400bp), leaving ~55,000 reads.

We now want to work out how many of the reads correspond to each of 6 different potential arrangements. The section we're interested in is in the middle of the 4,400 bp PCR product, and each different potential arrangement is ~1,500 bp. We have reference sequences for each of the 1,500 bp arrangements. Basically, we want to know what portion of our starting cells had each of the different arrangements of genes in this operon (e.g arrangement 1=30%, arrangement 2=15%, etc.).

This seems like it should be a really easy thing to do using read mapping to count alignments for each reference separately, but I just can't get my head around it. No matter what options I've tried with minimap2 and samtools, every read maps every time on every different arrangement. I think the problem is that genes in our reference sequences are all present in every read, just in different arrangements. So we need to only count alignments which align to the whole 1,500 bp reference sequence, not partial hits or hits with huge gaps in them.

I have tried the minimap2 options -r, -G and --secondary=no to restrict the output to single alignments with no large gaps, and combined that with samtools view -m 1000 to output only alignments longer than 1,000 bp (this is far stricter than it needs to be to only output alignments to any of the different potential arrangements). But still, every time, every read is mapping to every reference, usually multiple times.

What am I doing wrong?! Am I totally over-complicating this, or doing something obviously stupid? Almost feel like using blastn might be better at this point...

TL;DR: how do I use minimap2 and/or samtools (or any other tool better suited to this) to count how many of our ~4,400 bp reads contain only full-length, almost exact matches to our 1,500 bp reference sequence, excluding any partial hits?

My latest attempt (broken into individual steps, though I'd usually pipe the samtools steps together, also not bothered with samtools sort at this point):

minimap2 -ax map-ont -r10 -N5 --secondary=no --no-long-join reference.fasta reads.fastq > reads.minimap.sam
samtools view -m 1000 -b -o reads.minimap.bam reads.minimap.sam
samtools -c reads.minimap.bam

minimap nanopore samtools • 595 views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 21 months ago by Nat • 0

0

Entering edit mode

Almost feel like using blastn might be better at this point...

Actually using a global alignment tool may be better. You could try needleall: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

ADD REPLY • link 21 months ago by GenoMax 141k