Hello, I’m hoping for some guidance. I’m working with ONT reads and want to randomly select 7,000 reads from a fastq.gz, keeping only those with Phred > 15. I use Chopper for quality filtering, a fixed random.seed for reproducibility, and reservoir sampling to draw the 7000 reads.
After mapping this subset to a reference with minimap2 and viewing the alignments in IGV, I observe what looks like a “funnel” pattern in the pileup (pic 1). To address this, I added a minimum-coverage filter for read acceptance and set it to 70; that produced pic 2, but some reads still don’t align end-to-end. Increasing mincov to 90 gave pic 3.
Because many reads pass Phred > 15, I sometimes choose 7000–12000 reads to keep runtime reasonable. However, when the resulting alignments don’t fully cover the reference, I worry I’m losing information. Is this expected?
Basecalling was performed for all Nanopore runs with Guppy version 3.2.9 or 4.3.4
In my scriprt I dont trim reads.
I’m puzzled because, using EPI2ME with only Phred > 15 and reads = 7000, the BAM it produces does not show the same profile in IGV; specifically, I do not see the left-flank (5′) coverage drop-off there.
PS: My ultimate goal is variant calling.
This might be a basic question, but I’m a bit lost. Any insights would be appreciated. I’m hoping for some guidance.
pic1:
pic2
pic3
Minimum-coverage logic:
For each read, I iterate over minimap2 hits and accept the read if at least one hit has coverage_read >= min_coverage, where:
aligner = mp.Aligner(reference_genome, preset="map-ont")
coverage_read_percent = (hit.q_en - hit.q_st) / len(read) * 100
hit.q_st, hit.q_en: start/end of the aligned segment on the read (query coordinates) len(read): total read length from the FASTQ
What locus are we even looking at?
thank you for replying it's cytochrome b gene
With the same 7000 reads you select that you also manually align with
minimap2? Are you using default parameters for alignment? This is a multi-copy gene?yes , I’m using minimap2 with the default -x map-ont preset (no extra flags beyond read-group), then filtering with samtools view -F 2304 to drop secondary and supplementary alignments. Then I sort and index
It looks like ONT is using sligntly different pre-sets for their EPI2ME protocols. See the post explaining the change here --> https://epi2me.nanoporetech.com/how-to-align/ See if you are able to get similar results if you switch.
Is this data from whole genome sequencing or are you using selective sequencing to just get
cyt bgene. Are you using the complete genome as reference?