Upstream (5′) Coverage Depletion Observed in IGV
1
0
Entering edit mode
12 days ago
safeassli ▴ 10

Hello, I’m hoping for some guidance. I’m working with ONT reads and want to randomly select 7,000 reads from a fastq.gz, keeping only those with Phred > 15. I use Chopper for quality filtering, a fixed random.seed for reproducibility, and reservoir sampling to draw the 7000 reads.

After mapping this subset to a reference with minimap2 and viewing the alignments in IGV, I observe what looks like a “funnel” pattern in the pileup (pic 1). To address this, I added a minimum-coverage filter for read acceptance and set it to 70; that produced pic 2, but some reads still don’t align end-to-end. Increasing mincov to 90 gave pic 3.

Because many reads pass Phred > 15, I sometimes choose 7000–12000 reads to keep runtime reasonable. However, when the resulting alignments don’t fully cover the reference, I worry I’m losing information. Is this expected?

Basecalling was performed for all Nanopore runs with Guppy version 3.2.9 or 4.3.4

In my scriprt I dont trim reads.

I’m puzzled because, using EPI2ME with only Phred > 15 and reads = 7000, the BAM it produces does not show the same profile in IGV; specifically, I do not see the left-flank (5′) coverage drop-off there.

PS: My ultimate goal is variant calling.

This might be a basic question, but I’m a bit lost. Any insights would be appreciated. I’m hoping for some guidance.

pic1:enter image description here

pic2enter image description here

pic3enter image description here

igv ont • 774 views
ADD COMMENT
0
Entering edit mode

Minimum-coverage logic:

For each read, I iterate over minimap2 hits and accept the read if at least one hit has coverage_read >= min_coverage, where:

aligner = mp.Aligner(reference_genome, preset="map-ont")

coverage_read_percent = (hit.q_en - hit.q_st) / len(read) * 100

hit.q_st, hit.q_en: start/end of the aligned segment on the read (query coordinates) len(read): total read length from the FASTQ

ADD REPLY
0
Entering edit mode

What locus are we even looking at?

ADD REPLY
0
Entering edit mode

thank you for replying it's cytochrome b gene

ADD REPLY
0
Entering edit mode

I’m puzzled because, using EPI2ME with only Phred > 15 and reads = 7000, the BAM it produces does not show the same profile in IGV; specifically, I do not see the left-flank (5′) coverage drop-off there.

With the same 7000 reads you select that you also manually align with minimap2? Are you using default parameters for alignment? This is a multi-copy gene?

ADD REPLY
0
Entering edit mode

yes , I’m using minimap2 with the default -x map-ont preset (no extra flags beyond read-group), then filtering with samtools view -F 2304 to drop secondary and supplementary alignments. Then I sort and index

ADD REPLY
0
Entering edit mode

It looks like ONT is using sligntly different pre-sets for their EPI2ME protocols. See the post explaining the change here --> https://epi2me.nanoporetech.com/how-to-align/ See if you are able to get similar results if you switch.

Is this data from whole genome sequencing or are you using selective sequencing to just get cyt b gene. Are you using the complete genome as reference?

ADD REPLY
0
Entering edit mode
1 day ago

Hello,

Thanks for the detailed description and screenshots—this is a classic conundrum with long-read (ONT) data, and it's great that you're digging into it for variant calling. I'll walk through what might be going on with the "funnel" (coverage drop-off, especially at the 5' end), why EPI2ME behaves differently, and some practical steps to troubleshoot/fix it. I've seen similar patterns before, often tied to alignment artifacts rather than your sampling itself.

Quick Interpretation of the Patterns

  • Pic 1 (no mincov filter): The funnel shape is likely due to soft-clipping during alignment. Minimap2 (great choice for ONT) aggressively clips low-quality or mismatched bases at read ends to maximize alignment score. Nanopore reads inherently have higher error rates (~5-15% overall, worse at ends due to basecalling artifacts), so the 5' (left) flank drops off as mappers discard those noisy tails. This isn't "losing information" per se—it's the aligner being conservative—but it can bias pileups and downstream variant calling if not accounted for.

  • Pic 2/3 (mincov 70/90): Adding a coverage filter helps by rejecting partial alignments, but you're still seeing incomplete end-to-end coverage. This suggests some reads are inherently chimeric/fragmented or have uncallable ends from Guppy basecalling (v3.2.9 is older; v4.3.4 is better but still prone to this). The fact that runtime stays reasonable with 7k-12k reads is fine—ONT variant callers like Clair3 or PEPPER can handle that subsample for testing.

  • Why EPI2ME looks cleaner: EPI2ME Labs (Oxford Nanopore's workflow) uses a more integrated pipeline:

    • It often includes adaptive sampling or post-basecalling trimming (e.g., via Poretools or Filtlong) to remove low-quality ends before alignment.
    • Minimap2 is run with ONT-optimized params (e.g., -ax map-ont --secondary=no), and it post-processes BAMs to minimize clipping artifacts.
    • No explicit trimming in your script is good (avoids over-trimming), but EPI2ME implicitly handles end biases via its Dorado basecaller or Guppy re-squiggling.

For variant calling, this funnel can be expected (ONT reads aren't perfect), but you don't want it skewing your calls—e.g., false negatives at flanks due to low coverage.

Is This Expected? Short Answer: Yes, But Fixable

Yes, partial alignments are common in raw ONT data (especially older Guppy versions), but the discrepancy with EPI2ME points to pipeline differences. Your sampling (Chopper + reservoir) is solid for reproducibility and quality gating (Phred >15 is a good threshold; I'd bump to >20 if possible for cleaner ends). The issue is likely in mapping/post-alignment.

Suggestions to Improve

  1. Tweak Minimap2 Parameters:

    • Use ONT-specific presets and disable secondary alignments to reduce chimeras/artifacts.
    • Add --end-bonus or --end-penalty to penalize/reward end matches less aggressively (helps with funnels).

    Example command (assuming your ref is ref.fa and subset FASTQ is subset.fastq.gz):

    minimap2 -ax map-ont \
             --secondary=no \
             -t 8 \
             --end-bonus 10 \
             ref.fa subset.fastq.gz | \
    samtools sort -o aligned.bam - && \
    samtools index aligned.bam
    
    • --secondary=no: Ignores supplementary alignments (often noisy ends).
    • --end-bonus 10: Slightly favors full-span alignments without over-clipping.
    • Re-run and reload in IGV—should reduce the 5' drop-off.
  2. Add Light Trimming for Ends (Optional, But Targeted):

    • Since you don't trim now, try a gentle end-trim on low-Phred tails after Chopper. Use Nanofilt (from NanoPack)—it's fast and preserves read IDs.
      # After Chopper, before minimap2
      NanoFilt --headcrop 50 --tailcrop 50 --maxlength 10000 subset.fastq > trimmed.fastq  # Adjust crops based on your read lengths
      
      • --headcrop 50: Trims first 50 bp (common noisy 5' region in ONT).
      • This mimics what EPI2ME does implicitly. Test on a small subset first.
  3. Filter Post-Alignment:

    • Your mincov=90 is aggressive; try mapping first, then filter BAM for full-coverage reads:
      # After alignment, extract reads spanning full ref (adjust for your ref length, e.g., 1000 bp)
      samtools view -h aligned.bam | awk '$6 !~ /S/ && $9 >= 1 && $10 >= 1000' | samtools view -b > fullcov.bam
      
      • $6 !~ /S/: No soft-clipping in CIGAR.
      • This keeps only end-to-end aligns. If too few reads remain, relax to allow minor clips (e.g., <10 bp).
  4. Basecalling Check:

    • Guppy 3.2.9 is outdated—re-basecall a sample with latest (v6+) + --trim_barcodes or --trim_adapter if applicable. This often fixes end biases without manual intervention.
    • For variant calling, use Dorado (ONT's new basecaller) if you can—it's faster and produces squiggled reads with better end quality.
  5. For Variant Calling:

    • Once alignments look better, subsample to 7k-10k high-quality reads and run:
      • Clair3: Handles ONT noise well; use --bam_fn=aligned.bam --ref_fn=ref.fa --platform=ont.
      • Or PEPPER-Margin-DeepVariant for higher accuracy on subsamples.
    • If coverage is still patchy, increase your sample to 15k reads—ONT callers are robust to that.
    • Validate with known variants (e.g., from short-read data) to ensure no flank bias.

Next Steps

  • Re-run with the minimap2 tweaks above and share the new IGV screenshot if it persists—we can iterate.
  • If your ref is microbial/small, this should resolve quickly; for larger genomes, consider chunking.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3300 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6