Oxford Nanopore Variant Calling Pipeline Calls Very Few of Variants of NA12878
2
0
Entering edit mode
3 months ago
hkarakurt ▴ 150

Hello everyone, First, I am so sorry for this long and very amateur question. I am trying to build a pipeline for SNP calling for Oxford Nanopore MinION based long reads. I need to test the pipeline but apparently the number of test data is really low. I only have Na12878 data from this address: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md

I downloaded the FAST5 data coded as "FAB43577" (it is said that data has 427,215 reads and 2,776,702,333 bases). I used Guppy V5.0.1 as basecaller with the command:

guppy_basecaller -i /home/huk/Desktop/nanopore_data/na12878_fast5/data2/UCSC/FAB43577-3574887596_Multi -s /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out -c dna_r9.4.1_450bps_fast.cfg --trim_barcodes --trim_strategy dna --num_callers 1 --cpu_threads_per_caller 12


Then I merged all FASTQ files inside the "pass" folder of Guppy results with "cat" command and obtained single FASTQ.

minimap2 -ax map-ont -t 12 /home/huk/Desktop/references/hg38/hg38.mmi /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out/pass/all_data.fastq --MD > /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2.md.sam


I transformed the SAM file to BAM file with samtools. I indexed and sorted the file as well. Then I used longshot for variant calling only on chr20 via the command:

longshot --bam /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2_sorted_md.bam --ref /home/huk/Desktop/references/hg38/hg38.fa -F -r chr20 --out /home/huk/Desktop/nanopore_data/na12878_fast5/data2/vcf_output/longshot_result.vcf


My final VCF have 827 (without filtering) variants. I downloaded the high confidence VCF file of NA12878 from this link https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/supplementaryFiles/

In this VCF, chr20 have about 67957 SNPs. I compared the variants and only 8 of them are common in both VCFs.

I also used nanopolish index and nanopolish variants for variant calling but the final VCF is completely empty (only headers and comments of standard VCF).

I am not sure why I have very low number of variants. If anyone can give me a hint or tell me what I am doing wrong I would be really grateful. I am completely stuck here. If you know another test data (if there is) for variant calling of Oxford Nanopore MinION, it would be awesome too.

Variant Calling SNP MinION ONT • 542 views
0
Entering edit mode

Maybe is off-topic but if you are testing a pipeline for SNP calling with ONT-reads you should use the high-accuracy mode (dna_r9.4.1_450bps_hac.cfg) instead of the in fast mode (dna_r9.4.1_450bps_fast.cfg)

0
Entering edit mode

Actually high accuracy mode is taking too long. But I also tested the pipeline on basecalled FASTQ reads that I downloaded from same web page. Results are the same.

2
Entering edit mode
3 months ago
colindaven ★ 3.5k

Check the depth of coverage you have, eg with samtools depth, sambamba depth, etc. I don't think you have enough data aligned to chr20.

Use samtools stats, fastqc, and multiqc to further check your data.

2
Entering edit mode
12 weeks ago

Hi there,

You're only using one flowcell of data and the average coverage on the genome will only be about 0.9x coverage. This is definitely going to limit your ability to detect any type of genetic event. As mentioned in the thread, you should calculate the coverage using something like samtools or NanoPlot. The last time I looked, although both fastqc and multiqc can take long reads as input, they produce metrics that are not really relevant for ONT data (e.g. fastqc still produced a plot looking for Illumina adapters) so I'd probably use Nanoplot

Currently (Oct 2021), keep in mind that Longshot is a SNP only detection tool but your GIAB "truthset" will probably have indels in it as well, so you should really filter those out if you continue to use Longshot, otherwise your comparison isn't completely correct (or at least you may artificially inflate your FN rate). In any case, you could consider using LRA (https://github.com/ChaissonLab/LRA) for the alignment although Minimap2 will be probably be fine. For the SNP/indel detection, you should consider Clair3 (https://github.com/HKU-BAL/Clair3) or even Medaka (https://github.com/nanoporetech/medaka) - both of these tools are producing pretty good benchmarks these days (see the respective links).

In terms of additional datasets, have a look here: https://labs.epi2me.io/dataindex/

0
Entering edit mode

Thank you for amazing answer. So, basically using multiple flowcells will increase the number of variants theoretically. I am only looking for SNPs so Longshot is still okay for me but I will try using the tools you mentioned. Also, thank you for additional datasets link.