Oxford Nanopore Variant Calling Pipeline Calls Very Few of Variants of NA12878
2
0
Entering edit mode
6 weeks ago
hkarakurt ▴ 130

Hello everyone, First, I am so sorry for this long and very amateur question. I am trying to build a pipeline for SNP calling for Oxford Nanopore MinION based long reads. I need to test the pipeline but apparently the number of test data is really low. I only have Na12878 data from this address: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md

I downloaded the FAST5 data coded as "FAB43577" (it is said that data has 427,215 reads and 2,776,702,333 bases). I used Guppy V5.0.1 as basecaller with the command:

guppy_basecaller -i /home/huk/Desktop/nanopore_data/na12878_fast5/data2/UCSC/FAB43577-3574887596_Multi -s /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out -c dna_r9.4.1_450bps_fast.cfg --trim_barcodes --trim_strategy dna --num_callers 1 --cpu_threads_per_caller 12

Then I merged all FASTQ files inside the "pass" folder of Guppy results with "cat" command and obtained single FASTQ.

minimap2 -ax map-ont -t 12 /home/huk/Desktop/references/hg38/hg38.mmi /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out/pass/all_data.fastq --MD > /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2.md.sam

I transformed the SAM file to BAM file with samtools. I indexed and sorted the file as well. Then I used longshot for variant calling only on chr20 via the command:

longshot --bam /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2_sorted_md.bam --ref /home/huk/Desktop/references/hg38/hg38.fa -F -r chr20 --out /home/huk/Desktop/nanopore_data/na12878_fast5/data2/vcf_output/longshot_result.vcf

My final VCF have 827 (without filtering) variants. I downloaded the high confidence VCF file of NA12878 from this link https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/supplementaryFiles/

In this VCF, chr20 have about 67957 SNPs. I compared the variants and only 8 of them are common in both VCFs.

I also used nanopolish index and nanopolish variants for variant calling but the final VCF is completely empty (only headers and comments of standard VCF).

I am not sure why I have very low number of variants. If anyone can give me a hint or tell me what I am doing wrong I would be really grateful. I am completely stuck here. If you know another test data (if there is) for variant calling of Oxford Nanopore MinION, it would be awesome too.

Thank you in advance.

Variant Calling SNP MinION ONT • 448 views
ADD COMMENT
0
Entering edit mode

Maybe is off-topic but if you are testing a pipeline for SNP calling with ONT-reads you should use the high-accuracy mode (dna_r9.4.1_450bps_hac.cfg) instead of the in fast mode (dna_r9.4.1_450bps_fast.cfg)

ADD REPLY
0
Entering edit mode

Actually high accuracy mode is taking too long. But I also tested the pipeline on basecalled FASTQ reads that I downloaded from same web page. Results are the same.

ADD REPLY
2
Entering edit mode
6 weeks ago
colindaven ★ 3.3k

Check the depth of coverage you have, eg with samtools depth, sambamba depth, etc. I don't think you have enough data aligned to chr20.

Use samtools stats, fastqc, and multiqc to further check your data.

ADD COMMENT
2
Entering edit mode
6 weeks ago
adoran ▴ 10

Hi there,

You're only using one flowcell of data and the average coverage on the genome will only be about 0.9x coverage. This is definitely going to limit your ability to detect any type of genetic event. As mentioned in the thread, you should calculate the coverage using something like samtools or NanoPlot. The last time I looked, although both fastqc and multiqc can take long reads as input, they produce metrics that are not really relevant for ONT data (e.g. fastqc still produced a plot looking for Illumina adapters) so I'd probably use Nanoplot

Currently (Oct 2021), keep in mind that Longshot is a SNP only detection tool but your GIAB "truthset" will probably have indels in it as well, so you should really filter those out if you continue to use Longshot, otherwise your comparison isn't completely correct (or at least you may artificially inflate your FN rate). In any case, you could consider using LRA (https://github.com/ChaissonLab/LRA) for the alignment although Minimap2 will be probably be fine. For the SNP/indel detection, you should consider Clair3 (https://github.com/HKU-BAL/Clair3) or even Medaka (https://github.com/nanoporetech/medaka) - both of these tools are producing pretty good benchmarks these days (see the respective links).

In terms of additional datasets, have a look here: https://labs.epi2me.io/dataindex/

ADD COMMENT
0
Entering edit mode

Thank you for amazing answer. So, basically using multiple flowcells will increase the number of variants theoretically. I am only looking for SNPs so Longshot is still okay for me but I will try using the tools you mentioned. Also, thank you for additional datasets link.

ADD REPLY

Login before adding your answer.

Traffic: 2416 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6