Discrepancy in Q-score assessment of ONT reads in Nanopore and third-party software reports
5 hours ago
k-tarasov ▴ 10

Hello! We perform 16S profiling of a microbial community. We sequenced a library of PCR-amplified 16S sequences on R10.4.1 pores and basecalled the reads with Dorado. The problem is that when I compute statistics for the same fastq file with different software, I get different results:

1) When the data is analysed with Nanopore software (either the wf-16s pipeline, which generates read-quality statistics in its final report, or NanoPlot), I get an average read Q-score of ~21.

2) When the data is analysed with FastQC + MultiQC or seqkit, I get an average read Q-score >30.

seqkit output:

file              format  type   num_seqs        sum_len  min_len  avg_len  max_len     Q1     Q2     Q3  sum_gap    N50  Q20(%)  Q30(%)
all_raw.fastq.gz  FASTQ   DNA   2,429,312  3,492,160,612        1  1,437.5   79,704  1,461  1,495  1,504        0  1,496    86.3   75.34

Where does such a huge discrepancy come from?

I also asked this question on GitHub; the issue there includes screenshots from the FastQC + MultiQC and wf-16s pipeline reports:

https://github.com/epi2me-labs/wf-16s/issues/39

Thank you!

seqkit q-score fastqc 16s-wf nanoplot

Are you only looking at reads that satisfy this filter (which appears in your command line in the GitHub post) in your seqkit and FastQC reports?

--min_len 1400 --max_len 1600 --min_read_qual 10 

With this filter, do the read numbers going into the two programs match?
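
For example, a minimal Python sketch of such a check (the file name and thresholds are taken from the post; it assumes the pipeline's per-read quality is the Phred value of the averaged per-base error rates, which the discussion below suggests is how the Nanopore tools compute it):

import gzip
import math

def mean_q(quals):
    # Average the per-base error probabilities, then convert the
    # mean error rate back to a Phred score.
    mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)

kept = total = 0
with gzip.open("all_raw.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 3:  # every fourth line of a FASTQ record is the quality string
            total += 1
            quals = [ord(c) - 33 for c in line.rstrip("\n")]
            if 1400 <= len(quals) <= 1600 and mean_q(quals) >= 10:
                kept += 1

print(f"{kept:,} of {total:,} reads pass the length/quality filter")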

Thank you for your answer. No, I am looking at all reads, before applying any filters.

There are 2,429,312 raw reads; after applying the filters --min_len 1400 --max_len 1600 --min_read_qual 10, the wf-16s pipeline kept a total of 2,049,340 reads.

4 hours ago
k-tarasov ▴ 10

Thanks to colindaven's answer, I managed to follow the hyperlinks to the source of the discrepancy. Nanopore tools compute the average Q-score of a read as follows: they convert the individual per-base Q-scores to error probabilities, average those probabilities, and then convert the average error rate back to a Phred value. That value is reported as the average Q-score. Third-party tools such as FastQC, seqkit, fastp, and fastplong, as far as I know, instead take the arithmetic mean of the Q-scores themselves (summing the per-base Q-scores and dividing by the number of bases). Because the Phred scale is logarithmic, the arithmetic mean overestimates the quality.
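
A minimal Python sketch of the two conventions (the quality values are invented, only to show how far apart the two means can get):

import math

def mean_q_arithmetic(quals):
    # FastQC/seqkit-style: arithmetic mean of the per-base Q-scores.
    return sum(quals) / len(quals)

def mean_q_from_error_rates(quals):
    # Nanopore/Dorado-style: convert each Q-score to an error
    # probability, average the probabilities, convert back to Phred.
    mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)

quals = [40] * 99 + [10]               # 99 bases at Q40, one at Q10
print(mean_q_arithmetic(quals))        # 39.7
print(mean_q_from_error_rates(quals))  # ~29.6

The single Q10 base contributes almost all of the expected errors, so it dominates the error-rate mean, while the arithmetic mean barely notices it.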

Useful links to read more about it:

https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/

https://community.nanoporetech.com/posts/what-is-the-base-value-for

The Dorado source lines where the mean Q-score is computed:

https://github.com/nanoporetech/dorado/blob/a7fb3e3d4afa7a11cb52422e7eecb1a2cdb7860f/dorado/utils/sequence_utils.cpp#L132

5 hours ago
colindaven

I agree completely. I think the cause is well discussed here: https://github.com/OpenGene/fastplong/issues/20

fastplong still appears to have this problem, though, so I would trust the NanoPlot or Chopper results more. This issue can have large effects on Q-score filtering prior to assembly or alignment.
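
For example, with a hypothetical read of 80 bases at Q20 plus a 20-base Q2 tail, the two conventions land on opposite sides of a Q10 cutoff:

import math

quals = [20] * 80 + [2] * 20  # hypothetical per-base Q-scores

arith = sum(quals) / len(quals)                              # 16.4 -> passes Q10
mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
phred = -10 * math.log10(mean_err)                           # ~8.7 -> fails Q10
print(f"arithmetic mean: {arith:.1f}, error-rate mean: {phred:.1f}")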
