Hello!
We perform 16S profiling of a microbial community. We sequenced a library of PCR-amplified 16S sequences on R10.4.1 pores and basecalled the reads with Dorado. The problem is that when I compute statistics for the same FASTQ file with different software, I get different results.
1) When the data is analysed by Nanopore software (either with the wf-16s pipeline, which generates read quality statistics in the final report, or with NanoPlot), I get an average read Q-score of ~21.
2) When the data is analysed with fastqc + multiqc or seqkit, I get an average read Q-score >30.
seqkit output:
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%)
all_raw.fastq.gz FASTQ DNA 2,429,312 3,492,160,612 1 1,437.5 79,704 1,461 1,495 1,504 0 1,496 86.3 75.34
Where does such a huge discrepancy come from?
I also asked this question on GitHub; there are pictures from the fastqc + multiqc and wf-16s pipeline reports there:
https://github.com/epi2me-labs/wf-16s/issues/39
Thank you!
Are you only looking at reads that satisfy this filter (which is in your command line in the GitHub post) in your seqkit and fastqc reports? With this filter, do the read numbers going into the two programs match?
Thank you for the answer. No, I am looking at all reads before applying any filters.
There are 2,429,312 raw reads, and in the wf-16s pipeline, after applying the filters --min_len 1400 --max_len 1600 --min_read_qual 10, I got a total of 2,049,340 reads.
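One thing that might be worth checking is how each tool averages the per-base scores, since there are two common ways to do it and they can give very different numbers for the same read. This is only a minimal sketch of the two methods, not a statement about what any of the tools above actually does (I have not confirmed that). The function names and the example read are made up for illustration, and a Phred+33 encoding is assumed.

```python
import math


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_q_arithmetic(quals):
    """Plain arithmetic mean of the Phred scores."""
    return sum(quals) / len(quals)


def mean_q_via_error_prob(quals):
    """Convert each score to an error probability, average those,
    then convert the mean probability back to a Phred score."""
    mean_p = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_p)


# Hypothetical read: 90 bases at Q40 and 10 bases at Q5.
quals = [40] * 90 + [5] * 10

print("arithmetic mean of Q-scores:", mean_q_arithmetic(quals))
print("mean via error probabilities:", round(mean_q_via_error_prob(quals), 2))
```

For a read like this, a handful of low-quality bases pulls the probability-based mean far below the arithmetic mean, so running both calculations on a few of your own reads could show whether this kind of effect is large enough to explain ~21 vs >30.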