Question

Deduplication rate of shotgun metagenomes using fastqc

6 months ago

vikasmh111 • 0

Hello, I am doing a shotgun metagenome analysis of a gut sample. The shotgun metagenome data were generated on an Illumina platform. The FastQC report on the FASTQ file indicated that my metagenome has 52% duplication. I processed the file with bowtie2 for human read removal, deduplicated it with fastp (using accuracy 6), and trimmed it with trimmomatic, losing about 15% of the reads in processing. FastQC estimated a 51.5% duplication rate for the processed file, while fastp and HTS_super_deduper estimated less than 0.5% duplicate reads in the processed file (for the unprocessed file it was 9% and 12%). Is FastQC a reliable tool for estimating read duplication in metagenomes? Is there any recommended tool other than FastQC for checking FASTQ read quality? Any suggestions would be greatly appreciated. Thank you.

metagenome fastqc shotgun deduplication • 958 views

updated 6 months ago by colindaven 8.1k • written 6 months ago by vikasmh111 • 0

score 3 · Answer 1 · 2025-05-02

Is fastqc a reliable tool for estimation of duplication of reads in metagenomes?

FastQC is a reliable tool, but because of time/memory constraints it sub-samples the data when estimating some of the parameters it checks. For the sequence duplication module, only sequences that first appear in the first 100,000 sequences in each file are analyzed.
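To make the sub-sampling concrete, here is a minimal Python sketch of that kind of estimate. It is not FastQC's actual code: the function names, the plain "1 - distinct/total" formula, and the truncation defaults (reads over 75 bp cut to 50 bp, which is how I understand the FastQC documentation) are assumptions for illustration only.

```python
from collections import Counter
from itertools import islice


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4 lines) of an uncompressed FASTQ."""
    return islice(fastq_lines, 1, None, 4)


def duplication_estimate(sequences, track_limit=100_000,
                         truncate_over=75, truncate_to=50):
    """Rough FastQC-style duplication percentage (hypothetical helper).

    Only sequences first seen within the first `track_limit` reads are
    tracked; later reads are counted only if they match a tracked sequence.
    """
    counts = Counter()
    for i, seq in enumerate(sequences):
        seq = seq.strip()
        # Assumed behaviour: long reads are compared on a truncated prefix.
        if len(seq) > truncate_over:
            seq = seq[:truncate_to]
        if seq in counts:
            counts[seq] += 1
        elif i < track_limit:
            counts[seq] = 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Percentage of tracked reads that would be removed by deduplication.
    return 100.0 * (1 - len(counts) / total)
```

Because the comparison is done on a sample (and, under the assumption above, on truncated reads), the number can differ a lot from what a whole-file, full-length deduplicator such as fastp reports.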

If you must de-duplicate sequences, you can do so with an alignment-independent tool such as clumpify.sh from the BBMap suite, which works at the sequence level. See the thread "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."
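For intuition about what "works at sequence level" means, here is a small Python sketch that keeps the first read for each distinct sequence. It is only an illustration of exact-match deduplication, not how clumpify.sh is implemented (clumpify.sh is run with its `dedupe` option and can also handle pairs and mismatches); `parse_fastq` and `dedupe_exact` are hypothetical names, and the input is assumed to be a well-formed, uncompressed, single-end FASTQ.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip(), seq.strip(), qual.strip()


def dedupe_exact(records):
    """Yield the first record seen for each distinct sequence.

    No alignment is involved: two reads are duplicates only if their
    sequences are identical strings.
    """
    seen = set()
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq, qual
```

Usage would be along the lines of `list(dedupe_exact(parse_fastq(open("reads.fastq"))))`, where `reads.fastq` is a placeholder file name. Holding every sequence in a set is memory-hungry for large files, which is one reason to prefer a dedicated tool.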

score 0 · Answer 2 · 2025-05-07

I would be more interested in checking the taxonomic composition of the microbiome before worrying about read duplication. If a few bacteria are highly dominant in a couple of samples, those samples would look highly duplicated. I think you need to assess the sequencing quality relative to the composition of the dataset.

Also, different datasets will contain different amounts of human read "contamination".
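As a back-of-the-envelope check on the point about dominant bacteria, the sketch below computes how many identical reads you would expect purely by chance when many reads are drawn from a limited number of possible start sites. The model is an assumption, not a measurement: single-end reads of equal length, no sequencing errors, start sites sampled uniformly, and `expected_duplicate_fraction` is a hypothetical helper.

```python
def expected_duplicate_fraction(n_reads, n_start_sites):
    """Expected fraction of reads that duplicate an earlier read by chance.

    With n_reads drawn uniformly from n_start_sites possible reads, the
    expected number of distinct reads is G * (1 - (1 - 1/G) ** N).
    """
    if n_start_sites <= 0:
        raise ValueError("n_start_sites must be positive")
    if n_reads <= 0:
        return 0.0
    expected_distinct = n_start_sites * (1 - (1 - 1 / n_start_sites) ** n_reads)
    return 1 - expected_distinct / n_reads


# Illustrative comparison: the same read count from a small dominant
# genome versus a much larger pool of possible start sites.
small_pool = expected_duplicate_fraction(1_000_000, 100_000)
large_pool = expected_duplicate_fraction(1_000_000, 10_000_000)
```

Under this toy model `small_pool` comes out far higher than `large_pool`, so a sample dominated by one small genome can show heavy duplication without any PCR or optical duplicates being involved.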