Deduplication rate of shotgun metagenomes using fastqc
4 months ago
vikasmh111 • 0

Hello, I am doing a shotgun metagenome analysis of a gut sample. The data were generated on an Illumina platform. The FastQC report on the raw fastq file indicated 52% duplication. I processed the file with bowtie2 for human read removal, deduplicated it with fastp (using accuracy 6), and trimmed it with trimmomatic, losing about 15% of the reads in processing. FastQC estimated a 51.5% duplication rate for the processed file, while fastp and HTS_super_deduper estimated less than 0.5% duplicate reads in the processed file (for the unprocessed file they estimated 9% and 12%). Is FastQC a reliable tool for estimating read duplication in metagenomes? Is there a recommended tool other than FastQC for checking fastq read quality? Any suggestions would be greatly appreciated. Thank you.

metagenome fastqc shotgun deduplication • 810 views
4 months ago
GenoMax 153k

Is fastqc a reliable tool for estimation of duplication of reads in metagenomes?

FastQC is a reliable tool, but because of time and memory constraints it sub-samples the data when estimating some of the parameters it checks. For the sequence duplication module, only sequences that first appear in the first 100,000 sequences of each file are analyzed, so the reported duplication level is an estimate rather than an exact count.
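To see how much the tracking window matters on your own data, here is a minimal Python sketch that counts exact-match duplicates either over every read or only for sequences first seen within the first `track_limit` reads. It is a simplification for illustration, not FastQC's actual implementation; the function names and the `track_limit` parameter are made up for this example, and it assumes a plain 4-line-per-record FASTQ.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_percent(sequences, track_limit=None):
    """Percent of tracked reads that are exact copies of an earlier read.

    If track_limit is given, only sequences whose first occurrence falls
    within the first track_limit reads are tracked (hypothetical parameter,
    loosely mimicking a fixed tracking window). Later reads still count
    when they match a tracked sequence.
    """
    counts = Counter()
    for index, seq in enumerate(sequences):
        if seq in counts:
            counts[seq] += 1
        elif track_limit is None or index < track_limit:
            counts[seq] = 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total
```

For an uncompressed file this could be called as `duplication_percent(read_fastq_sequences(open("reads.fastq")))`, where `reads.fastq` is a placeholder path. Comparing the result with and without `track_limit=100_000` shows how far a windowed estimate drifts from the full count.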

If you must de-duplicate sequences, you can do so with an alignment-independent tool such as clumpify.sh from the BBMap suite, which works at the sequence level. See "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."


Thank you, will try and check.

4 months ago

I would be more interested in checking the taxonomic composition of the microbiome before worrying about read duplication. If a couple of samples contain highly dominant bacteria, those samples will look highly duplicated even without technical duplicates. I think you need to check the sequencing quality relative to the composition of the dataset.
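As a rough back-of-the-envelope illustration of that point, the sketch below computes the expected share of reads that are exact positional copies purely by chance, when reads are sampled uniformly from one genome. The assumptions are mine and idealized: single-end reads of fixed length, uniform coverage, no PCR or optical duplicates, and a duplicate defined as two reads sharing a start position and strand. The function name and parameters are hypothetical.

```python
import math


def expected_duplicate_percent(n_reads, n_start_positions):
    """Expected percent of reads that repeat an already-sampled start position.

    Assumes each read picks its start position uniformly and independently
    from n_start_positions possibilities (an idealized model).
    """
    if n_start_positions < 1:
        raise ValueError("n_start_positions must be at least 1")
    if n_reads <= 0:
        return 0.0
    if n_start_positions == 1:
        expected_distinct = 1.0
    else:
        # P(a given position is never hit) = (1 - 1/G)^n, computed stably.
        log_miss = n_reads * math.log1p(-1.0 / n_start_positions)
        expected_distinct = n_start_positions * -math.expm1(log_miss)
    return 100.0 * (1.0 - expected_distinct / n_reads)


# Hypothetical numbers: a 5 Mb genome has about 10 million start
# positions when both strands are counted.
dominant = expected_duplicate_percent(10_000_000, 10_000_000)
rare = expected_duplicate_percent(100_000, 10_000_000)
```

Under this model the dominant organism (as many reads as start positions) comes out at roughly 37% duplicates, while the rare one stays below 1%, with no technical duplication involved in either case. Paired-end data would lower these figures, since both mates have to coincide.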

Also, different datasets will contain different amounts of human read "contamination".
