I'm using KneadData to perform quality control on my FASTQ files. My samples are contaminated with sequences from multiple genomes, so I passed multiple reference-db arguments in my kneaddata command (3 genomes). As a result I got 3 different clean outputs, which I concatenated. After running Kraken2 I saw that my recovered reads were 200% more than the raw reads.
One possible explanation is that, because I concatenated the clean outputs from three different genomes, the non-contaminated reads ended up duplicated.
Do you have any pointers on how I should deal with multiple contaminants in my raw sample, or on how to remove duplicate reads?
Thank you so much.
dedupe.sh from the BBMap suite should help there. Something like
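Independently of dedupe.sh, the duplication described in the question can also be handled by read ID. The following is only a minimal Python sketch, not a BBMap or KneadData feature. It assumes uncompressed FASTQ files with four lines per record, single-end reads (or one mate per file), and read IDs that KneadData left unchanged. The names `read_fastq`, `merge_clean_outputs` and `require_all` are made up for this illustration.

```python
from collections import Counter


def read_fastq(path):
    """Yield (read_id, record) for a plain FASTQ file with 4 lines per record."""
    with open(path) as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0].strip():
                break  # end of file (or trailing blank line)
            # Read ID: header text up to the first whitespace, without the '@'.
            yield record[0][1:].split()[0], record


def merge_clean_outputs(paths, out_path, require_all=False):
    """Write each read ID once; return the number of reads written.

    require_all=False: keep a read if it appears in any input file
                       (plain de-duplication of the concatenated outputs).
    require_all=True:  keep a read only if it appears in every input file,
                       i.e. it was not flagged against any reference.
    """
    # First pass: count in how many files each read ID occurs.
    counts = Counter()
    for path in paths:
        counts.update({read_id for read_id, _ in read_fastq(path)})

    # Second pass: write the first occurrence of every read that qualifies.
    written = set()
    with open(out_path, "w") as out:
        for path in paths:
            for read_id, record in read_fastq(path):
                if read_id in written:
                    continue
                if require_all and counts[read_id] < len(paths):
                    continue
                written.add(read_id)
                out.write("\n".join(record) + "\n")
    return len(written)
```

Calling it with the three clean files (hypothetical names), e.g. `merge_clean_outputs(["clean_A.fastq", "clean_B.fastq", "clean_C.fastq"], "merged.fastq", require_all=True)`, keeps only reads that survived filtering against all three genomes. With plain de-duplication, a read removed as contamination from one genome would still come back through the other two outputs, so `require_all=True` is probably closer to what the question is after.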