Hi all,
I am working on a project that aims to recover Metagenome assembled genomes using shotgun metagenomics. During the quality analysis using FastQC I found high duplication levels (percent of sequences remaining if deduplicated is 27%). However, from previous amplicon metagenomics, I know that I am working with a low diversity sample in which one taxon has a high relative abundance (>70%).
I've read in some papers that duplication can harm assembler performance (both in computational cost and assembly quality). I know that everyone has a different take on deduplication, but I still wanted to ask for some recommendations on this.
Also, since I am relatively new in bioinformatics, I would find really helpful if you could share any preferred workflows or tools that you use for deduplication.
Mensur, thank you so much for your recommendations, As you suggested I followed the documentation of khmer and found a really useful tutorial that I'll leave here, for anyone who is interested. This is the paper related to the methods they use.
I will add that "throwing away" the reads will feel wrong to most people, so you may like to assemble once with all the data just so you have a baseline. If your most abundant species ends up having a 1000+x coverage, the non-random read errors will almost certainly cause the fragmentation, and you will probably end up with lots of relatively short contigs. It may feel counterintuitive that one can get a better assembly by thinning down the redundant data, but that often ends up being the case.