high duplication events in shotgun metagenomics
12 months ago
alexis ▴ 10

Hi all,

I am working on a project that aims to recover metagenome-assembled genomes (MAGs) using shotgun metagenomics. During quality analysis with FastQC I found high duplication levels (the percentage of sequences remaining after deduplication is 27%). However, from previous amplicon sequencing, I know that I am working with a low-diversity sample in which one taxon has a high relative abundance (>70%).

I've read in some papers that duplication can harm assembler performance (both in computational cost and assembly quality). I know that everyone has a different take on deduplication, but I still wanted to ask for some recommendations on this.

Also, since I am relatively new to bioinformatics, I would find it really helpful if you could share any preferred workflows or tools that you use for deduplication.

MAG metagenomics duplication
12 months ago
Mensur Dlakic ★ 19k

I think you have such a high duplication rate precisely because your sample is of low diversity. The most abundant species probably ends up with coverage in the thousands. It probably wouldn't hurt to assemble with deduplicated data, even though it may feel like you are throwing away 73% of your reads. Most likely this will end up helping rather than hurting.
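If you do go the deduplication route, one common way to do it (a suggestion on my part, not something named in this thread) is clumpify.sh from BBTools; all filenames below are placeholders:

```shell
# Remove duplicate read pairs with BBTools' clumpify.sh.
# reads_R1/R2.fastq.gz are placeholder filenames for your paired-end data.
clumpify.sh \
    in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
    out1=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
    dedupe=t
```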

Alternatively, you may want to try non-random downsampling of your reads to 80x or 100x. I usually do that with khmer by running its scripts load-into-counting.py, normalize-by-median.py, filter-abund.py and extract-paired-reads.py (in that order). What they do is described here. If you start browsing their pages, I think their name for the procedure is digital normalization.
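A minimal sketch of that four-step khmer workflow, assuming interleaved paired reads in reads.fq and the khmer scripts on your PATH (the k-mer size, coverage cutoff and memory values are illustrative, not prescriptive):

```shell
# 1) Count k-mers into a table (k and memory values are example settings).
load-into-counting.py -k 20 -M 8e9 counts.ct reads.fq

# 2) Digital normalization: discard reads whose median k-mer coverage
#    already exceeds the cutoff (-C), keeping read pairs together (-p).
normalize-by-median.py -k 20 -C 100 -p -l counts.ct -s norm.ct \
    -o reads.keep.fq reads.fq

# 3) Filter out low-abundance (likely erroneous) k-mers, with variable
#    coverage mode (-V) since metagenome coverage is uneven.
filter-abund.py -V norm.ct reads.keep.fq -o reads.filt.fq

# 4) Separate reads that still have a mate (.pe) from orphans (.se).
extract-paired-reads.py reads.filt.fq
```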

A similar downsampling can be achieved using the bbnorm.sh script from BBTools.
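A one-step sketch of the BBTools alternative (target depth, minimum depth and filenames are placeholders):

```shell
# Normalize paired reads to roughly 100x target depth and discard reads
# below depth 5 (example values; tune for your data).
bbnorm.sh \
    in=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
    out=norm_R1.fastq.gz out2=norm_R2.fastq.gz \
    target=100 min=5
```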


Mensur, thank you so much for your recommendations. As you suggested, I followed the khmer documentation and found a really useful tutorial that I'll leave here for anyone who is interested. This is the paper related to the methods they use.


I will add that "throwing away" reads will feel wrong to most people, so you may like to assemble once with all the data just so you have a baseline. If your most abundant species ends up with 1000x+ coverage, the non-random read errors will almost certainly cause fragmentation, and you will probably end up with lots of relatively short contigs. It may feel counterintuitive that one can get a better assembly by thinning down redundant data, but that often ends up being the case.
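As a sketch of that baseline comparison (the answer does not name an assembler; MEGAHIT is just one common metagenome assembler, and all filenames are placeholders):

```shell
# Baseline: assemble all reads as-is.
megahit -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -o assembly_full

# Comparison: assemble the normalized/deduplicated reads, then compare
# contig N50 and total length between the two runs.
megahit -1 norm_R1.fastq.gz -2 norm_R2.fastq.gz -o assembly_norm
```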