High duplication events in shotgun metagenomics
2.8 years ago
alexis ▴ 10

Hi all,

I am working on a project that aims to recover metagenome-assembled genomes (MAGs) from shotgun metagenomics data. During quality control with FastQC I found high duplication levels (only 27% of sequences would remain after deduplication). However, from previous amplicon sequencing I know that I am working with a low-diversity sample in which one taxon has a high relative abundance (>70%).

I've read in some papers that duplication can harm assembler performance (both in computational cost and assembly quality). I know that everyone has a different take on deduplication, but I still wanted to ask for some recommendations on this.

Also, since I am relatively new to bioinformatics, I would find it really helpful if you could share any preferred workflows or tools that you use for deduplication.

MAG metagenomics duplication • 1.3k views
2.8 years ago
Mensur Dlakic ★ 27k

I think you have such a high duplication rate precisely because your sample is of low diversity. The most abundant species probably ends up with coverage in the thousands. It probably wouldn't hurt to assemble with deduplicated data, even though it may feel like you are throwing away 73% of your reads. Most likely this will end up helping rather than hurting.
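If you do go the deduplication route, one tool you could try for that step (I'm naming it only as an illustration; nothing in this thread prescribes a specific deduplicator) is clumpify.sh from the same BBTools suite mentioned below. A minimal sketch with placeholder file names:

    # Remove duplicate read pairs before assembly; see the BBTools documentation
    # for optical-duplicate and allowed-mismatch options
    clumpify.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=dedup_R1.fq.gz out2=dedup_R2.fq.gz dedupe=t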

Alternatively, you may want to non-randomly downsample your reads to about 80x or 100x coverage. I usually do that with khmer by running its scripts load-into-counting.py, normalize-by-median.py, filter-abund.py, and extract-paired-reads.py (in that order); what they do is described here. In the khmer documentation the procedure is called digital normalization; a rough command sketch follows below.
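To make the order concrete, here is a rough sketch assuming interleaved paired-end reads in a placeholder file called reads.interleaved.fq.gz; the -k, -M, and -C values are only examples, and flag names differ somewhat between khmer releases, so check each script's --help before running:

    # 1) Build a k-mer count table from the raw reads (k-mer size and memory budget are examples)
    load-into-counting.py -k 20 -M 8e9 reads.countgraph reads.interleaved.fq.gz

    # 2) Digital normalization: discard reads whose median k-mer coverage already exceeds the cutoff
    normalize-by-median.py -p -k 20 -C 100 -o reads.keep.fq reads.interleaved.fq.gz

    # 3) Trim likely error k-mers; -V (variable-coverage mode) avoids trimming genuinely low-coverage reads
    filter-abund.py -V reads.countgraph reads.keep.fq

    # 4) Re-pair the surviving reads so the assembler receives proper pairs plus orphans
    extract-paired-reads.py reads.keep.fq.abundfilt

The paired output of the last step (a .pe file by default) is what you would give the assembler, with the .se orphans optionally included as single-end reads.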

A similar downsampling can be achieved using the bbnorm.sh script from BBTools.
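For the BBTools route, a minimal invocation would look something like this (placeholder file names; target matches the ~100x suggestion above, and min drops reads that look like pure sequencing error):

    # Normalize paired reads to ~100x k-mer depth, discarding reads with apparent depth under 5
    bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=norm_R1.fq.gz out2=norm_R2.fq.gz target=100 min=5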


Mensur, thank you so much for your recommendations. As you suggested, I followed the khmer documentation and found a really useful tutorial that I'll leave here for anyone who is interested. This is the paper related to the methods they use.


I will add that "throwing away" reads will feel wrong to most people, so you may want to assemble once with all the data just to have a baseline. If your most abundant species ends up with 1000x+ coverage, non-random read errors will almost certainly cause fragmentation, and you will probably end up with lots of relatively short contigs. It may feel counterintuitive that one can get a better assembly by thinning down redundant data, but that often turns out to be the case.
