Question: Shotgun metagenomics high duplication read rate - how high is too high?
BioinformaticsLad wrote (16 months ago):

I ran FastQC on a human gut metagenome sample and found a duplication rate of about 80%. Does that seem too high? I checked some environmental samples and saw approximately the same rates.

I've read papers that recommend de-duplicating reads before analysis because duplicates are most likely PCR artefacts. But I've also read papers that recommend keeping all reads, since high-abundance species will be sequenced deeply and some sequences may genuinely be seen more than once. Any thoughts on the matter?
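(To make "duplication rate" concrete: a minimal sketch of counting exact-duplicate reads. FastQC's real estimator is more involved, since it samples only the start of the file and truncates long reads, so treat this purely as an illustration; the function name is made up.)

```python
from collections import Counter

def duplication_rate(seqs):
    """Fraction of reads whose sequence was already seen.

    Toy stand-in for a FastQC-style estimate: FastQC additionally
    samples only the first reads and truncates long sequences.
    """
    counts = Counter(seqs)
    total = sum(counts.values())
    unique = len(counts)
    return (total - unique) / total

# 5 reads, 3 distinct sequences -> 2 of 5 reads are duplicates
reads = ["ACGT", "ACGT", "TTGA", "TTGA", "GGCC"]
```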

shotgun metagenomics wms
colindaven (Hannover Medical School) wrote (16 months ago):

High read depth on small templates, or, in your case, a low-diversity metagenome, does indeed lead to high duplicate rates (see your reply to Josh Herr below). We see this in amplicon datasets all the time.

For your analysis, if you are mapping reads, I would try it both with and without the duplicates, and remember to apply a mapping-quality filter afterwards. Both approaches carry information. Also check which parts of the genome the duplicates come from.
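(The mapping-quality filter above can be sketched in a few lines: MAPQ is the 5th column of a SAM record, and low values flag ambiguous alignments. A pure-Python illustration with a made-up function name; in practice you would use samtools view -q.)

```python
def filter_by_mapq(sam_lines, min_mapq=30):
    """Keep header lines and alignments whose MAPQ (5th SAM column)
    meets the cutoff; drop low-confidence alignments."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):        # header lines pass through
            kept.append(line)
            continue
        fields = line.split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept

sam = [
    "@HD\tVN:1.6",
    "r1\t0\tcontig1\t100\t60\t50M\t*\t0\t0\t*\t*",  # MAPQ 60, kept
    "r2\t0\tcontig1\t200\t5\t50M\t*\t0\t0\t*\t*",   # MAPQ 5, dropped
]
```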

Typically, in our pipeline we remove duplicate and low-quality reads with several tools before analysis (mapping to human + bacterial genomes) and again after mapping (e.g. with Picard).


You bring up a good point about diversity and sequencing depth. I sequenced 30 Gbp for each of my gut samples. Gut samples are typically not very diverse (100-200 species), so could it be that we're simply sequencing too deeply? I'm thinking I should subsample my FASTQ down to 10 Gbp before deduplication.

(reply by BioinformaticsLad)
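(Subsampling to a target base count, as proposed above, can be sketched as below. A tool like seqtk is the usual choice in practice and subsamples by read fraction; this function and its name are illustrative only.)

```python
import random

def subsample_to_bases(records, target_bp, seed=0):
    """Randomly keep FASTQ records (name, seq, qual) until roughly
    target_bp bases have been accumulated. Illustration only."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    kept, total = [], 0
    for rec in shuffled:
        if total >= target_bp:
            break
        kept.append(rec)
        total += len(rec[1])        # rec[1] is the read sequence
    return kept
```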
Vitis (New York) wrote (16 months ago):

Sometimes a high duplication rate results from excessive PCR enrichment of libraries before sequencing (Illumina platform). I would imagine metagenomic samples are diverse enough that it is unlikely you've exhausted all the unique molecules in the sample. I'd suggest taking a look at the library prep protocol and adjusting the number of PCR cycles before submitting for sequencing, to see whether and how the duplication rate is affected.

Josh Herr (University of Nebraska) wrote (16 months ago):

When you say "duplication" rate, you mean read depth, right?

There are a few options here -- ideally you want high depth for assembly, but if you have too much data you'll run out of assembly memory. Gut shotgun metagenome data is typically not very complex compared to soil, so I'm surprised you're seeing the same rates in your environmental samples; which publications did you see this in? The gut read depth itself doesn't surprise me.

To reduce high read depths for assembly, I'm going to point you to the khmer tool. Here's the documentation. (Disclaimer: I worked on this project briefly).
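(The core idea behind khmer's digital normalization can be sketched in pure Python: keep a read only if the median abundance of its k-mers, counted over reads kept so far, is still below a cutoff. khmer itself uses a probabilistic count-min sketch rather than an exact counter, so this is a toy illustration with made-up names, not khmer's implementation.)

```python
from collections import Counter
from statistics import median

def digital_normalization(reads, k=4, cutoff=3):
    """Toy diginorm: drop reads whose k-mer coverage is already high.

    Counts are exact here; khmer approximates them in fixed memory.
    """
    kmer_counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(kmer_counts[km] for km in kmers) < cutoff:
            kept.append(read)
            kmer_counts.update(kmers)
    return kept
```

Identical reads stop being kept once their k-mers are saturated, which is exactly why normalization tames both memory use and duplicate-heavy data before assembly.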

You'll want to map your reads back onto your assembly to establish a rank abundance curve for all the species / strains in your sample.
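(Once reads are mapped back, the rank abundance curve is just per-reference read counts sorted descending and normalized. A minimal sketch, with a hypothetical function name and toy bin labels:)

```python
from collections import Counter

def rank_abundance(assignments):
    """Relative abundance of each reference, most to least abundant.

    `assignments` is one reference/bin name per mapped read.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    return [(ref, n / total) for ref, n in counts.most_common()]

# Toy mapping results: 10 reads across 3 bins
hits = ["bin1"] * 6 + ["bin2"] * 3 + ["bin3"]
```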


Hi, high read depth won't be a problem for me. I'm referring to sequence duplication: distinct reads whose exact sequence appears more than once. For example

(reply by BioinformaticsLad)