Question

Handling high duplication levels in the FastQC report of raw Illumina metagenomics data

9.6 years ago

bioinfo ▴ 840

I have read a few threads here about short-read duplication levels in raw data produced by sequencing platforms, and I have also read Istvan's tutorial on this issue. However, I didn't find an easy, straightforward answer to "what should one do about high duplication levels in filtered WGS metagenomics data?" This time, I ran Trim Galore to filter my raw dataset and then looked at the FastQC report for my short-read dataset, which shows almost 79.5% duplication (based on the old v0.10.0 version of FastQC). The other FastQC results look OK. Some say that when abundance is the goal, we should just ignore the duplication level reported by FastQC, because the same read can come from multiple organisms in a metagenome. I know that for RNA-seq many people ignore the FastQC duplication level entirely and move on to downstream analysis. Since I'm interested in the abundance of certain genes in a metagenome, should I also ignore the duplication level, even though it seems quite high (79.5%), and move on to abundance analysis of my genes of interest? How do you handle high duplication levels? Do you remove the duplicate (i.e. non-unique) sequences?

fastqc metagenomics illumina • 3.5k views

updated 3.0 years ago by Ram 45k • written 9.6 years ago by bioinfo ▴ 840

Ram · Answer 1 · 2015-12-12

Sequence duplication is either natural or artificial. Natural duplication shows up when the coverage of a region is very high, though I don't have a good sense of how the math works out, i.e. what percentage of duplication is to be expected for N reads of length L.
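For a rough feel of that math, here is a minimal sketch under a deliberately simple model: a single genome of size G (an assumed parameter, not mentioned above), single-end reads of length L, and read starts drawn uniformly at random from both strands. It gives the expected fraction of reads that share a start position with an earlier read. This is not how FastQC computes its duplication figure, and a real metagenome with uneven organism abundance will deviate from it; the function names are made up for illustration.

```python
import math


def start_positions(genome_size, read_length):
    """Number of distinct read starts on a linear genome, counting both strands.

    Assumption: single-end reads, every start equally likely.
    """
    return 2 * (genome_size - read_length + 1)


def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that repeat an already-seen start position.

    With N reads drawn uniformly from P positions, the expected number of
    distinct positions hit is P * (1 - (1 - 1/P)**N).
    """
    if n_reads <= 0:
        return 0.0
    if n_positions == 1:
        # Every read after the first is a duplicate.
        return 1.0 - 1.0 / n_reads
    # log1p/expm1 keep the computation stable when P is very large.
    expected_distinct = -n_positions * math.expm1(
        n_reads * math.log1p(-1.0 / n_positions)
    )
    return 1.0 - expected_distinct / n_reads


if __name__ == "__main__":
    # Hypothetical numbers: a 5 Mb genome, 100 bp reads, increasing depth.
    positions = start_positions(5_000_000, 100)
    for n in (1_000_000, 10_000_000, 50_000_000):
        frac = expected_duplicate_fraction(n, positions)
        print(f"{n:>11,d} reads: {frac:.1%} expected duplicates")
```

Under this model the duplicate fraction only becomes large once the number of reads approaches or exceeds the number of possible start positions, which is consistent with the point that natural duplication appears at very high coverage.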

What is important to establish is whether the sequences that show high duplication rates cover entire genomes (or happen to be present in many genomes). Find the sequences that duplicate at high levels and see whether they all align to very few (and the same) targets.
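One way to pull out the most duplicated sequences is sketched below. It assumes plain 4-line FASTQ records and counts exact, full-length sequence matches only; the function names are hypothetical. The resulting FASTA text could then be aligned with whatever aligner or BLAST setup you already use, to check whether the top duplicates hit the same few targets.

```python
from collections import Counter
from itertools import islice


def count_sequences(fastq_lines):
    """Count exact-duplicate read sequences in an iterable of FASTQ lines.

    Assumes strict 4-line records, so the sequence is line 2 of each record.
    """
    return Counter(line.strip() for line in islice(fastq_lines, 1, None, 4))


def top_duplicates(counts, n=10, min_count=2):
    """Return up to n (sequence, count) pairs seen at least min_count times."""
    return [(seq, c) for seq, c in counts.most_common(n) if c >= min_count]


def to_fasta(duplicates):
    """Format (sequence, count) pairs as FASTA text for downstream alignment."""
    return "".join(
        f">dup{i}_count{c}\n{seq}\n"
        for i, (seq, c) in enumerate(duplicates, start=1)
    )


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open text handle,
    # e.g. gzip.open(path, "rt") for a gzipped FASTQ.
    demo = [
        "@r1", "ACGTACGT", "+", "IIIIIIII",
        "@r2", "ACGTACGT", "+", "IIIIIIII",
        "@r3", "TTTTGGGG", "+", "IIIIIIII",
    ]
    counts = count_sequences(demo)
    print(to_fasta(top_duplicates(counts)))
```

Holding every distinct sequence in memory is fine for a subsample but may be heavy for a full metagenome run, so counting on the first few million reads is a reasonable starting point.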