Question

Total Duplicate Percentage Reported In Fastqc

2

11.2 years ago

henryvuong ▴ 810

I ran FastQC on Illumina FASTQ files from a MiSeq and found a very high level of sequence duplication, as reported elsewhere. Here, I just want to understand how the total duplicate percentage is calculated. The output from fastqc_data.txt is:

Sequence Duplication Levels fail
Total Duplicate Percentage 90.87717882197605
Duplication Level Relative count
1 100.0
2 17.39881642975786
3 8.231572445683474
4 4.841954142913186
5 2.9630989841995876
6 1.9648267630438956
7 1.3296385019239276
8 0.9081272379744089
9 0.6627325615364712
10++ 3.6083033545619205

Where does 90.87717882197605 come from? Thanks in advance.

fastqc fastq quality • 6.4k views

ADD COMMENT • link updated 11.2 years ago by toni ★ 2.2k • written 11.2 years ago by henryvuong ▴ 810

score 4 · Answer 1 · 2013-02-24

4

11.2 years ago

Istvan Albert 100k

If I recall correctly, the per-level percentages are computed relative to the number of unique reads; note that the first value is 100%. The total duplicate percentage, in contrast, is relative to the total number of reads in the sample.

What this means is that even though the sequences duplicated 10 or more times have a relative count of only about 3.6%, some of them are duplicated at very high rates, tens of thousands of times, and thus produce more than 90% of the data.
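To make the arithmetic concrete, here is a minimal sketch in Python with made-up counts. It assumes the relative counts are scaled to the number of sequences seen exactly once, and that the total duplicate percentage is the share of reads beyond the first copy of each distinct sequence. Both definitions are assumptions for illustration and are not taken from the FastQC source; `duplication_summary` is a hypothetical helper, and the toy numbers are not the poster's data.

```python
from collections import Counter


def duplication_summary(copies_per_sequence):
    """Summarise duplication from a list of copy counts, one per distinct sequence.

    Assumed definitions (illustrative only, not FastQC's actual code):
      - relative count of level k = 100 * (#sequences at level k) / (#sequences at level 1)
      - duplicate percentage = 100 * (total reads - distinct sequences) / total reads
    """
    # Bin every distinct sequence by its copy number; 10 and above share one bin.
    levels = Counter(min(copies, 10) for copies in copies_per_sequence)
    singletons = levels[1]
    relative = {lvl: 100.0 * levels[lvl] / singletons for lvl in range(1, 11)}

    total_reads = sum(copies_per_sequence)
    distinct = len(copies_per_sequence)
    duplicate_pct = 100.0 * (total_reads - distinct) / total_reads
    return relative, duplicate_pct


# Toy data: 1000 sequences seen once, 170 seen twice, 30 seen 400 times each.
toy = [1] * 1000 + [2] * 170 + [400] * 30
relative, duplicate_pct = duplication_summary(toy)

print(relative[1], relative[2], relative[10])  # relative counts per level
print(round(duplicate_pct, 2))                 # share of reads that are duplicates
```

In this toy case the 10++ bin has a relative count of only 3, yet those 30 sequences contribute 12000 of the 13340 reads, so the duplicate percentage computed over all reads ends up at about 91%.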

0

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Duplicate%20Sequences.html

ADD REPLY • link 11.2 years ago by toni ★ 2.2k

0

Thank you all for the prompt responses.

ADD REPLY • link 11.2 years ago by henryvuong ▴ 810

0

If you are happy with Istvan's answer, you may choose it as the correct answer. (No obligation, just suggesting it in case you now have all the information you were looking for.)