Total Duplicate Percentage Reported In Fastqc
8.7 years ago
henryvuong ▴ 810

I ran FastQC on Illumina FASTQ files from a MiSeq and found a very high level of sequence duplication, as reported elsewhere. Here, I just want to understand how the total duplicate percentage is calculated. The output from fastqc_data.txt is:

Sequence Duplication Levels    fail
Total Duplicate Percentage     90.87717882197605

Duplication Level    Relative count
1                    100.0
2                    17.39881642975786
3                    8.231572445683474
4                    4.841954142913186
5                    2.9630989841995876
6                    1.9648267630438956
7                    1.3296385019239276
8                    0.9081272379744089
9                    0.6627325615364712
10++                 3.6083033545619205

Where does 90.87717882197605 come from? Thanks in advance.

fastqc fastq quality • 5.2k views
8.7 years ago

If I recall correctly, the relative counts are computed relative to the number of unique reads (duplication level 1), which is why the first number is 100%. The total duplicate percentage, by contrast, is relative to the total number of reads in the sample.

What this means is that even though the sequences duplicated 10 or more times amount to only about 3.6% on that relative scale, some of them are duplicated at very high rates, tens of thousands of times, and thus produce more than 90% of the data.
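To make the two denominators concrete, here is a minimal sketch on made-up data. It is not taken from the FastQC source: the function name `duplication_summary`, the toy reads, and in particular the formula for the total duplicate percentage (reads beyond the first copy of each sequence, divided by all reads) are assumptions chosen to match the explanation above. FastQC itself also bins levels into 10++ and may only track a subset of the file, which this sketch ignores.

```python
from collections import Counter


def duplication_summary(reads):
    """Return (relative counts per duplication level, total duplicate percentage).

    Assumed definitions (illustrative, not confirmed against FastQC):
      - relative count of level k = 100 * (#distinct sequences seen k times)
                                        / (#distinct sequences seen once)
      - total duplicate percentage = 100 * (total reads - distinct sequences)
                                         / total reads
    """
    seq_counts = Counter(reads)                  # sequence -> occurrences
    level_counts = Counter(seq_counts.values())  # level -> number of distinct sequences
    singletons = level_counts.get(1, 0)
    if singletons == 0:
        raise ValueError("no sequence occurs exactly once; relative counts undefined")

    relative = {
        level: 100.0 * n / singletons
        for level, n in sorted(level_counts.items())
    }
    total = len(reads)
    distinct = len(seq_counts)
    total_dup_pct = 100.0 * (total - distinct) / total
    return relative, total_dup_pct


# Toy data: 100 sequences seen once, 5 seen twice, 1 seen 10,000 times.
reads = (
    [f"UNIQ{i}" for i in range(100)]
    + [f"PAIR{i}" for i in range(5) for _ in range(2)]
    + ["ADAPTER"] * 10_000
)

relative, total_dup_pct = duplication_summary(reads)
print(relative)                  # {1: 100.0, 2: 5.0, 10000: 1.0}
print(round(total_dup_pct, 2))   # 98.95
```

In this toy set the heavily duplicated sequence contributes a relative count of only 1.0, yet almost 99% of the reads are duplicates, which is the same pattern as in the report above.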


Thank you all for the prompt responses.


If you are happy with Istvan's answer, you may choose it as the correct answer. (No obligation, just suggesting it in case you now have all the information you were looking for.)


Thanks, Tony. Yes, I only just learned about the green check.
