With the new release of FastQC, the post titled "So What Does The Sequence Duplication Rate Really Mean In A Fastqc Report" has lost its relevance. This is a follow-up with a short discussion of the new plots and their interpretation.
The new plots contain two different curves, and the meaning of the percentage has also changed. The explanations in the docs are a little bit lacking, so to make sure I got it right I wrote a Python implementation (see the end) that produces the same plots.
I found it helpful to use the term "distinct" sequences rather than "unique" sequences, as the latter term seems to imply to some that those sequences are present only once in the data. Distinct sequences are defined as the largest subset of sequences in which no two sequences are identical.
Thus distinct sequences = number of singletons (sequences that appear only once) + number of doubles (sequences that appear twice, each counted only once) + number of triplets (sequences that appear three times, each counted once) ... and so on.
The percentage in the title is computed as:
distinct/total * 100
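The formula above can be sketched in a few lines. This is a minimal illustration, assuming the reads are already available as a list of strings; the function name `percent_remaining` is made up for this example, and the sketch ignores any binning or subsampling FastQC itself may apply.

```python
from collections import Counter

def percent_remaining(reads):
    """Percentage of reads left if every sequence were kept only once."""
    counts = Counter(reads)        # sequence -> number of occurrences
    distinct = len(counts)         # each distinct sequence counted once
    total = sum(counts.values())   # same as len(reads)
    return 100.0 * distinct / total

# Toy data: AAA twice, CCC once, GGG three times
reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG"]
print(percent_remaining(reads))  # 3 distinct / 6 total
```

Here three distinct sequences out of six reads give 50%.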
The blue line represents the counts of all the sequences that are duplicated at a given rate. The percentage is computed relative to the total number of reads.
The red line represents the number of distinct sequences that are duplicated at a given rate. The percentage is computed relative to the total number of distinct sequences in the data.
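To make the difference between the two lines concrete, here is a rough sketch of how both could be derived from the same counts. The function name `duplication_curves` and the toy reads are assumptions for illustration only; this uses exact duplication levels with no binning.

```python
from collections import Counter

def duplication_curves(reads):
    """Return (blue, red): percentages keyed by duplication level."""
    per_sequence = Counter(reads)               # sequence -> copies
    per_level = Counter(per_sequence.values())  # level -> number of distinct sequences
    total = len(reads)
    distinct = len(per_sequence)
    # Blue: reads at each level, relative to all reads
    blue = {lvl: 100.0 * lvl * n / total for lvl, n in per_level.items()}
    # Red: distinct sequences at each level, relative to all distinct sequences
    red = {lvl: 100.0 * n / distinct for lvl, n in per_level.items()}
    return blue, red

# Toy data: two singletons, one sequence seen twice, one seen four times
reads = ["A", "C", "G", "G", "T", "T", "T", "T"]
blue, red = duplication_curves(reads)
print(blue)
print(red)
```

The sequence seen four times accounts for half of the reads on the blue line, but only a quarter of the distinct sequences on the red line.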
Let's take two examples, each containing 20 reads:
- Case 1: 10 unique reads + 5 reads each present twice (duplicates)
- Case 2: 10 unique reads + 1 read present 10 times
Case 1, shown in the upper plot, leads to 15 distinct reads and thus 15/20=75% remaining. The number of reads that are singletons is 1x10=10 and the number of reads in doubles is 5x2=10, so the blue line sits at the same height (10/20=50%) at both of those rates. The 15 distinct sequences are distributed as 10 singletons and 5 duplicates, so the red line drops from 10/15=67% to 5/15=33%.
Case 2 produces 11 distinct reads, and therefore 11/20=55% of the reads remain. Again the reads are split equally between two duplication rates (10 singletons and 10 copies of one read), but this time the second peak of the blue line is at 10, since one read duplicated 10 times produces 10 sequences. For the red line there are 11 groups in total, of which 10/11=91% are singletons and 1/11=9% sit at a duplication rate of 10x.
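As a quick sanity check of the arithmetic in both cases, the numbers can be recomputed with a small sketch. This is separate from the plotting code; the helper name `summarize` and the placeholder read names (`u0`, `d0`, ...) are invented for this example, and percentages are rounded to whole numbers.

```python
from collections import Counter

def summarize(reads):
    """Return (percent remaining, blue, red), rounded to whole percentages."""
    copies = Counter(reads)            # read -> number of copies
    levels = Counter(copies.values())  # duplication level -> distinct reads
    total, distinct = len(reads), len(copies)
    remaining = round(100 * distinct / total)
    blue = {k: round(100 * k * n / total) for k, n in levels.items()}
    red = {k: round(100 * n / distinct) for k, n in levels.items()}
    return remaining, blue, red

# Case 1: 10 unique reads + 5 reads each present twice
case1 = [f"u{i}" for i in range(10)] + [f"d{i}" for i in range(5)] * 2
# Case 2: 10 unique reads + 1 read present 10 times
case2 = [f"u{i}" for i in range(10)] + ["d0"] * 10

print(summarize(case1))
print(summarize(case2))
```

Both cases give 50% per level on the blue line, while the red line and the percent remaining differ.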
Below is the Python code that was used to produce the plots above.