Tutorial: So What Does The Sequence Duplication Rate Really Mean In A Fastqc Report
20
6.9 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

Note: there is an updated version of FastQC see Revisiting the FastQC read duplication report

Recent discussions on the duplication rates reported by FastQC A: Illumina mate pair read duplication level and RNAseq library quantification made me realize that I never thought this through and perhaps this value may not even be what I think it ought to be.

What is really a duplication rate?

I would intuitively define duplication rate as `1 - unique reads/total reads`.

Of course we'll have to define what unique read means. Let's see what FastQC reports. Take this dataset (20Mb) from a bioinformatics lecture series. It contains two sample files, duplications rates reported via FastQC are `11.98%` and `32.09%`.

The FastQC documentation has a section covering the methodology. The plot that it generates is particularly unhelpful as it combines all reads with more than 10 duplications in the last bin, moreover the number of unique reads dwarfs the rest, so usually one can't see much (or to put it differently if you are able to notice anything it usually means really bad data). The duplication values are displayed at the top.

Each of our test files contain 100,000 reads. First we compute a sequence statistics where the occurrence of each read is counted:

``````# create unique sequence counts
\$cat sample1.fq | ~/bin/bioawk -c fastx ' { print substr(\$seq, 1, 50) } ' | sort | uniq -c | sort -rg > seqstats.txt

# the number of unique sequences in the data is then
\$wc -l seqstats.txt
92173 seqstats.txt

# the duplication rate would then be
\$echo '(1- 92173/100000) * 100' | bc -l
7.83
``````

If we take my original definition of uniqueness then the duplication rate is `7.83` but that is different from what FastQC reports `11.98`

It seems FastQC will only consider a read unique if it appears only once:

``````# take reads that only appear once
\$ cat seqstats.txt | awk -F ' ' ' \$1 == 1 { print \$2 }' | wc -l
88023

\$ echo '(1- 88023/100000) * 100' | bc -l
11.97700000000000000000
``````

The same math checks out for the second sample as well:

``````\$ cat sample2.fq | ~/bin/bioawk -c fastx '{print substr(\$seq, 1, 50) }' | sort | uniq -c | sort -rg | awk -F ' ' ' \$1 == 1 { print \$2 }' | wc -l
67913

\$ echo '(1- 67913/100000) * 100' | bc -l
32.08
``````

But this is much larger than the value one would get by looking at the actually unique sequences in the data, that duplication rate would be `17.83%`.

In the end I have come to believe that the FastQC reported duplication rate is not the right representation. Having a sequence duplicated does not actually mean that the "original" sequence should also be removed from computation.

Doing so does not allow us to distinguish between cases where say out of 100K reads one single read would be duplicated 50K times, versus having 25K reads each duplicated 2 times. FastQC would in both cases report a 50% duplication rate whereas a more intuitive definition would report a duplication rate of 50% and 25%

I'll stick to my opinion that duplication should look at all unique reads. But perhaps there is a compelling rationale of why this is not the right approach.

modified 6.2 years ago • written 6.9 years ago by Istvan Albert ♦♦ 84k

@Istvan Albert: Interesting...good to know

Is there a nicer visualization of duplication rate that would be good to show for each sample? Or perhaps there's just one value per sample in a barplot...

3

I have started using a rarefaction plot where the X axis is the reads sequenced and the Y axis is the number of unique reads with the unique reads defined as reads with the same starting position on each end. I don't log it.

We do a lot of tech development where we need to know the complexity of the libraries. I have decided that this ends up being more informative than a rarefaction plot based on genes detected because you estimate the total size of the library if you can see the asymptote. It uses a good chunk of memory but I'll stick it up on github if anyone wants it. I have been trying for a couple weeks to find an equation that fits the line so I can estimate the library size but I haven't been able to do better with my math than with my eyes. Michaelis Menten does not suck but it is not a a good enough fit to justify using it as a metric. Downsampling reads to the point where you have hit saturation seems to improve the performance of over sequenced libraries.

I've also charted scatter plots where the X axis is the total reads aligned to the gene and the Y axis is the total number of duplicate reads aligned to the gene. The points end up going in a diagonal across the chart. A line with a slope of 1 means everything is duplicates. It's a little goofy because it's all in the lower diagonal but you can compare two different libraries that way.

If your rRNA is high it might make sense to remove those reads before putting the library through because they are going to all be duplicates (because they're short with lots of reads by the pigeon hole principle) and then your duplication rate isn't a function of your complexity. It's a function of your rRNA contamination. When we did this paper http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html we removed the rRNA before calculating duplication. They worked out a lot of the methods before I got there but there are a lot of details that they considered that make a difference in your analysis.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Michele Busby2.1k

I'd be super interested in your script, Michele.

1

Hi Rory,

I put it up on GitHub here. It's C++. There is a compiled version up there.

https://github.com/mbusby/ComplexityByStartPosition

Thanks!

2

"Default is 20. If you want it to go faster use 10. If you want a really smooth line use 50. If you are really neurotic use 100." Ha ha.

Thanks for sharing!

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Madelaine Gogol5.2k

I haven't quite thought about it yet, what might work is a plain counts vs duplication rate plot on log-log scales without the collapsing of reads that are highly duplicated. After all there is a lot of value in seeing where that duplication is coming from.

ADD REPLYlink written 6.9 years ago by Istvan Albert ♦♦ 84k

New post to discuss the latest release of FastQC see Revisiting the FastQC read duplication report

ADD REPLYlink written 6.2 years ago by Istvan Albert ♦♦ 84k
1
6.9 years ago by
Ian5.7k
University of Manchester, UK
Ian5.7k wrote:

Thanks for this. I am sure a lot of people will appreciate the clarification. The author Simon Andrews wrote on his blog that a new version of FASTQC is in the pipeline and that he is looking for novel ways to display duplicated read results.

great, thanks for the link, good to know that there will be changes coming

ADD REPLYlink written 6.9 years ago by Istvan Albert ♦♦ 84k
0
6.9 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

This is how I have been dealing with this issue, X-axis represents number of occurrences of the read, Y axis frequency. I then have a script that tries to find out what those duplicated sequences are.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Biomonika (Noolean)3.1k
1

if you make it a log-log scale, might end up looking like a straight line

ADD REPLYlink written 6.9 years ago by Istvan Albert ♦♦ 84k