Question

Expected number of read duplicates

0

Entering edit mode

9.3 years ago

dariober 14k

This is a long post, hopefully you can follow me to the end...

Suppose you map N paired end reads on a genome of size G.

Q: What is the expected number of duplicate reads exclusively due to chance?

Let's assume there are no complications: The genome is made of a single chromosome, fully mappable, all the positions have equal probability of being mapped, reads are small compared to the genome size. There are no PCR duplicates, no technical artifacts.

I think I managed to figure out the solution for 1) single end reads and 2) paired end reads where fragment sizes between a and b have equal probability of being sampled.
But I would like to get a general solution, specifically where fragment size follow a Poisson distribution.

Expected level of read duplication for single end reads

For SE reads, two reads are duplicates if they share the same 5'end position on the same strand.
In this case the expected number of duplicates comes from, in R language:

exp_dup<- N - (sum(dpois(1:n, N/G/2) * G) * 2)

Where dpois is the probability mass function for the Poisson distribution (appropriate given the assumptions above.)

Explanation:

dpois(1:n, N/G/2) is the probability of a position to be covered 1, 2, ... n times on the same strand (for n -> Inf).
dpois(1:n, N/G/2) * G is the expected number of positions covered exactly 1, 2, ... n times on the same strand.
sum(...) * G is total number of positions with non-zero coverage. Since a position can be covered just once without incurring in duplication this is the expected number of non-duplicates. Multiply by 2 for forward and reverse strand. N minus this product is the number of duplicates.

For example, we sequence 10000 reads on a genome of size 10000. We expect ~576 duplicates:

G<- 10000
N<- 5000
N - (sum(dpois(1:10, (N/2) / G) * G) * 2)
[1] 576.0157

Expected level of read duplication for paired end reads

PE read pairs are duplicates if they share the same 5'end positions for both mate 1 and mate 2.

Let's assume that the fragment size has a min length a and max length b. The lengths between
a and b are equally likely to occur. In this case, we need to "augment" the genome
size to account for the different fragment lengths:

G2<- G + G * (b - a)
exp_dup_pe<- N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)

For example, if the possible fragment sizes are only 100 and 110 bp, we use the formula
for SE reads with a genome 10 times as large.

But if the fragment sizes are not equally likely to occur, as they don't since they follow a Poisson-ish distribution, how can I calculate the expected number of read duplicates?

===
This is a script to simulate random PE reads.

library(data.table)
G<- 10000
N<- 20000
a<- 1 # Min fragment length
b<- 10 # Max fragment length
L<- 15 # Mean fragment length
n<- 10 # A number sufficiently large (ideally oo)

set.seed(1)
pos_fp1<- sample(1:G, size= N, replace= TRUE)

set.seed(2)
# pos_fp2<- pos_fp1 + rpois(n= N, lambda= L)
pos_fp2<- pos_fp1 + sample(a:b, size= N, replace= TRUE)
pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2)

set.seed(3)
strand<- sample(c('-', '+'), size= N, replace= TRUE)
reads<- data.table(pos_fp1, pos_fp2, strand)[order(pos_fp1, pos_fp2),]  

## Observed number of reads for SE:
## ================================
depth_se<- reads[, list(depth= .N), by= list(pos_fp1, strand)][order(depth)]
dups<- table(depth_se$depth)
dups<- data.table(depth= as.numeric(names(dups)), n_pos= dups)
dups$nreads<- dups$depth * dups$n_pos
dups$ndups<- (dups$depth - 1) * dups$n_pos
sum(dups$ndups)
7373

## Analytical way SE
## =================
N - (sum(dpois(1:n, N/G/2) * G) * 2)
7357.589

## Observed number of reads for PE:
## ================================
depth_pe<- reads[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)][order(depth)]
N - nrow(depth_pe)
972

## Analytical way PE
## =================
G2<- G + G * (b - a)
N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)
967.4836

read duplication statistics • 3.3k views

ADD COMMENT • link updated 9.3 years ago by Devon Ryan 104k • written 9.3 years ago by dariober 14k

score 0 · Answer 1 · 2015-01-11

Assuming your math for the single-end case is correct, then the paired-end case is based off of that. Given two PE alignments where read #1 of each aligns to the same position (the probability of this you already determined), the probability that the two alignments will be marked as a duplicate is simply the probability that the insert sizes are the same. I'm sure there some nice equation for this, but I'm lazy and this is easy to simply calculate in R:

sum(dpois(1:1000, 500)^2)

Here, 500 is the mean fragment length. The upper limit is sufficient for most common libraries and is just a time saver. In practice, this means you expect 1-2% of what you'd get in an SE dataset.