This is a long post, hopefully you can follow me to the end...
Suppose you map N paired end reads on a genome of size G.
Q: What is the expected number of duplicate reads exclusively due to chance?
Let's assume there are no complications: The genome is made of a single chromosome, fully mappable, all the positions have equal probability of being mapped, reads are small compared to the genome size. There are no PCR duplicates, no technical artifacts.
I think I managed to figure out the solution for 1) single end reads and 2) paired end reads where fragment sizes between a and b have equal probability of being sampled.
But I would like to get a general solution, specifically where fragment size follow a Poisson distribution.
Expected level of read duplication for single end reads
For SE reads, two reads are duplicates if they share the same 5'end position on the same strand.
In this case the expected number of duplicates comes from, in R language:
exp_dup<- N - (sum(dpois(1:n, N/G/2) * G) * 2)
Where dpois is the probability mass function for the Poisson distribution (appropriate given the assumptions above.)
- dpois(1:n, N/G/2) is the probability of a position to be covered 1, 2, ... n times on the same strand (for n -> Inf).
- dpois(1:n, N/G/2) * G is the expected number of positions covered exactly 1, 2, ... n times on the same strand.
- sum(...) * G is total number of positions with non-zero coverage. Since a position can be covered just once without incurring in duplication this is the expected number of non-duplicates. Multiply by 2 for forward and reverse strand. N minus this product is the number of duplicates.
For example, we sequence 10000 reads on a genome of size 10000. We expect ~576 duplicates:
G<- 10000 N<- 5000 N - (sum(dpois(1:10, (N/2) / G) * G) * 2)  576.0157
Expected level of read duplication for paired end reads
PE read pairs are duplicates if they share the same 5'end positions for both mate 1 and mate 2.
Let's assume that the fragment size has a min length a and max length b. The lengths between
a and b are equally likely to occur. In this case, we need to "augment" the genome
size to account for the different fragment lengths:
G2<- G + G * (b - a) exp_dup_pe<- N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)
For example, if the possible fragment sizes are only 100 and 110 bp, we use the formula
for SE reads with a genome 10 times as large.
But if the fragment sizes are not equally likely to occur, as they don't since they follow a Poisson-ish distribution, how can I calculate the expected number of read duplicates?
This is a script to simulate random PE reads.
library(data.table) G<- 10000 N<- 20000 a<- 1 # Min fragment length b<- 10 # Max fragment length L<- 15 # Mean fragment length n<- 10 # A number sufficiently large (ideally oo) set.seed(1) pos_fp1<- sample(1:G, size= N, replace= TRUE) set.seed(2) # pos_fp2<- pos_fp1 + rpois(n= N, lambda= L) pos_fp2<- pos_fp1 + sample(a:b, size= N, replace= TRUE) pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2) set.seed(3) strand<- sample(c('-', '+'), size= N, replace= TRUE) reads<- data.table(pos_fp1, pos_fp2, strand)[order(pos_fp1, pos_fp2),] ## Observed number of reads for SE: ## ================================ depth_se<- reads[, list(depth= .N), by= list(pos_fp1, strand)][order(depth)] dups<- table(depth_se$depth) dups<- data.table(depth= as.numeric(names(dups)), n_pos= dups) dups$nreads<- dups$depth * dups$n_pos dups$ndups<- (dups$depth - 1) * dups$n_pos sum(dups$ndups) 7373 ## Analytical way SE ## ================= N - (sum(dpois(1:n, N/G/2) * G) * 2) 7357.589 ## Observed number of reads for PE: ## ================================ depth_pe<- reads[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)][order(depth)] N - nrow(depth_pe) 972 ## Analytical way PE ## ================= G2<- G + G * (b - a) N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2) 967.4836