Question: Estimating the expected number of reads that overlap a [heterozygous] snp in one window from the read depth, read length and resolution
1
4.8 years ago by
Dataman260
Finland
Dataman260 wrote:

Hi,

I am writing a Python program to simulate the cancer genome with different events such as deletion, amplification, copy neutral LOH, ...

The simulator has two parts: simulation of the read depth (this part is already implemented) and simulation of beta allele fraction (BAF) (under development).

To develop the second part, I am going to use the binomial distribution, `binom(n, p)`, where parameter 'n' is the expected number of reads that overlap a snp and  parameter 'p' is calculated based on the events. For instance, if there is no aberration, I expect p to be 0.5 (0.5 for seeing the reference allele and 0.5 for seeing the alternative (b) allele) etc. The output of the binomial distribution is then the number of b-allele at each position.

My question is about estimating the parameter 'n' of the binomial distribution from the simulated read depth track (part one of the program). For instance, if we assume that the resolution of the read depth track is 1000 bp (which is the same as step in a .wig file with fixed step) and the read_length is 150 bp and if we see 400 reads (in one window of length 1000 bp.), how many of those reads overlap one [heterozygous] snp in that window?

I would like to thank you in advance for your suggestions on how to approach this question. Any idea is much appreciated.

Here is how two tracks of real data look like. First is the read depth and the second is the mirrored BAF track:

* binomial distribution is used to add some noise to the data.

modified 4.7 years ago • written 4.8 years ago by Dataman260
1
4.7 years ago by
Dataman260
Finland
Dataman260 wrote:

For instance if we have:
resolution = 1000
read_depth at the current window = 400

Then:

n = 100 * (400 / 1100) ≈ 40

Here is a snapshot of the results of the simulation: