I am writing a Python program to simulate the cancer genome with different events such as deletion, amplification, copy neutral LOH, ...
The simulator has two parts: simulation of the read depth (this part is already implemented) and simulation of beta allele fraction (BAF) (under development).
To develop the second part, I am going to use the binomial distribution,
binom(n, p), where parameter 'n' is the expected number of reads that overlap a snp and parameter 'p' is calculated based on the events. For instance, if there is no aberration, I expect p to be 0.5 (0.5 for seeing the reference allele and 0.5 for seeing the alternative (b) allele) etc. The output of the binomial distribution is then the number of b-allele at each position.
My question is about estimating the parameter 'n' of the binomial distribution from the simulated read depth track (part one of the program). For instance, if we assume that the resolution of the read depth track is 1000 bp (which is the same as step in a .wig file with fixed step) and the read_length is 150 bp and if we see 400 reads (in one window of length 1000 bp.), how many of those reads overlap one [heterozygous] snp in that window?
I would like to thank you in advance for your suggestions on how to approach this question. Any idea is much appreciated.
Here is how two tracks of real data look like. First is the read depth and the second is the mirrored BAF track:
- binomial distribution is used to add some noise to the data.