5.0 years ago by
Having written my comment, I realized that probably the most straightforward way to estimate how many novel variants to
expect in a dataset of a given size is to subsample the 1kG data. Note that this is not an estimate of the
mutation rate unless we have a method to single out point mutations. SNPs have previously been defined as variants with a
MAF of at least some threshold (e.g. 1%); I am not sure how much sense such an arbitrary threshold makes in this case.
Also, a sampling approach might not be practicable because of the
computational cost and its sensitivity to the variant calling pipeline and its parameters.
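The subsampling idea above could be sketched roughly as follows. This is only an illustration under strong assumptions: each individual is represented as a set of variant identifiers (a hypothetical in-memory format, not how the 1kG data is actually distributed), and the function name `mean_novel_variants` is made up for this example. It repeatedly draws a random reference subsample plus one held-out individual and counts the held-out variants that the subsample has not seen.

```python
import random


def mean_novel_variants(samples, subsample_size, n_draws=200, seed=0):
    """Average number of variants in a held-out individual that are
    absent from a random subsample of the other individuals.

    samples: list of sets of variant IDs, one set per individual
             (hypothetical input format, assumed for this sketch).
    """
    if not 0 < subsample_size < len(samples):
        raise ValueError("subsample_size must be between 1 and len(samples) - 1")
    rng = random.Random(seed)
    total = 0
    for _ in range(n_draws):
        # Draw subsample_size + 1 distinct individuals; the first is held out.
        drawn = rng.sample(range(len(samples)), subsample_size + 1)
        held_out, reference = drawn[0], drawn[1:]
        # All variants observed in the reference subsample.
        seen = set().union(*(samples[i] for i in reference))
        # Variants of the held-out individual that are new relative to it.
        total += len(samples[held_out] - seen)
    return total / n_draws


# Toy usage with made-up variant IDs.
toy = [{"v1", "v2"}, {"v1", "v3"}, {"v1", "v2", "v4"}, {"v1"}]
print(mean_novel_variants(toy, subsample_size=2))
```

Repeating this for increasing `subsample_size` would give a curve of expected novel variants per additional genome, which is still subject to the caveats about cost and variant calling mentioned above.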
Also, I looked for concrete estimates of mutation rates:
The human mutation rate is higher in the male germ line (sperm) than
the female (egg cells), but estimates of the exact rate have varied by
an order of magnitude or more. [...]
Using data available from whole genome sequencing, the human genome
mutation rate is similarly estimated to be ~1.1×10^-8 per site per
Using this probability as p (or any other estimate) for a single event, the probability of observing n = 3 or more mutations (successes) in
k trials (k := number of exonic bases in the gene) can be calculated with the cumulative distribution function of the binomial distribution. In R,
pbinom(n, k, p, lower.tail = FALSE, log.p = FALSE) returns P(X > n), so pass n - 1 to get the probability of n or more. Given that the CDS of human FANCA is 4368 nt, and assuming
the highest mutation rate I found, 2.7e-8, pbinom(3, 4368, 2.7e-8, lower.tail = FALSE) yields 8.048903e-18 for more than 3 mutations, and pbinom(2, 4368, 2.7e-8, lower.tail = FALSE) gives about 2.7e-13 for 3 or more. Either value looks significant, but the result depends on the purity of the sequences: if your variants are sampled from a mixture, this naive calculation is void.
This relies on the assumption that the mutation events are independent of each other. This should be justified for the mutations themselves, but not for their
accumulation, because the level of purifying selection varies between regions. As you probably compared somatic with germline cells, the mutations might be real point mutations, and under the null hypothesis there should not be a significant accumulation of mutations in any given region.
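For anyone who wants to check the binomial tail calculation outside R, here is a minimal Python sketch using only the standard library. The function name `binom_upper_tail` is made up for this example. It computes P(X >= n) directly, whereas R's pbinom(n, k, p, lower.tail = FALSE) returns P(X > n), so `binom_upper_tail(n + 1, k, p)` corresponds to that R call. The terms are summed in log space because they are far too small and the binomial coefficients far too large to handle naively.

```python
import math


def binom_upper_tail(n, k, p):
    """P(X >= n) for X ~ Binomial(k, p), with 0 < p < 1.

    Each probability mass term is evaluated in log space via lgamma
    and then exponentiated; terms that underflow contribute 0.0.
    """
    if n <= 0:
        return 1.0
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for i in range(n, k + 1):
        log_pmf = (
            math.lgamma(k + 1) - math.lgamma(i + 1) - math.lgamma(k - i + 1)
            + i * log_p + (k - i) * log_q
        )
        total += math.exp(log_pmf)
    return total


cds_length = 4368   # CDS length of human FANCA in nt, as quoted above
rate = 2.7e-8       # highest per-site mutation rate quoted above

print(binom_upper_tail(3, cds_length, rate))  # 3 or more mutations
print(binom_upper_tail(4, cds_length, rate))  # more than 3 mutations
```

The same independence assumption applies here as for the R call, so this is a sanity check of the arithmetic and not a replacement for a proper test.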