Having written my comment, I realized that probably the most straightforward way to estimate how many
novel variants to expect in a dataset of a given size is to subsample the 1kG data. Note that this is not an estimate of the
mutation rate unless we have a method to single out point mutations. SNPs have traditionally been defined as variants with a
MAF of at least some threshold (e.g. 1%); I am not sure how much sense such an arbitrary threshold makes in this case.
Also, a sampling approach might not be practical because of the
computational cost and its sensitivity to the variant calling pipeline and its parameters.
Also, when looking for estimates of the mutation rate, I found:
The human mutation rate is higher in the male germ line (sperm) than
the female (egg cells), but estimates of the exact rate have varied by
an order of magnitude or more. [...]
Using data available from whole genome sequencing, the human genome
mutation rate is similarly estimated to be ~1.1×10^-8 per site per
generation.
http://en.wikipedia.org/wiki/Mutation_rates
Using this probability (or any other estimate) as p for a single event, you can calculate the probability of observing n = 3 or more mutations (successes) in
k trials (k := number of exonic bases in the gene) from the cumulative distribution function of the binomial distribution. In R you can use
the function pbinom(n - 1, k, p, lower.tail = FALSE, log.p = FALSE)
to calculate this probability; note that with lower.tail = FALSE, pbinom returns P(X > q), so "n or more" requires q = n - 1. Given that the CDS of human FANCA is 4368 nt, and assuming
the highest mutation rate I found, 2.7e-8, this yields about 2.7e-13 (pbinom(3, ...), i.e. four or more mutations, gives 8.048903e-18). This looks significant, but it depends on the purity of the sequences: if your variants are sampled from a mixture, this naive calculation is void.
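As a cross-check of the numbers, here is a minimal Python sketch of the same upper-tail calculation. The function name binom_upper_tail is my own, not from any library; it sums the binomial probabilities directly instead of computing 1 - CDF, which would lose precision for such tiny values. It should agree with R's pbinom(n - 1, k, p, lower.tail = FALSE), since with lower.tail = FALSE pbinom returns P(X > q).

```python
import math

def binom_upper_tail(n, k, p):
    """Return P(X >= n) for X ~ Binomial(k, p), with 0 < p < 1.

    Hypothetical helper for illustration; sums the upper tail term by term.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if n <= 0:
        return 1.0
    total = 0.0
    for i in range(n, k + 1):
        # log of C(k, i) * p^i * (1 - p)^(k - i), computed in log space
        log_term = (math.lgamma(k + 1) - math.lgamma(i + 1)
                    - math.lgamma(k - i + 1)
                    + i * math.log(p) + (k - i) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Stop once further terms can no longer change the sum.
        if term < total * 1e-17:
            break
    return total

k = 4368    # CDS length of human FANCA in nt (from the text above)
p = 2.7e-8  # highest per-site mutation rate estimate mentioned above

print(binom_upper_tail(3, k, p))  # three or more mutations, roughly 2.7e-13
print(binom_upper_tail(4, k, p))  # four or more mutations, roughly 8.05e-18
```

The second value corresponds to calling pbinom with q = 3, which is the "more than three" tail rather than "three or more".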
This relies on the assumption that the mutation events are independent of each other. This should be justified for real mutations, but not for their
accumulation, because the level of purifying selection varies between regions. As you probably compared somatic with germ-line cells, the variants might be real point mutations, and under the null hypothesis there should not be a significant accumulation of mutations in any given region.
I am sorry that I cannot answer your question directly. Approaching the problem the way you describe raises an immediate follow-up question for me: given a set of rare variants, without a family pedigree, how can one differentiate mutations from SNPs? With your patient data you might have looked for novel variants (e.g. not in dbSNP or in 1kG). That might introduce circular inference, because by your definition of 'likely a mutation' your new variants are disjoint from the 1kG set. So you could just as well have picked up variation with a MAF below the detection limit of the 1kG data. Such an observation is perhaps not that striking, given that novel variants will be called for every new genome analyzed. At least a few; how many could be estimated by simulation, e.g. by taking 1 genome out of the 1kG data, re-calling the variants for the remaining 999 genomes, then calling variants for the one taken out.
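The leave-one-out idea can be sketched in a simplified form. The Python snippet below is only an illustration under strong assumptions: it skips the re-calling step entirely and just takes set differences on already existing call sets, and the sample names and variant identifiers are made up. The function name novel_per_sample is hypothetical.

```python
def novel_per_sample(calls):
    """For each sample, count variants not seen in any other sample.

    calls: dict mapping sample name -> set of variant identifiers.
    This is a crude stand-in for "take one genome out and re-call".
    """
    result = {}
    for sample, variants in calls.items():
        # Union of the variants called in all remaining samples.
        others = set()
        for other, other_variants in calls.items():
            if other != sample:
                others |= other_variants
        result[sample] = len(variants - others)
    return result

# Toy, made-up identifiers (chrom:pos:ref>alt); not real data.
toy_calls = {
    "s1": {"16:100:A>G", "16:200:C>T"},
    "s2": {"16:100:A>G", "16:300:G>A"},
    "s3": {"16:100:A>G", "16:200:C>T", "16:400:T>C", "16:500:G>C"},
}

print(novel_per_sample(toy_calls))  # {'s1': 0, 's2': 1, 's3': 2}
```

On real data, the distribution of these per-sample counts would give a rough baseline for how many "novel" variants to expect from any newly sequenced genome.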
To find out whether the variants are associated with the phenotype, it might be more appropriate to re-frame your problem as standard association testing, which includes the phenotype. Testing solely for deviation from the mutation rate does not, by design, take any phenotype into account and thus cannot tell you anything about phenotype-genotype association.
Also, you need to explain how a germ-line mutation can be linked to a somatic phenotype. Or are you looking at a germ-line related phenotype?
Your phrase is incomplete. "To ask if the 3/50 mutations is statistically more than expected" -> more what than expected??