How Likely Are Two Sequences To Have The Same Snps By Chance?
10.4 years ago
Nick Stoler ▴ 70

I'm trying to figure out the theoretical chances of variants in two sequences coinciding (in the same position) by chance.

Right now I'm just thinking about the simple case of two aligned sequences of length n (say, 200bp), with a known SNP rate of 0.001 per bp. It's interesting because it seems like a variant of the birthday paradox, but applying the analogy to the case of nucleotides and SNPs is less straightforward than I expected. I also could be wrong about the parallel.

First I'm just concentrating on figuring out the chances of there being a shared SNP at all, regardless of whether the base is the same. It seems I might be able to calculate the probability of there not being a shared SNP pretty easily. Looking at any pair of aligned single nucleotides, the chance of both being SNPs should be 0.001 * 0.001 = 0.000001, and the chance of them not both being SNPs is then 0.999999. So am I able to say that the chance of there not being a single shared SNP among all n nucleotides is 0.999999^n?
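As a quick numerical check of the reasoning above, here is a minimal Python sketch. It assumes exactly what the question assumes: each position is a SNP independently in each sequence with the same probability p. The function names are made up for illustration.

```python
def prob_no_shared_snp(n: int, p: float) -> float:
    """Chance that no aligned position is a SNP in both sequences.

    Assumes independent sites and independent sequences, each position
    being a SNP with probability p (the simplified model in the question).
    """
    per_site_both = p * p              # both sequences have a SNP at this site
    return (1.0 - per_site_both) ** n  # no site has a SNP in both


def prob_shared_snp(n: int, p: float) -> float:
    """Chance of at least one position that is a SNP in both sequences."""
    return 1.0 - prob_no_shared_snp(n, p)


if __name__ == "__main__":
    n, p = 200, 0.001
    print("P(no shared SNP)      =", prob_no_shared_snp(n, p))
    print("P(at least one shared) =", prob_shared_snp(n, p))
```

For n = 200 and p = 0.001 the second number comes out at roughly 0.0002, that is, about n * p^2. This only counts coinciding positions; requiring the same alternate base would lower it further.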

Edit: I should make clear that I know this is loaded with assumptions that simply aren't true in reality, such as evenly distributed SNPs, unrelatedness of the individuals, etc. Which is why the usefulness of even calculating it is up for debate, but I'm trying to get a sense of the mathematical relationship, all things being equal, between SNP frequency, sequence length, and coincidental SNPs. This is, of course, the null hypothesis, whereas the alternative hypothesis is that the shared SNPs are due to homology.

snp homology • 3.2k views

Note also that the human reference genome itself contains rare SNPs. As a result, at these loci the probability that any two unrelated individuals will have the same non-reference base is very high (since the reference base is the rare allele, and the individuals simply have the common allele).

It only becomes a variant of the birthday paradox if you are asking if in a pool of sequences, there are two that share a SNP.
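To illustrate the pool version of the question, here is a rough Python sketch under the same unrealistic assumptions as the original post (independent sites, independent sequences, one per-site SNP probability p). The function name and the pool size m are hypothetical, introduced only for this example.

```python
def prob_any_pair_shares_snp(m: int, n: int, p: float) -> float:
    """Chance that, among m aligned sequences of length n, at least one
    position is a SNP in two or more of them.

    Assumes every site in every sequence is a SNP independently with
    probability p.
    """
    # At one site, the number of sequences carrying a SNP is Binomial(m, p).
    p_zero = (1.0 - p) ** m                # no sequence has a SNP here
    p_one = m * p * (1.0 - p) ** (m - 1)   # exactly one sequence has a SNP here
    per_site_shared = 1.0 - p_zero - p_one  # two or more share this site
    return 1.0 - (1.0 - per_site_shared) ** n


if __name__ == "__main__":
    for m in (2, 5, 10, 50):
        print(m, prob_any_pair_shares_snp(m, n=200, p=0.001))
```

With m = 2 this reduces to the two-sequence expression 1 - (1 - p^2)^n, and the probability climbs quickly as the pool grows, which is where the birthday-paradox flavour comes from.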

10.4 years ago
lh3 33k

If you are talking about shared alleles between individuals from one population: most shared alleles between individuals exist not because the same mutation occurred independently twice in history, but because the local haplotypes descended from the same ancestor in which the mutation first occurred. In addition, 0.001 is the typical heterozygosity of an African, or equivalently the fraction of sites at which two chromosomes differ. Strictly speaking, it is not a "SNP rate". Furthermore, due to coalescence, the local heterozygosity varies by up to several orders of magnitude. Mutations also tend to occur at a higher frequency in some sequence contexts than in others. CpG mutations are a good example.

All very good points. I edited the question to acknowledge the explicitly unrealistic assumptions.

Sorry, my thinking actually led me to an answer approximating 0.001^2 for short sequence lengths. To be exact, the probability of a coincidental shared SNP I got is 1-(1-0.001^2)^n where n = sequence length.
Is it actually 0.001, regardless of length?
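For reference, the expression in this comment can be expanded to show how it depends on length. This is only a sketch under the question's simplified model (independent sites, per-site SNP probability p in each sequence); the symbol p is shorthand introduced here for the 0.001 figure.

```latex
P(\text{at least one shared SNP position})
  = 1 - \left(1 - p^{2}\right)^{n}
  \approx n\,p^{2}
  \qquad \text{when } n\,p^{2} \ll 1 .
% With p = 0.001:
%   n = 1   gives exactly p^2 = 10^{-6}
%   n = 200 gives about 200 * 10^{-6} = 2 * 10^{-4}
```

Under these assumptions the probability is exactly p^2 for a single position and grows roughly linearly with n while n * p^2 stays small.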