- How did it come to be that the alternate nucleotide was more frequent than the reference nucleotide?
- How should one account for this when designing a strategy to filter for variants of interest? Should I go through the complicated process of selecting the individuals who do NOT carry the variant, and assume that the REFERENCE allele frequency in the population is roughly (1 - esp6500siv2_all)?
I am researching a rare disease and have whole-exome sequence data with the corresponding variant calls. Each variant call has been passed to ANNOVAR and, among other annotations, we have looked up the frequency of the variant in the esp6500siv2_all data. Clearly, a variant that has a high frequency in our sample but a low frequency in esp6500siv2_all would be of disproportionate interest.
Lo and behold, I was surprised to find that 13% of our variants (4055 out of 32131) had an esp6500siv2_all allele frequency greater than 0.5. How can that be? I expected all of the allele frequencies to be below 0.5.
I had thought that the reported frequencies would be akin to a minor allele frequency (MAF). Clearly I was wrong. I pulled 3 random variants from among those with a frequency above 0.5 to check them against the Exome Variant Server.

        avsnp147 Chr     Start       End Ref Alt Gene.refGene esp6500siv2_all
    1: rs3803530  15  89632842  89632842   C   A         KIF7          0.5373
    2:  rs621383   3 125118840 125118840   T   C      SLC12A8          0.9988
    3:  rs633561  11  64229857  64229857   A   G       NUDT22          0.9418

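To make the MAF expectation concrete, here is a minimal Python sketch of what "folding" these values would look like, assuming the esp6500siv2_all column holds the frequency of the Alt allele at a biallelic site rather than a MAF. The function name `fold_to_maf` is hypothetical and not part of ANNOVAR; the three input values are copied from the table above.

```python
def fold_to_maf(alt_freq: float) -> float:
    """Return the frequency of the rarer allele at a biallelic site.

    Assumes alt_freq is the Alt allele frequency, so the Ref allele
    frequency is taken to be 1 - alt_freq.
    """
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError(f"frequency out of range: {alt_freq}")
    return min(alt_freq, 1.0 - alt_freq)


# esp6500siv2_all values for the three variants listed in the table.
esp_alt_freq = {
    "rs3803530": 0.5373,
    "rs621383": 0.9988,
    "rs633561": 0.9418,
}

for rsid, alt_freq in esp_alt_freq.items():
    maf = fold_to_maf(alt_freq)
    # When the Alt allele is the common one, the Ref allele is the minor allele.
    minor = "Ref" if alt_freq > 0.5 else "Alt"
    print(f"{rsid}: alt_freq={alt_freq:.4f} maf={maf:.4f} minor_allele={minor}")
```

Under that assumption, all three variants have the Ref allele as the minor allele, which is the situation the question is about.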
Looking these up on the NHLBI Exome Sequencing Project (ESP) Exome Variant Server and using the "All Allele" counts:
- rs3803530: C>A; A=6984/C=6014, so the frequency of A is 6984/(6984+6014), or about 0.537
- rs621383: T>C; C=12479/T=15, so the frequency of C is 12479/(12479+15), or about 0.999
- rs633561: A>G; G=12240/A=756, so the frequency of G is 12240/(12240+756), or about 0.942
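The arithmetic above can be reproduced with a short Python sketch. It recomputes the Alt allele frequency from the Exome Variant Server counts quoted in the list and compares it with the esp6500siv2_all value from the ANNOVAR output. The function name `alt_allele_frequency` and the tolerance of 0.0005 are my own choices for illustration.

```python
def alt_allele_frequency(alt_count: int, ref_count: int) -> float:
    """Fraction of observed alleles that are the Alt allele."""
    total = alt_count + ref_count
    if total == 0:
        raise ValueError("no alleles observed")
    return alt_count / total


# rsID -> (Alt count, Ref count, esp6500siv2_all as annotated by ANNOVAR)
evs_counts = {
    "rs3803530": (6984, 6014, 0.5373),
    "rs621383": (12479, 15, 0.9988),
    "rs633561": (12240, 756, 0.9418),
}

for rsid, (alt_count, ref_count, annotated) in evs_counts.items():
    computed = alt_allele_frequency(alt_count, ref_count)
    # 0.0005 is an arbitrary tolerance that absorbs rounding to 4 decimals.
    agrees = abs(computed - annotated) < 0.0005
    print(f"{rsid}: computed={computed:.4f} annotated={annotated:.4f} agrees={agrees}")
```

For these three variants the computed values agree with the annotated ones, which suggests that esp6500siv2_all is the frequency of the Alt allele relative to the reference genome, not a minor allele frequency.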