Is there a good resource for identifying roughly how many reference alleles in hg19 are the minor allele? By fraction? On an Affymetrix array, after filtering, I am left with ~550k SNPs of which 77k are flagged as MAF > 0.5 by checkVCF. That seems high to me, but then again, I don't really have a frame of reference.
Well, the truth may be surprising: hg19 / GRCh37 contains over 100,000 positions where the reference base is the minor allele, at a frequency below 0.01. It also contains thousands of known disease risk alleles. This is because about 70% of this 'reference' genome is based on a single individual from Buffalo, New York, USA. As we are all aware, none of us is completely healthy, and we each carry thousands of alleles that increase our susceptibility to various diseases.
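For a frame of reference on your own data, 77k out of ~550k is about 14% of sites. If you want to reproduce that kind of count yourself, here is a minimal sketch. It assumes your VCF carries an `AF` tag in the INFO column giving the ALT allele frequency, and it skips multiallelic records; the function name and the toy records are made up for illustration and are not what checkVCF does internally.

```python
def ref_is_minor_fraction(vcf_lines, threshold=0.5):
    """Count biallelic records whose ALT frequency exceeds `threshold`.

    Assumes INFO contains an 'AF=' entry holding the ALT allele frequency.
    Returns (n_ref_minor, n_total), where n_total counts usable records.
    """
    n_ref_minor = 0
    n_total = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a full VCF record
        alt, info = fields[4], fields[7]
        if "," in alt:
            continue  # skip multiallelic sites in this sketch
        af = None
        for entry in info.split(";"):
            if entry.startswith("AF="):
                af = float(entry[3:])
                break
        if af is None:
            continue  # no frequency available
        n_total += 1
        if af > threshold:
            n_ref_minor += 1  # ALT is the major allele, so REF is minor
    return n_ref_minor, n_total


# Toy records (hypothetical, not real variants)
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trsA\tA\tG\t.\tPASS\tAF=0.92\n",
    "1\t200\trsB\tC\tT\t.\tPASS\tAF=0.10\n",
    "1\t300\trsC\tG\tA\t.\tPASS\tAC=5;AF=0.55\n",
    "1\t400\trsD\tT\tC,G\t.\tPASS\tAF=0.3,0.6\n",
]

n_ref_minor, n_total = ref_is_minor_fraction(toy_vcf)
print(n_ref_minor, n_total)  # 2 3
```

On a real file you would pass an open file handle instead of the toy list. Keep in mind that the answer depends on which population the `AF` values were estimated in.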
This of course makes the work of clinical geneticists very difficult. In many scenarios, our allele of interest may already be in the very reference genome against which we are aligning our reads. This can confuse variant callers and annotation programs and, without proper investigation, it may appear that people who do not carry the disease allele do in fact carry it, and vice versa. A good example of this is Factor V Leiden, a variant that increases the risk of deep vein thrombosis. The hg19 dude had this risk allele.
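To make the "vice versa" concrete, here is a small sketch of counting risk-allele copies from a VCF genotype when the risk allele may be the REF base. The function name and the T/C alleles are hypothetical placeholders, not the actual Factor V Leiden record.

```python
def risk_allele_dosage(gt, ref, alts, risk_allele):
    """Count copies of `risk_allele` in a VCF GT string such as '0/1'.

    Index 0 in GT refers to REF, 1..n to the ALT alleles in order.
    Missing calls ('.') are ignored.
    """
    alleles = [ref] + list(alts)
    dosage = 0
    for idx in gt.replace("|", "/").split("/"):
        if idx == ".":
            continue
        if alleles[int(idx)] == risk_allele:
            dosage += 1
    return dosage


# Hypothetical site where the reference base T is the risk allele
print(risk_allele_dosage("0/0", "T", ["C"], risk_allele="T"))  # 2
print(risk_allele_dosage("0|1", "T", ["C"], risk_allele="T"))  # 1
print(risk_allele_dosage("1/1", "T", ["C"], risk_allele="T"))  # 0
```

So at such a site the sample reported as homozygous "variant" (1/1) carries no risk allele, while the sample that matches the reference (0/0) carries two copies. Since many pipelines only write out sites that differ from the reference, the 0/0 carrier may not appear in the VCF at all.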
Even with updates and patches to the reference genome, the same problem persists. When you think about it, there really is no way to have a single consensus reference genome; at best, we would need a separate reference genome for each ethnic group across the globe.
It's just something that you need to keep in the back of your head.
For further reading:
- The reference human genome demonstrates high risk of type 1 diabetes and other disorders
- Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy. How do I stop the twirl?