The experimental basis of GWAS is genotyping. SNP genotyping arrays enable rapid scanning of 0.3M, 0.5M or 1M genetic markers (SNPs) to find genetic variants associated with complex diseases or traits. A GWAS needs both a large number of markers and a large number of subjects to detect a reliable signal, and associations must reach high statistical significance. For a detailed overview of recent advances in GWAS refer to another discussion here.
How much of the genome is 'captured' in a GWAS with 300k, 500k or 1,000k SNPs?
The human genome carries roughly 1 SNP per 100-300 bp, so a ~3 Gb sequence holds on the order of 10 million SNPs. Genotyping all of them in every subject is impractical because of several limiting factors. To deal with this issue we can use Linkage Disequilibrium (LD) mapping (see section on D', recombination rate), haplotypes, haplotype blocks and haplotype tag SNPs (tagSNPs). (Read about HapMap project here). Instead of genotyping all 10M SNPs, we can genotype the tagSNPs in each haplotype block. A tagSNP is a representative SNP in a region of the genome with high LD, so its genotype stands in for the other SNPs in that block. This makes it possible to find genetic variation without genotyping all 10M SNPs. Previous studies indicated that genotyping chips with 0.5M-1M SNPs are sufficient for a good GWAS.
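To make the tagSNP idea concrete, here is a minimal sketch of greedy tag selection on toy data. It assumes phased biallelic haplotypes coded 0/1 and an r^2 threshold of 0.8; the SNP names, the toy haplotypes and the function names are all made up for illustration, and real tagging tools use more refined criteria than this.

```python
from itertools import combinations


def r_squared(a, b):
    """r^2 between two biallelic SNPs from phased haplotype alleles (0/1)."""
    n = len(a)
    pa = sum(a) / n                      # frequency of allele 1 at SNP a
    pb = sum(b) / n                      # frequency of allele 1 at SNP b
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0                       # monomorphic SNP: treat LD as zero
    d = pab - pa * pb                    # LD coefficient D
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tags until every SNP is a tag or has r^2 >= threshold with one.

    haplotypes: dict mapping SNP id -> list of 0/1 alleles (same length).
    """
    ids = list(haplotypes)
    partners = {s: {s} for s in ids}     # each SNP trivially tags itself
    for s, t in combinations(ids, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            partners[s].add(t)
            partners[t].add(s)

    untagged = set(ids)
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(ids, key=lambda s: len(partners[s] & untagged))
        tags.append(best)
        untagged -= partners[best]
    return tags


# Hypothetical toy data: 4 SNPs typed on 8 haplotypes.
toy_haplotypes = {
    "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1, 0, 1],   # identical to snp1
    "snp3": [1, 1, 0, 0, 1, 0, 1, 0],   # mirror image of snp1, still r^2 = 1
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],   # only weakly correlated with the rest
}

toy_tags = greedy_tag_snps(toy_haplotypes)
print(toy_tags)
```

In this toy set, snp1, snp2 and snp3 are in complete LD with one another, so one of them is enough to tag all three, while snp4 has to be genotyped directly.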
And where are most of the tagging SNPs located?
Are they mostly in the exome?
No. Tag SNP selection is not biased towards the exome. Most GWAS hits are in intergenic or promoter regions, or in other regions distal to exons.
Excellent overview provided by Khader. I'll add a couple of points:
Some platforms capture SNPs in LD better than others. Illumina is better in this regard, but the new version from Affy makes up for this deficiency. Size matters too: more SNPs give better LD coverage.
Population differences. Some populations will not be as well interrogated by available arrays as others. This is because many sites that are polymorphic in one population may not be variable in another, or may be present at too low a frequency to be included on the array. This is not a huge problem, but it can be important for some genomic regions. Think of the extreme: SNPs private to my family are not likely to be on any array because they have not been seen before.
Another way to word your question: Of all LD blocks defined by r^2 = 1.0 (or 0.9 or 0.8, etc.) and containing n SNPs (where n > 0, or n > 1 or...), how many of those LD blocks are represented on an array? That's a tough question, and the answer depends on the population under study. We do GWAS and study several different populations, and we have not put the effort into this calculation. To us, it is not a high priority because we use the platforms and data we have, engage in careful analysis, and report our findings. If a more complete array or analysis comes along later, so be it.
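For anyone who does want to attempt that calculation, a rough sketch follows. It assumes you already have a table of pairwise r^2 values for the population of interest (for example, computed from reference haplotypes) and the list of SNPs on the array. The function name, the SNP ids and the numbers are hypothetical and only illustrate the bookkeeping; this is not a result for any real array.

```python
def array_coverage(ld_pairs, all_snps, array_snps, threshold=0.8):
    """Fraction of all_snps that are on the array or in LD with an array SNP.

    ld_pairs: iterable of (snp_a, snp_b, r2) tuples for one population.
    A SNP counts as captured if it is typed directly, or if it has
    r2 >= threshold with at least one SNP that is typed directly.
    """
    universe = set(all_snps)
    on_array = set(array_snps) & universe
    captured = set(on_array)
    for a, b, r2 in ld_pairs:
        if r2 < threshold:
            continue
        if a in on_array:
            captured.add(b)
        if b in on_array:
            captured.add(a)
    captured &= universe                 # ignore SNPs outside the reference set
    return len(captured) / len(universe)


# Hypothetical toy inputs: 5 known SNPs, 2 of them on the array.
toy_snps = ["s1", "s2", "s3", "s4", "s5"]
toy_array = ["s1", "s4"]
toy_ld = [
    ("s1", "s2", 1.0),   # s2 is perfectly tagged by s1
    ("s2", "s3", 0.9),   # s3 is linked to s2, but s2 is not on the array
    ("s4", "s5", 0.6),   # s5 is captured only at a relaxed threshold
]

coverage_strict = array_coverage(toy_ld, toy_snps, toy_array, threshold=0.8)
coverage_relaxed = array_coverage(toy_ld, toy_snps, toy_array, threshold=0.5)
print(coverage_strict, coverage_relaxed)
```

Note that capture here is pairwise, not transitive: s3 is not counted even though it is in LD with s2, because s2 itself is not genotyped. The answer also changes with the threshold and with the population the r^2 table comes from, which is the dependence described above.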
Thank you Larry, and in particular, Khader for the informative responses.
The answer that I am looking for, then, is how many of the estimated 10 million SNPs are captured by each of the aforementioned SNP arrays, for example in a CEU cohort.