Are GWAS biased by variation in haplotype size?
9.2 years ago
Rubal ▴ 350

Hi Everyone,

Just a general question about GWAS. This has probably been addressed already, but browsing through the literature I couldn't find any specific mention of the issue. Do GWAS take haplotype length into account as a potentially confounding variable? If you are using SNP data from a population and doing a GWAS to identify regions of the genome contributing to an extremely polygenic trait, aren't SNPs on long haplotypes more likely to show significant associations? Firstly, you are more likely to tag them with a SNP and, secondly, they are more likely to contain multiple causative alleles (especially if you assume a highly polygenic additive model with each variant contributing an equally small amount to a trait).

Couldn't this result in an enrichment for GWAS hits on young haplotypes or regions that have recently experienced a selective sweep?
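To make the intuition concrete, here is a minimal simulation sketch (standard-library Python only). It rests on strong simplifying assumptions that are not from any real dataset: every causal variant on a haplotype block is in perfect LD with the tag SNP, all effects have the same size and direction, and a "long" block simply carries more causal variants than a "short" one. The function names and parameter values (`mean_chi2`, `beta=0.05`, `maf=0.3`) are illustrative choices only.

```python
import random
import statistics


def assoc_chi2(genotypes, phenotypes):
    """Approximate 1-df association statistic for one SNP: n * r^2."""
    n = len(genotypes)
    mg = statistics.fmean(genotypes)
    mp = statistics.fmean(phenotypes)
    cov = sum((g - mg) * (y - mp) for g, y in zip(genotypes, phenotypes))
    vg = sum((g - mg) ** 2 for g in genotypes)
    vp = sum((y - mp) ** 2 for y in phenotypes)
    return n * cov * cov / (vg * vp)


def mean_chi2(n_causal, n_ind=2000, beta=0.05, maf=0.3, reps=30, seed=1):
    """Average tag-SNP statistic when n_causal equal-effect variants
    sit on the same block in perfect LD with the tag (an assumption)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        # Tag genotype: number of minor alleles (0, 1 or 2) per individual.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_ind)]
        # Perfect LD means the tag carries the summed effect of the block.
        y = [n_causal * beta * gi + rng.gauss(0.0, 1.0) for gi in g]
        stats.append(assoc_chi2(g, y))
    return statistics.fmean(stats)


short_block = mean_chi2(n_causal=1)  # short haplotype: one causal variant
long_block = mean_chi2(n_causal=5)   # long haplotype: five causal variants
print(f"mean chi2, short block: {short_block:.1f}")
print(f"mean chi2, long block:  {long_block:.1f}")
```

Under these assumptions the tag SNP on the longer block gets a larger average test statistic. In real data the effect directions on a haplotype need not agree, so the signals can partly cancel rather than add up.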

If anybody knows any papers that address this I would be grateful to hear about them, or if anyone can explain why this is not an issue. Perhaps this is not directly a bioinformatics question but I wonder if software for GWAS controls for this, or if it needs to.

Best regards,

snp GWAS genome haplotype • 2.7k views
9.2 years ago
bmpbowen ▴ 40

Firstly: It depends on how the genotyping array was designed. SNPs for GWAS arrays are usually selected to tag as much variation in LD with a SNP as possible. However, other arrays are designed to capture as much known variation from sequence data as possible. The HLA region is a hallmark example of a positively selected locus and a frequent GWAS hit (just look at the NHGRI GWAS Catalog to see how many GWAS to date have a hit at the HLA region).

Secondly: If one assumes that most polygenic effects act in cis, then maybe. But I wouldn't say that all variants contribute equally to a trait, even under an additive model. Some variants explain a larger proportion of the variation in a trait than others. Regarding multiple signals on one haplotype, people usually mask the effect of one SNP to determine how much it contributes to the overall signal. In cases where there are many SNPs in very strong LD, one can prioritize based on functional data which allele is likeliest to contribute to the phenotype.
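As a rough illustration of "masking" one SNP, the sketch below runs a toy conditional analysis in standard-library Python: the phenotype and a second SNP are both regressed on the lead SNP, and the second SNP is then re-tested on the residuals. The data are simulated under assumed values (a lead SNP with effect 0.3, a neighbour that copies it 90% of the time), and the helper names `residualize` and `chi2` are hypothetical, not taken from any GWAS package.

```python
import random
import statistics


def residualize(values, covariate):
    """Residuals from a simple linear regression of values on covariate."""
    mv = statistics.fmean(values)
    mc = statistics.fmean(covariate)
    slope = (sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
             / sum((c - mc) ** 2 for c in covariate))
    return [(v - mv) - slope * (c - mc) for v, c in zip(values, covariate)]


def chi2(x, y):
    """Approximate 1-df association statistic: n * r^2."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return n * cov * cov / (vx * vy)


rng = random.Random(7)
n_ind, maf = 3000, 0.3


def draw_genotype():
    # Number of minor alleles (0, 1 or 2) for one individual.
    return (rng.random() < maf) + (rng.random() < maf)


lead = [draw_genotype() for _ in range(n_ind)]
# Neighbouring SNP in strong LD: copies the lead SNP 90% of the time.
neighbour = [g if rng.random() < 0.9 else draw_genotype() for g in lead]
# Only the lead SNP is causal in this toy model.
pheno = [0.3 * g + rng.gauss(0.0, 1.0) for g in lead]

marginal = chi2(neighbour, pheno)
conditional = chi2(residualize(neighbour, lead), residualize(pheno, lead))
print(f"neighbour SNP, marginal chi2:            {marginal:.1f}")
print(f"neighbour SNP, conditional on lead chi2: {conditional:.1f}")
```

If the neighbour's signal collapses once the lead SNP is accounted for, it was most likely just tagging the lead SNP. A signal that survives conditioning points to a second, independent association on the haplotype.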

This review addresses enrichment of selected loci in GWAS studies:

Nice question.

Thanks for the thoughtful reply. On your first point, yes, it's important to consider the different priorities that go into array design for different studies. HLA is a good example of a positively selected locus that comes up often in GWAS (also lots of balancing selection going on there). That said, it is probably an example of a positively selected locus that commonly has GWAS hits because it is functionally connected to phenotypes. I was more concerned about 'false positive' GWAS hits, or at least hits that are not exactly what the experimenter was looking for, that are driven by selective sweeps. A sweep leaves a region of high LD that is more likely to be tagged by a SNP and/or to contain more causative variants due to its length, relative to other tested SNPs.

I agree with all your points in the second paragraph. I still wonder how much of a problem variation in haplotype length is for GWAS. I suppose that in an ideal world all SNPs would tag small haplotypes of equal size, but given the variation in LD across the genome you just have to accept that this is a factor that influences the power of these studies.

Thanks very much for the review paper you suggested.

