Dealing with Multiallelic SNPs in GWAS
2.2 years ago
godth13teen ▴ 70

Hi, I'm quite new to GWAS. Based on my understanding so far, I have some questions.

Thank you for answering my question!

SNP GWAS • 1.3k views
2.2 years ago

You include n-1 genotype columns in your regression, where n is the number of alleles. (One allele, usually the highest-frequency one, must be omitted to avoid linear dependence in the regression.)
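To see why one allele has to be dropped, here is a minimal sketch in plain Python. The genotypes and the choice of T as the omitted allele are assumptions for illustration only. It shows that, for diploid samples, the per-allele count columns always sum to 2, so keeping all n of them would be linearly dependent with the intercept column.

```python
# Hypothetical genotypes at one triallelic site (alleles T, C, A); illustration only.
genotypes = ["T/T", "C/T", "C/C", "A/T"]
alleles = ["T", "C", "A"]  # T is assumed to be the highest-frequency allele


def allele_counts(genotype, alleles):
    """Count the copies of each allele in a diploid genotype string like 'C/T'."""
    called = genotype.split("/")
    return [called.count(a) for a in alleles]


# One count column per allele: n = 3 columns.
full = [allele_counts(g, alleles) for g in genotypes]

# Every row sums to 2, i.e. the three columns add up to 2 x the intercept column.
row_sums = [sum(row) for row in full]

# Dropping the T column leaves n - 1 = 2 columns (#C, #A) with no such dependence.
reduced = [row[1:] for row in full]

print(row_sums)
print(reduced)
```

Which allele is dropped only changes the baseline that the remaining effects are measured against.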


Suppose you have 4 samples; let's label them A, B, C, and D. Sample A has genotype T/T at this SNP, and phenotype value 175. Sample B has genotype C/T and phenotype value 160; sample C has genotype C/C and phenotype value 155; and sample D has genotype T/T and phenotype value 173.

A standard GWAS is based on [phenotype] ~ [genotype, intercept, other predictors] regressions. Ignoring "other predictors" for now, the data matrices for the regression at this SNP would look like

phenotype  intercept  #C
175        1          0
160        1          1
155        1          2
173        1          0


I've labeled the single genotype column "#C" here, representing "number of copies of the C allele".
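As a rough sketch of what the regression does with this table, the snippet below fits phenotype ~ intercept + #C with the closed-form least-squares formulas for a single predictor. The variable names are my own; real GWAS tools also compute a test statistic and include other predictors, which this leaves out.

```python
# Data copied from the four-sample table above.
phenotype = [175, 160, 155, 173]
c_copies = [0, 1, 2, 0]  # the "#C" column

n = len(phenotype)
mean_x = sum(c_copies) / n
mean_y = sum(phenotype) / n

# Closed-form ordinary least squares for one predictor plus an intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(c_copies, phenotype))
s_xx = sum((x - mean_x) ** 2 for x in c_copies)
slope = s_xy / s_xx                  # estimated effect per copy of C
intercept = mean_y - slope * mean_x  # estimated phenotype with zero copies of C

print(f"intercept = {intercept:.2f}, effect per copy of C = {slope:.2f}")
```

On these four samples the estimated effect is about -9.91 per copy of C, with an intercept of about 173.18.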

Now change sample D's genotype to A/T. This leaves the original data matrices unchanged, since neither A/T nor T/T has any copies of C. That may actually be fine for detecting whether the C allele has a noticeable effect, but we're now also interested in whether the A allele does. We investigate that by adding a #A column:

phenotype  intercept  #A  #C
175        1          0   0
160        1          0   1
155        1          0   2
173        1          1   0


Of course, with only 4 samples, we can't conclude much. But (with a good choice of "other predictors") this approach becomes quite effective as your sample size increases.
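For illustration, the two-column version can be fit with NumPy's least-squares solver. This is only a sketch on the toy table (assuming NumPy is installed), not what any particular GWAS tool does internally.

```python
import numpy as np

# Data copied from the four-sample table with the #A and #C columns.
phenotype = np.array([175.0, 160.0, 155.0, 173.0])
design = np.array(
    [
        [1, 0, 0],  # sample A: T/T
        [1, 0, 1],  # sample B: C/T
        [1, 0, 2],  # sample C: C/C
        [1, 1, 0],  # sample D: A/T
    ],
    dtype=float,
)  # columns: intercept, #A, #C

# Ordinary least squares; rank 3 confirms the columns are linearly independent.
coef, _, rank, _ = np.linalg.lstsq(design, phenotype, rcond=None)
intercept, effect_a, effect_c = coef

print(rank)
print(intercept, effect_a, effect_c)
```

Because sample D is the only carrier of A, the #A coefficient is determined entirely by that one sample, which is another way of seeing that four samples tell us very little.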


Ah, it's clear to me now, thank you

2.2 years ago
Asaf 8.9k
1. The model is usually linear, so 0, 1, 2 is the number of minor alleles in the genotype (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor), and the assumption is that two copies of the minor allele have twice the effect of one copy. This doesn't have to hold for every test and tool, but it is what I've seen. If there are alternative minor alleles, they can be treated as two different SNPs, assumed to have the same effect, or avoided altogether.
2. One way of dealing with epistasis could be to multiply the two SNPs' values and divide by 2 (to stay in the 0-2 range). I don't know of a tool that does this, but statistically it should be valid (assuming a linear interaction and additive effects).
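A minimal sketch of the interaction coding from point 2, with made-up genotype codes and a hypothetical helper name. The resulting column would be added as one more predictor in the regression, presumably alongside the two SNP columns themselves.

```python
# Hypothetical 0/1/2 minor-allele counts for two SNPs across five samples.
snp1 = [0, 1, 2, 1, 2]
snp2 = [2, 1, 2, 0, 1]


def interaction_term(a, b):
    """Product of two 0/1/2 genotype codes, halved so it stays in the 0-2 range."""
    return [x * y / 2 for x, y in zip(a, b)]


interaction = interaction_term(snp1, snp2)
print(interaction)
```

The term is nonzero only for samples that carry at least one minor allele at both SNPs.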