Question: How to calculate the increase in qualitative disease score per risk allele?
5.5 years ago by
analyticalavailable30 wrote:


I'd like to calculate the increase in qualitative disease score per risk allele. I have a cohort of 1000 individuals and a single SNP, and for each individual I have a score from 0 to 100. It was recommended that I use a linear regression model.

Can anyone elaborate on why you might use a linear regression model for this? Are there any other models that would fit this task? Suggestions on R packages I could use are welcome.


snp R • 1.8k views
modified 5.5 years ago by Devon Ryan96k • written 5.5 years ago by analyticalavailable30
5.5 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

The versions of linear regression (well, a linear model really) that you're familiar with are ANOVAs and t-tests. This is essentially an ANOVA, possibly with a partial factorial design (depending on how you want to set things up).

BTW, the alternative to this would be to divide the scores by 100 and use logistic regression. This would be to get around some ceiling and floor effects in more complex models. Of course, if you essentially only have two genotype groups, then a Wilcoxon rank-sum test (aka Mann-Whitney U test) could work, but that only makes sense with exactly two groups. It's essentially a non-parametric t-test.
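
As a rough sketch of those two alternatives on simulated data (the data frame `d`, its column names and the group means are made up for illustration, not taken from the real cohort): a logistic-type fit on the rescaled score can be done with `glm()` and the `quasibinomial` family, and `wilcox.test()` covers the two-group comparison.

```r
set.seed(1)
# Hypothetical data: 1000 scores in [0, 100] for three genotype groups
d <- data.frame(
  genotype = factor(rep(c("GG", "GA", "AA"), times = c(500, 350, 150)),
                    levels = c("GG", "GA", "AA")),
  val = c(rnorm(500, 20, 20), rnorm(350, 40, 20), rnorm(150, 50, 20))
)
d$val <- pmin(pmax(d$val, 0), 100)  # clamp to the 0-100 range

# Logistic-type regression on the score rescaled to [0, 1];
# quasibinomial accepts non-integer proportions without a warning
d$prop <- d$val / 100
fit_logit <- glm(prop ~ genotype, family = quasibinomial, data = d)
summary(fit_logit)

# Wilcoxon rank-sum (Mann-Whitney U) test, only valid for two groups,
# here GG versus GA
two <- droplevels(subset(d, genotype %in% c("GG", "GA")))
wt <- wilcox.test(val ~ genotype, data = two)
wt
```

The coefficients of `fit_logit` are on the log-odds scale, so they are not directly comparable to the score differences reported by `lm()`.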

Anyway, graph the data and have a look. Then fit it with lm() in R and plot that too to see if the fit seems reasonable.
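
A minimal sketch of that workflow, assuming a data frame `d` with a `genotype` factor and a numeric `val` column (both names and the simulated values are placeholders for the real data):

```r
set.seed(1)
# Hypothetical stand-in for the real cohort
d <- data.frame(
  genotype = factor(rep(c("GG", "GA", "AA"), times = c(500, 350, 150)),
                    levels = c("GG", "GA", "AA")),
  val = c(rnorm(500, 20, 20), rnorm(350, 40, 20), rnorm(150, 50, 20))
)
d$val <- pmin(pmax(d$val, 0), 100)  # clamp to the 0-100 range

# Look at the raw scores per genotype first
boxplot(val ~ genotype, data = d, xlab = "Genotype", ylab = "Score")

# Fit the linear model, then draw the standard diagnostic plots
fit <- lm(val ~ genotype, data = d)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```

If the residual plots show strong piling-up at 0 or 100, that is a hint that the ceiling and floor effects matter and one of the alternatives may fit better.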

written 5.5 years ago by Devon Ryan96k

OK, I'm kind of new to this area, so I am coming in a little blind. I have a score that varies between 0 and 100 for 1000 people. For each of those, I have the genotype coded as GG (most common homozygote) = 0, GA = 1 and AA (least common homozygote) = 2. What I am gathering from online is that I need to break each genotype down into its own covariate in the model in order to get the differences between the groups in terms of the score. So I figured I would look into a multiple linear regression model in SPSS, with a column for the score and then one for each of the genotypes, the score being the dependent variable and the genotypes being the three explanatory variables.

First, there will be a different number of values for each genotype: GG n=500, GA n=350 and AA n=150. If I were to set the score as my dependent variable (n=1000), how does that work, since the score wouldn't correspond across the row? This differs from a more standard model, where you would have, say, earnings as the dependent variable (n=1000) and education level (n=1000) and experience (n=1000) as your explanatory variables. Do you get what I mean?

written 5.5 years ago by analyticalavailable30

I'll give an example using R, since I don't use SPSS. There are actually two ways to go about this:

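# Simulate 1000 scores for three genotype groups (GG is the reference level)
d <- data.frame(
  genotype = factor(rep(c("GG", "GA", "AA"), times = c(500, 350, 150)), levels = c("GG", "GA", "AA")),
  val = c(rnorm(500, 20, 20), rnorm(350, 40, 20), rnorm(150, 50, 20))
)
# Clamp the simulated scores to the 0-100 range
d$val[d$val < 0] <- 0
d$val[d$val > 100] <- 100
# Genotype as a factor: one coefficient each for GA and AA, relative to GG
summary(lm(val ~ genotype, d))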

The other way is to make this an additive model so that you can estimate a homozygous A interaction (i.e., whether there's a simple additive effect of the genotype or if two copies of A produce a non-linear effect).
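
One way to set this up, as a sketch rather than necessarily the exact model meant here: code the number of A alleles as a numeric covariate and add an indicator for the AA homozygote. The column names `nA` and `homA` are made up for this example, and the data are simulated.

```r
set.seed(1)
# Hypothetical data: 1000 scores in [0, 100] for three genotype groups
d <- data.frame(
  genotype = factor(rep(c("GG", "GA", "AA"), times = c(500, 350, 150)),
                    levels = c("GG", "GA", "AA")),
  val = c(rnorm(500, 20, 20), rnorm(350, 40, 20), rnorm(150, 50, 20))
)
d$val <- pmin(pmax(d$val, 0), 100)  # clamp to the 0-100 range

# Number of A alleles per individual: GG = 0, GA = 1, AA = 2
d$nA <- as.integer(d$genotype) - 1L
# Indicator for the AA homozygote
d$homA <- as.integer(d$genotype == "AA")

# Purely additive model: the nA coefficient is the change in score per A allele
fit_add <- lm(val ~ nA, data = d)
summary(fit_add)

# Additive term plus homozygous-A term: homA measures how far the
# AA group departs from twice the single-allele effect
fit_dev <- lm(val ~ nA + homA, data = d)
summary(fit_dev)

# F-test for whether the extra term improves on the additive fit
anova(fit_add, fit_dev)
```

With three genotypes, `fit_dev` is just a re-parameterisation of the factor model `val ~ genotype`, so the fitted group means are identical. What changes is that the coefficients now read directly as a per-allele effect and a deviation from additivity.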

written 5.5 years ago by Devon Ryan96k

I actually prefer R, so thank you very much for this. I believe that it is the additive model that I am looking for right now, since I would like to see what the interaction with the homozygous A is.

Would you be able to suggest the function/library that is used to do that kind of analysis?

Data is currently organised something like the following:

Col1      Col2
GA        27.0
GA        57.0
GG        87.0
AA        15.0

I'll play around with the R implementation and try to get that working, but I am new to it.

written 5.5 years ago by analyticalavailable30

Running both the lm and gam, I find that I am not getting the results that I need from the summary/anova.

(per G allele IRR 0.89, 95% confidence interval [95% CI] 0.82, 0.97; PLR  0.002)

So, in my case I should be seeing a percentage lower risk of the disease for AA over GA and GG.

So in the above case, GA individuals have an 11% lower risk than those with AA.

Any idea on where I should look for that information?

modified 5.5 years ago • written 5.5 years ago by analyticalavailable30

Not off-hand, no.

written 5.5 years ago by Devon Ryan96k