Regression - when to include interaction term?
20 months ago
Beqy ▴ 30

Hey everyone,

I'm inexperienced with statistics and want to run a regression relating two diseases. I would really appreciate it if someone could confirm whether my understanding of when to include an interaction term is correct.

Let's call the diseases disease Y and disease X. I know that disease Y is age-dependent, i.e. it's more likely to be encountered in older individuals and it becomes progressively worse with age. So I think the regression should definitely include age as a covariate:

Y ~ age + X

I'm not sure whether disease X is also age-dependent, but it might be. My planned approach was to look at the data and check whether an independent two-sample Student's t-test detects a significant difference in age between people with and without disease X. If it does, I would change the regression formula to include an interaction term:

Y ~ age + age:X + X

Would this approach be correct?

Additionally, would it matter if my variable Y represents case/control status or disease severity (i.e. logistic vs linear regression)?

covariate interaction regression statistics • 1.2k views
20 months ago
Jeremy ▴ 880

It's best practice to first check if your variables are correlated. If they are, you should either drop one or combine them into one variable. In R:

cor.test(your_data$age, your_data$X)

I would drop one of the variables if |r| >= 0.5, although others may use a different cutoff. If they are correlated, I would keep the variable with the lowest p-value. Alternatively, you could combine age and X into one variable by adding them or taking their average. To find p-values:

model = lm(Y ~ age + X, data = your_data)
summary(model)

If age and X are not correlated, then you can see if there is an interaction.

int.model = lm(Y ~ age + X + age:X, data = your_data)
summary(int.model)

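If the interaction term has a significant p-value, then you'll want to include it in your model. If not, then you'll want to drop it. The same reasoning applies to both linear and logistic regression: use linear regression if Y is a continuous measure such as disease severity, and logistic regression if Y is case/control status. For logistic regression, you would use the following: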

# Y should be coded 0/1 (or be a two-level factor) for family = binomial
logit.model = glm(Y ~ age + X + age:X, data = your_data, family = binomial)
summary(logit.model)
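
As a rough sketch of the interaction check on simulated data, the two nested models can also be compared directly with anova() rather than by reading a single row of summary(). The data frame sim_data, the sample size, and all effect sizes below are made up for illustration; they are not from the original question.

```r
# Simulated data; every name and effect size here is an illustrative assumption
set.seed(1)
n <- 500
sim_data <- data.frame(age = runif(n, 20, 80), X = rbinom(n, 1, 0.4))

# Y depends on age, on X, and on their product (a true interaction)
sim_data$Y <- 0.05 * sim_data$age + 0.5 * sim_data$X +
  0.03 * sim_data$age * sim_data$X + rnorm(n)

main.model <- lm(Y ~ age + X, data = sim_data)
int.model <- lm(Y ~ age * X, data = sim_data)  # age * X expands to age + X + age:X

# F-test of the model without vs. with the interaction term
cmp <- anova(main.model, int.model)
print(cmp)
```

For a logistic model, the analogous comparison would be anova() on two glm fits with test = "Chisq".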

Thank you for the detailed answer and examples! I was a bit off then, thinking that correlation and interaction were similar or the same thing.


You're welcome! Correlation describes the relationship between two variables (e.g. age and X), whereas interaction describes how the effect of one variable on Y changes with the level of the other. Also, if you decide to combine two variables, you should take the average. Don't just add them as I suggested above.
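
To make the distinction concrete, here is a small simulated sketch with two hypothetical scenarios: in case A, X is correlated with age but there is no interaction in Y; in case B, X is independent of age but the age effect on Y differs by X. All variable names and effect sizes are invented for illustration.

```r
set.seed(2)
n <- 1000
age <- runif(n, 20, 80)

# Case A: X becomes more likely with age (correlated), no interaction in Y
X_a <- rbinom(n, 1, plogis((age - 50) / 10))
Y_a <- 0.05 * age + 0.5 * X_a + rnorm(n)

# Case B: X is independent of age, but the slope of age differs by X
X_b <- rbinom(n, 1, 0.5)
Y_b <- 0.05 * age + 0.05 * age * X_b + rnorm(n)

# Correlation between the two predictors in each case
cor_a <- cor(age, X_a)
cor_b <- cor(age, X_b)

# p-value of the interaction term in each case
p_int_a <- summary(lm(Y_a ~ age * X_a))$coefficients["age:X_a", "Pr(>|t|)"]
p_int_b <- summary(lm(Y_b ~ age * X_b))$coefficients["age:X_b", "Pr(>|t|)"]

print(c(cor_a = cor_a, cor_b = cor_b, p_int_a = p_int_a, p_int_b = p_int_b))
```

Under these assumptions one would expect a clear correlation but usually no significant interaction in case A, and the reverse in case B, so a check for one says little about the other.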


If I find that X does in fact correlate with age, would it also be an option to first regress X on age and then use the residual in the actual regression instead of X?

Taking an average seems odd to me in the scenario I have in mind, if for example I have X denoted as 0/1 (unaffected or affected by the disease), as the average of age 59 + X 1 would be the same as age 60 + X 0. Or am I misunderstanding this?
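
On the residual idea, a quick simulated sketch may help. Assuming X is treated as a numeric 0/1 variable and residualised with an ordinary linear regression on age, the coefficient for the residual in a model that also contains age comes out the same as the coefficient for X itself in Y ~ age + X. All data and effect sizes below are invented for illustration.

```r
set.seed(3)
n <- 500
age <- runif(n, 20, 80)
X <- rbinom(n, 1, plogis((age - 50) / 10))  # X correlated with age
Y <- 0.05 * age + 0.5 * X + rnorm(n)

# Residual of X after a linear regression on age (X treated as numeric 0/1)
X_resid <- resid(lm(X ~ age))

# Coefficient of X when adjusting for age directly
b_full <- coef(lm(Y ~ age + X))["X"]
# Coefficient of the residualised X, still adjusting for age
b_resid <- coef(lm(Y ~ age + X_resid))["X_resid"]

print(c(b_full, b_resid))
print(all.equal(unname(b_full), unname(b_resid)))
```

Only the intercept and the age coefficient differ between the two fits, so in this setup residualising X first does not change the estimate for X compared with simply including age as a covariate.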


The simplest solution would be to drop X. If age and X are highly correlated, then dropping one of them will have little effect on your model. For combining numerical variables, you would need to scale them first. This gives each variable a mean of 0 and a standard deviation of 1. Otherwise, variables with large absolute values would contribute more to the new variable. I'm not sure it makes sense to combine a numerical variable with a categorical one, though.
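
A minimal sketch of the scaling step, using two made-up numeric variables (v1 and v2 are hypothetical and stand in for whatever you would combine):

```r
set.seed(4)
v1 <- rnorm(200, mean = 60, sd = 12)    # variable on a large, age-like scale
v2 <- rnorm(200, mean = 0.5, sd = 0.1)  # variable on a much smaller scale

# scale() centers to mean 0 and divides by the standard deviation;
# it returns a one-column matrix, so convert back to a plain vector
z1 <- as.numeric(scale(v1))
z2 <- as.numeric(scale(v2))

# After scaling, both variables contribute equally to the average
combined <- (z1 + z2) / 2

print(round(c(mean_z1 = mean(z1), sd_z1 = sd(z1),
              mean_z2 = mean(z2), sd_z2 = sd(z2)), 6))
```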


Thank you for the replies again! I'm not sure, but I think dropping X or dropping age is not an option for me, since I don't want to model Y as such but want to test for an association between X and Y. If I drop age, I think I would probably see a correlation between X and Y even if it is only based on their shared correlation with age. If I drop X, I wouldn't be able to test my hypothesis.

Good point about scaling! I will think about that.
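
The concern about dropping age can be illustrated with a small simulation. In this hypothetical setup, both X and Y depend on age, but X has no direct effect on Y; all names and effect sizes are assumptions chosen for illustration.

```r
set.seed(5)
n <- 2000
age <- runif(n, 20, 80)
X <- rbinom(n, 1, plogis((age - 50) / 10))  # X is more common at older ages
Y <- 0.05 * age + rnorm(n)                  # Y depends on age only, not on X

unadj <- summary(lm(Y ~ X))$coefficients        # age dropped
adj <- summary(lm(Y ~ age + X))$coefficients    # age kept as a covariate

b_unadj <- unadj["X", "Estimate"]
p_unadj <- unadj["X", "Pr(>|t|)"]
b_adj <- adj["X", "Estimate"]
p_adj <- adj["X", "Pr(>|t|)"]

print(c(b_unadj = b_unadj, p_unadj = p_unadj, b_adj = b_adj, p_adj = p_adj))
```

Without age in the model, X picks up the age effect and appears associated with Y; with age included, the estimate for X should shrink towards zero. This is why keeping both age and X in the model fits the goal of testing the X-Y association.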
