Hey everyone,
I'm inexperienced with statistics and want to run a regression between two diseases. I would really appreciate some clarification on whether my understanding of when to include an interaction term is correct.
Let's call the diseases disease Y and disease X. I know that disease Y is age-dependent, i.e. it's more likely to be encountered in older individuals and it becomes progressively worse with age. So I think the regression should definitely include age as a covariate:
Y ~ age + X
I'm not sure whether disease X is also age-dependent, but it might be. My planned approach was to look at the data and check whether an independent-samples Student's t-test detects a significant difference in age between people with and without disease X. If yes, I would change the regression formula to include an interaction term:
Y ~ age + age:X + X
Would this approach be correct?
Additionally, would it matter if my variable Y represents case/control status or disease severity (i.e. logistic vs linear regression)?
Thank you for the detailed answer and examples! I was a bit off then, thinking that correlation and interaction are similar or the same thing.
You're welcome! Correlation describes the relationship between two variables (e.g. age and X), whereas interaction describes how the effect of one variable on Y changes with the level of another (e.g. whether the effect of X on Y differs by age). Also, if you decide to combine two variables, you should take the average. Don't just add them as I suggested above.
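To make the distinction concrete, here is a rough sketch on simulated data (not anyone's real data). It assumes NumPy is available, and every number in it (sample size, effect sizes, the age range) is made up for illustration. X is set up to be more common in older people, so age and X are correlated, but the effect of X on Y is the same at every age, so there is no interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated (made-up) data: X is more common at older ages,
# but the effect of X on Y is the same at every age.
age = rng.uniform(30, 80, n)
p_x = 1 / (1 + np.exp(-(age - 55) / 8))
x = (rng.uniform(size=n) < p_x).astype(float)
y = 0.05 * age + 1.0 * x + rng.normal(0, 1, n)


def ols(columns, response):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# Correlation between age and X: mean age differs between the groups.
age_gap = age[x == 1].mean() - age[x == 0].mean()

# Interaction: fit Y ~ age + X + age:X; the last coefficient is age:X.
coef = ols([age, x, age * x], y)
interaction = coef[3]

print(f"mean age gap (X=1 minus X=0): {age_gap:.1f} years")
print(f"estimated age:X coefficient:  {interaction:.4f}")
```

In this setup the age gap between the groups is large while the age:X coefficient stays close to zero, so a significant t-test on age by itself says nothing about whether the interaction term is needed.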
If I find that X does in fact correlate with age, would it also be an option to first regress X on age and then use the residual in the actual regression instead of X?
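For what it's worth, in an ordinary linear regression the residual approach can be checked directly. The sketch below uses simulated, made-up data and assumes NumPy; `fit` is just a helper written for this example. It compares the coefficient of X in Y ~ age + X with the coefficient obtained by regressing Y on the residual of X ~ age.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated (made-up) data: X depends on age, Y depends on age and X.
age = rng.uniform(30, 80, n)
x = (rng.uniform(size=n) < 1 / (1 + np.exp(-(age - 55) / 8))).astype(float)
y = 0.05 * age + 0.5 * x + rng.normal(0, 1, n)


def fit(columns, response):
    """Return least-squares coefficients (intercept first) and residuals."""
    design = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef, response - design @ coef


# Step 1: regress X on age and keep the residual.
_, x_resid = fit([age], x)

# Step 2: compare the full model with the residual-only model.
coef_full, res_full = fit([age, x], y)       # Y ~ age + X
coef_resid, res_resid = fit([x_resid], y)    # Y ~ resid(X ~ age)

print(f"X coefficient, full model:     {coef_full[2]:.4f}")
print(f"X coefficient, residual model: {coef_resid[1]:.4f}")
print(f"residual sum of squares: full {np.sum(res_full**2):.1f}, "
      f"residual model {np.sum(res_resid**2):.1f}")
```

The two point estimates agree, but the residual-only model leaves the age effect on Y unexplained, so its residual variance (and with it the standard error of the X coefficient) comes out larger. This equivalence is a property of linear least squares; it should not be assumed to carry over exactly to logistic regression.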
Taking an average seems odd to me in the scenario I have in mind, if for example I have X denoted as 0/1 (unaffected or affected by the disease), as the average of age 59 + X 1 would be the same as age 60 + X 0. Or am I misunderstanding this?
The simplest solution would be to drop X. If age and X are highly correlated, then dropping one of them will have little effect on your model. For combining numerical variables, you would need to scale them first. This gives each variable a mean of 0 and a standard deviation of 1. Otherwise, variables with large absolute values would contribute more to the new variable. I'm not sure it makes sense to combine a numerical variable with a categorical one, though.
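A minimal sketch of the scaling step, assuming NumPy and two numerical variables. The values and the `score` variable are hypothetical, chosen only to show the mechanics.

```python
import numpy as np


def standardize(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)


# Hypothetical example values, not real data.
age = np.array([34, 47, 52, 61, 68, 75])
score = np.array([1.2, 0.8, 2.5, 3.1, 2.9, 4.0])  # some numeric measurement

age_z = standardize(age)
score_z = standardize(score)

# Combine by averaging the scaled variables.
combined = (age_z + score_z) / 2

# For comparison: the unscaled sum is dominated by age,
# because age has much larger absolute values than score.
raw_sum = age + score
print(np.corrcoef(raw_sum, age)[0, 1])
print(combined)
```

After scaling, each variable contributes on the same footing to the average, which is the point of standardizing before combining.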
Thank you for the replies again! I'm not sure, but I think dropping X or dropping age is not an option for me, since I don't want to model Y but want to test for an association between X and Y. If I drop age, I would probably see a correlation between X and Y even if it is only due to their shared correlation with age. If I drop X, I wouldn't be able to test my hypothesis.
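The concern about dropping age can be illustrated with a small simulation. This is only a sketch under made-up assumptions (NumPy, simulated data, arbitrary effect sizes): Y depends on age but not on X at all, while X is more common at older ages.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated (made-up) data: both X and Y depend on age,
# but X has no effect on Y.
age = rng.uniform(30, 80, n)
x = (rng.uniform(size=n) < 1 / (1 + np.exp(-(age - 55) / 8))).astype(float)
y = 0.05 * age + rng.normal(0, 1, n)


def ols(columns, response):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


naive = ols([x], y)[1]          # Y ~ X, age dropped
adjusted = ols([age, x], y)[2]  # Y ~ age + X

print(f"X coefficient without age: {naive:.3f}")
print(f"X coefficient with age:    {adjusted:.3f}")
```

Without age in the model, X picks up the age effect and looks associated with Y; with age included, the X coefficient shrinks toward zero, which matches the reasoning for keeping both variables.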
Good point about scaling! I will think about that.