Question: R: Linear model with interaction: quantitative variable that depends on both quantitative and qualitative variables.
1
rednalf60 wrote:

I have a data frame that looks like this:

``````ID  SEX  PHENO  HLA1  HLA2  HLA3  KIR1  KIR2      Output        PC1
A1    1      1     2     1     1     1     1    4.643453    -0.0062
A2    2      1     1     1     1     1    NA    4.954243    -0.0207
``````

I am interested in analysing the effect of the interaction between the HLAs and the KIRs, depending on several covariates, using R.

In the data set, I have the following variables:

• ID: individual IDs
• SEX: categorical variable having two categories (male and female) = covariate
• HLA (1 to 3): observed HLA phenotype = categorical variable (NA: missing; 1: absent; 2: present) =
• KIR (1 to 2): observed KIR phenotype = categorical variable (NA: missing; 1: absent; 2: present)
• Output: continous quantitative variable
• PC1: continous quantitative variable = covariate

Per KIR/HLA interaction, I was thinking in creating the interaction model using a linear regression model with interaction as follows:

lm(Output ~ HLA + KIR + HLA*KIR + PC1 + SEX)

...where PC1 and SEX are the covariates.

Does someone know if this is a good model since I have both categorical and quantitative variables in my dataset? Should I use another model?

written 3.8 years ago by rednalf60

`HLA*KIR` is the same as `HLA + KIR + HLA:KIR`, so you have a bit of redundancy. Aside from that I don't see anything obviously wrong.

Standard practise is to encode your categorical variates as a numerical value (a dummy / indicator variable). R's model formulas will convert categorical variables (ie, factors) into dummies automatically, unfortunately you've encoded HLA (and KIR) in a way that won't permit that. For your mdel formula to work, HLA would have to be encoded as a factor in a single column, presumably this would have 8 levels (None, HLA1, HLA2, HLA3, HLA12, ..., HLA123), although I can't quite follow what the difference between 'missing' and 'absent' is in your description.

The model you have proposed is very complicated, and I would strongly urge you to compare the results against smaller nested models

Thank you for your answer. In the example above, missing (NA) means that no phenotype data is available, while absent (1) means that a certain phenotype is known to be absent for the individual (vs. present). If I change the coding from 1/2 to 0/1, 0 being absent and 1 present can R then convert the categorical variables into dummies automatically?