Question: R: Linear model with interaction: quantitative variable that depends on both quantitative and qualitative variables.
gravatar for rednalf
3.8 years ago by
rednalf60 wrote:

I have a data frame that looks like this:

ID  SEX  PHENO  HLA1  HLA2  HLA3  KIR1  KIR2      Output        PC1
A1    1      1     2     1     1     1     1    4.643453    -0.0062   
A2    2      1     1     1     1     1    NA    4.954243    -0.0207

I am interested in analysing the effect of the interaction between the HLAs and the KIRs, depending on several covariates, using R.

In the data set, I have the following variables:

  • ID: individual IDs
  • SEX: categorical variable having two categories (male and female) = covariate
  • HLA (1 to 3): observed HLA phenotype = categorical variable (NA: missing; 1: absent; 2: present) =
  • KIR (1 to 2): observed KIR phenotype = categorical variable (NA: missing; 1: absent; 2: present)
  • Output: continous quantitative variable
  • PC1: continous quantitative variable = covariate

Per KIR/HLA interaction, I was thinking in creating the interaction model using a linear regression model with interaction as follows:

lm(Output ~ HLA + KIR + HLA*KIR + PC1 + SEX)

...where PC1 and SEX are the covariates.

Does someone know if this is a good model since I have both categorical and quantitative variables in my dataset? Should I use another model?

Thank you for your help.

ADD COMMENTlink written 3.8 years ago by rednalf60

HLA*KIR is the same as HLA + KIR + HLA:KIR, so you have a bit of redundancy. Aside from that I don't see anything obviously wrong.

ADD REPLYlink written 3.8 years ago by Devon Ryan95k

Standard practise is to encode your categorical variates as a numerical value (a dummy / indicator variable). R's model formulas will convert categorical variables (ie, factors) into dummies automatically, unfortunately you've encoded HLA (and KIR) in a way that won't permit that. For your mdel formula to work, HLA would have to be encoded as a factor in a single column, presumably this would have 8 levels (None, HLA1, HLA2, HLA3, HLA12, ..., HLA123), although I can't quite follow what the difference between 'missing' and 'absent' is in your description.

The model you have proposed is very complicated, and I would strongly urge you to compare the results against smaller nested models

ADD REPLYlink written 3.8 years ago by russhh5.4k

Thank you for your answer. In the example above, missing (NA) means that no phenotype data is available, while absent (1) means that a certain phenotype is known to be absent for the individual (vs. present). If I change the coding from 1/2 to 0/1, 0 being absent and 1 present can R then convert the categorical variables into dummies automatically?

ADD REPLYlink written 3.8 years ago by rednalf60

Not quite. Modifying the dataframe directly wouldn't include the HLA/KIR interaction term and if you want to define your own design matrix, you should use instead of lm().

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by russhh5.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 756 users visited in the last hour