LInear Models with HLA-B alleles in R
1
0
Entering edit mode
7 weeks ago
schagas • 0

Hello,

I think this question might be more a statistics understanding than a code troubleshooting. Still, I am posting here because there are more R-users with a better knowledge of biological applications.

I want to understand some differences to represent the allele presence/absence with lm() function in R. As a dependent variable, I have ICU_days period, which I am trying to relate to HLA-B alleles. So, as an independent variable, I have HLA-B alleles adjusted for Age.

The first representation would be a dataset containing all alleles present in my sample as lm_designmatrix2:

id ICU_days Age allele
1         49  74   B*35
2         45  73   B*07
3         20  63   B*07
4         52  55   B*08
5         29  46   B*07
6         66  77   B*38
7         44  76   B*08
8         76  65   B*08
9         47  34   B*14
10        25  58   B*13
11        30  57   B*35
12         7  80   B*07
13        30  27   B*15
14        28  79   B*40
15         2  68   B*07
16        20  56   B*07
17        20  85   B*07
18        24  58   B*35
19        31  47   B*44
20        11  58   B*07


The linear model:

summary(lm(ICU_days ~ allele + Age, data = lm_designmatrix2))

Call:
lm(formula = ICU_days ~ allele + Age, data = lm_designmatrix2)

Residuals:
Min      1Q  Median      3Q     Max
-29.627 -11.032  -2.047   7.942  63.348

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.37896    7.75141   2.371   0.0190 *
alleleB*08    7.79160    6.49281   1.200   0.2321
alleleB*13   20.30319    7.62279   2.663   0.0086 **
alleleB*14   10.58582    7.56816   1.399   0.1640
alleleB*15   13.78044    6.81690   2.022   0.0450 *
alleleB*18    7.68169    7.56604   1.015   0.3116
alleleB*27  -17.64699   13.32670  -1.324   0.1875
alleleB*35    8.23656    5.62620   1.464   0.1453
alleleB*37   35.60011   18.31386   1.944   0.0538 .
alleleB*38   25.62458    9.83688   2.605   0.0101 *
[...] etc

Age           0.05827    0.10997   0.530   0.5970
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.8 on 147 degrees of freedom
Multiple R-squared:  0.1607,    Adjusted R-squared:  0.02362
F-statistic: 1.172 on 24 and 147 DF,  p-value: 0.2766


However, even though I know that is a dumb question, I didn't realize the difference between the above model with a dataset with dummy variables for each allele:

ICU_days   Age B*07 B*08 B*13 B*14
<int> <int>  <int>  <int>  <int>  <int>
1       30    27      0      0      0      0
2       47    34      0      0      0      1
3       21    35      0      0      0      0
4       22    37      1      0      0      0
5        3    39      0      0      0      0
6       14    39      1      1      0      0
7       49    39      0      0      0      1
8       84    39      1      0      1      0
9       30    40      0      0      0      0
10       32    40      0      0      0      0


And it performs linear modeling for each allele separately, like this:


summary(lm(ICU_days ~ Age + B*13, data = lm_designmatrix))

Call:
lm(formula = ICU_days ~ Age + B*13, data = lm_designmatrix)

Residuals:
Min     1Q Median     3Q    Max
-27.29 -14.72  -3.76  10.74  47.00

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.70441    8.70438   2.608   0.0108 *
Age          0.09689    0.14743   0.657   0.5129
B*13      14.02777    6.68684   2.098   0.0390 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.8 on 83 degrees of freedom
Multiple R-squared:  0.05161,   Adjusted R-squared:  0.02876
F-statistic: 2.258 on 2 and 83 DF,  p-value: 0.1109


Could anyone explain how to interpret these different representations?

R HLA • 357 views
2
Entering edit mode
7 weeks ago
LChart 840

Your first approach can be visualized as this picture:

With the "(Intercept)" coefficient, in your case, corresponding to the mean of Group 1.

The second approach can be visualized as this sequence of pictures:

where your "(Intercept)" corresponds to the mean value of "not_group". Note that if you are interested in the effect of an HLA allele compared to other HLA alleles, the second approach can be problematic, as the variance of "other group" is high due to mixing HLA alleles together into one group.

For instance, based on the last plot (and the corresponding regression), you may (incorrectly) infer that Group 3 is "not all that different" from other HLA groups, when in fact the first figure provides a clear readout that it is visually different from groups 1,2,4 and 7.

0
Entering edit mode

Thanks LChart ! So would the second approach be a loss in the "accuracy" of the estimates?

And Anyway, the two ways are not wrong, since depending on the statistical question can I choose one of them?

1
Entering edit mode

The estimates are accurate in both cases, but as you say, the questions are different. It is fairly common for contrasts of the second form to be used to make statements about contrasts of the first form -- such usage is inappropriate.