## 1, create random data

```
randomdata <- matrix(rexp(800, rate=.1), ncol = 40)
colnames(randomdata) <- paste0("sample", 1:ncol(randomdata))
rownames(randomdata) <- paste0("gene", 1:nrow(randomdata))
randomdata[1:5,1:5]
sample1 sample2 sample3 sample4 sample5
gene1 0.3603136 1.159749 3.0826237 3.309323 0.5730092
gene2 35.0652692 20.814076 5.7487437 3.502387 1.2635026
gene3 8.2158417 6.442191 2.1134863 26.724393 9.2087334
gene4 19.6870141 6.617958 5.7245038 10.698933 9.1174246
gene5 6.8046549 5.262233 0.2679569 25.170229 8.3293813
```

## 2, create fake metadata for `DiseaseStatus`

and `BMI`

```
metadata <- data.frame(DiseaseStatus = c(rep("case", 20), rep("control", 20)), BMI = sample(10:40, 40, replace = TRUE))
head(metadata)
DiseaseStatus BMI
1 case 12
2 case 12
3 case 18
4 case 20
5 case 27
6 case 32
```

## 3, perform PCA and plot bi-plot for PC1 and PC2

```
pca <- prcomp(t(randomdata))
plot(pca$x[,1:2])
```

Also see my answer here for more 'PCA' plots using base R functions: A: PCA plot from read count matrix from RNA-Seq

## 4, now prepare an object for regression modeling

```
modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))
modeling[1:5,1:5]
DiseaseStatus BMI PC1 PC2 gene1
sample1 case 12 -18.485678 8.3374667 0.3603136
sample2 case 12 -3.393260 -0.9962191 1.1597494
sample3 case 18 6.579711 -10.1392232 3.0826237
sample4 case 20 13.311610 26.2884376 3.3093231
sample5 case 27 16.998644 -3.0140936 0.5730092
# set control as reference level
modeling$DiseaseStatus <- factor(modeling$DiseaseStatus, levels = c('control','case'))
```

## 5, binary logistic regression - adjust for `BMI`

, `PC1`

, and `PC2`

```
summary(glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit')))
Call:
glm(formula = DiseaseStatus ~ gene1 + BMI + PC1 + PC2, family = binomial(link = "logit"),
data = modeling)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.43692 -1.12846 0.01018 1.15101 1.33862
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.266677 1.030321 -0.259 0.796
gene1 -0.009663 0.041650 -0.232 0.817
BMI 0.013031 0.036195 0.360 0.719
PC1 -0.004580 0.018005 -0.254 0.799
PC2 -0.010359 0.020470 -0.506 0.613
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 55.452 on 39 degrees of freedom
Residual deviance: 54.908 on 35 degrees of freedom
AIC: 64.908
Number of Fisher Scoring iterations: 4
```

## 6, linear regression - adjust for `BMI`

, `PC1`

, and `PC2`

```
summary(lm(gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling))
Call:
lm(formula = gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling)
Residuals:
Min 1Q Median 3Q Max
-8.447 -5.121 -2.338 3.445 31.670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.25736 4.11994 2.004 0.0528 .
DiseaseStatuscase -0.56933 2.60968 -0.218 0.8286
BMI -0.04532 0.14733 -0.308 0.7602
PC1 0.13024 0.06949 1.874 0.0693 .
PC2 -0.11322 0.08107 -1.397 0.1713
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.202 on 35 degrees of freedom
Multiple R-squared: 0.1373, Adjusted R-squared: 0.03869
F-statistic: 1.392 on 4 and 35 DF, p-value: 0.2567
```

## -------------------------------------

You should have good justification for wanting to adjust for PCs. Usually, it is performed in genetic association studies for the purpose of adjusting for population stratification.

Don't just adjust for PCs because your supervisor may have told you to do it - question him/her why. One should not add covariates to a regression model without justification.

Kevin