8.7 years ago by

Charlottesville Virginia

Thanks Neil. I did exactly what you recommended using the iris data and the results are strikingly different. Looks like it's what both of us expected - you likely can't simply use the PCs computed on all the data for an analysis with a subset.

```
data(iris)
# Fit prcomp to all 150 observations
pca_all <- prcomp(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iris)
# Extract data rotated multiplied by rotation matrix and merge with orig data.
iris_pca_all <- data.frame(iris,pca_all$x)
# Plot PC2 by PC1 for only the subset of 'setosa' species, add linear trendline
with(subset(iris_pca_all, Species=='setosa'), plot(PC1,PC2, pch=20))
abline(lm(PC2~PC1, data=subset(iris_pca_all, Species=='setosa')))
# Fit prcomp to only the 50 setosa species individuals
pca_setosa <- prcomp(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=subset(iris_pca_all, Species=='setosa'))
# Plot PC2 by PC1 for PC loadings computed from the subset only.
with(as.data.frame(pca_setosa$x), plot(PC1,PC2,pch=20))
abline(lm(PC2~PC1, data=as.data.frame(pca_setosa$x)))
```

I would suggest asking here: http://stats.stackexchange.com/

1.1k