I was wondering if anyone can help me understand how to perform factor analysis/loading in R to assess the correlation between individual GEO-derived samples (arrays) and the first principal component. I refer you to the paper by Heijink et al. in Oncogene (2011) (DOI: 10.1038/onc.2010.578; PMID: 21217777), where this method was used as a quality control index (Figure 1).
I have collected GPL570 microarray data from GEO (32 samples). PCA on the log2 probeset values (which I transposed for the analysis, so that rows = samples and columns = probesets) gives me the following cumulative proportion of variance per PC:
PC1 (0.13), PC2 (0.21), PC3 (0.28), PC4 (0.34), PC5 (0.38), PC6 (0.43), PC7 (0.47), PC8 (0.50), PC9 (0.53), PC10 (0.57), PC11 (0.60), PC12 (0.62), PC13 (0.65), PC14 (0.68), PC15 (0.70), PC16 (0.73), PC17 (0.75), PC18 (0.77), PC19 (0.79), PC20 (0.82). This list is not exhaustive, but on this basis I thought that including 20 factors would be enough for the factor analysis.
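For reference, the PCA step described above could look roughly like the sketch below. It runs on simulated data rather than the real GPL570 matrix, and the object names (`expr_t`, `pca`, `cum_var`) as well as the choice of `prcomp` with centring but no scaling are assumptions made for illustration, not necessarily the exact call used.

```r
# Simulated stand-in for the real data (assumption):
# 32 samples in rows, 500 probesets in columns, log2-like values.
set.seed(1)
n_samples <- 32
n_probes <- 500
expr_t <- matrix(
  rnorm(n_samples * n_probes, mean = 8, sd = 2),
  nrow = n_samples,
  dimnames = list(
    paste0("sample", seq_len(n_samples)),
    paste0("probe", seq_len(n_probes))
  )
)

# PCA with samples as observations: centre each probeset, no scaling.
pca <- prcomp(expr_t, center = TRUE, scale. = FALSE)

# Proportion of variance per PC and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Cumulative proportion for the first 20 PCs, as listed in the post.
print(round(cum_var[1:20], 2))
```

The same cumulative proportions are also reported by `summary(pca)$importance`.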
Is it now enough to perform factor analysis on the data frame as originally loaded (columns = samples, rows = probesets) to assess how each sample correlates with PC1, as per the following code? I would not know how to do this otherwise, since running the factor analysis on the transposed data set (used for the original PCA) would give me loadings for the probesets, not the samples.
PCA.fa <- factanal(GPL570_excluding_GSE18549X, factors = 20, rotation = "varimax")
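To make the question more concrete, here is a minimal sketch of one possible reading of "correlation between each sample and PC1". It assumes that the quantity of interest is the PC1 loading from a PCA in which samples are the variables (columns) and probesets are the observations (rows), computed on the sample-by-sample correlation matrix. That interpretation, the simulated data, and all object names (`expr`, `pca_s`, `pc1_loading`, `pc1_cor`) are assumptions for illustration; I have not confirmed that this is the procedure used in the paper.

```r
# Simulated stand-in (assumption): 500 probesets in rows, 32 samples in columns.
set.seed(1)
n_probes <- 500
n_samples <- 32

# A shared expression profile plus per-sample noise, so samples correlate.
signal <- rnorm(n_probes, mean = 8, sd = 2)
expr <- sapply(seq_len(n_samples), function(i) signal + rnorm(n_probes, sd = 0.5))
colnames(expr) <- paste0("sample", seq_len(n_samples))

# Replace the last sample with unrelated values to mimic a poor-quality array.
expr[, n_samples] <- rnorm(n_probes, mean = 8, sd = 2)

# PCA with samples as variables; scale. = TRUE makes this a PCA of the
# sample-by-sample correlation matrix.
pca_s <- prcomp(expr, center = TRUE, scale. = TRUE)

# Loading of each sample on PC1: eigenvector entry times the PC1 standard
# deviation. For standardised variables this equals the Pearson correlation
# between the sample and the PC1 scores.
pc1_loading <- pca_s$rotation[, 1] * pca_s$sdev[1]

# The same quantity computed directly as a correlation, as a cross-check.
pc1_cor <- cor(expr, pca_s$x[, 1])[, 1]

# The sign of a principal component is arbitrary, so flip both vectors
# if needed to make the typical loading positive.
if (sum(pc1_loading) < 0) {
  pc1_loading <- -pc1_loading
  pc1_cor <- -pc1_cor
}

# Samples sorted from lowest to highest correlation with PC1.
print(round(sort(pc1_loading), 3))
```

In this sketch no number of factors has to be chosen and no rotation is applied, which is a different model from `factanal` with `rotation = "varimax"`; rotated factor loadings would generally not equal correlations with the unrotated PC1.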
Thanks a lot in advance