I am currently working with the curatedOvarianData datasets in R. I have a matrix whose rows correspond to patients and whose columns correspond to genes; its dimensions are 500 patients by 15,000 genes. The entries are Affymetrix microarray gene expression values.
I would like to reduce the 15,000 genes to something like 50 genes instead. My understanding is that PCA gives us principal components, each of which is a linear combination of all the genes rather than one of the original genes. If so, doesn't using PCA defeat the purpose of gene selection, since the PCs can no longer be interpreted as individual genes?
Or is gene selection with PCA done by first getting the rotation matrix (the loadings), then picking the genes whose loadings on PC1, PC2, etc. (the columns below) are largest in absolute value? A sample of my rotation matrix is shown below.
> pca$rotation[1:20,1:5]
PC1 PC2 PC3 PC4 PC5
A4GALT 0.0196827054 0.0219501821 -0.0119041508 -0.0125913799 -0.0300983052
AAAS -0.0218551318 -0.0277876082 -0.0016994535 -0.0007223474 0.0041899425
AACS 0.0114888058 -0.0264566521 0.0135052707 0.0152150913 -0.0015691410
AAMP -0.0311951167 0.0198613880 0.0073099544 -0.0050461228 -0.0051730502
AARS -0.0340074461 0.0212069696 0.0099504761 -0.0033812802 -0.0009392291
AARS2 0.0139070432 0.0199618531 0.0134008680 -0.0273997456 0.0114918086
AATF 0.0213326559 -0.0098678266 -0.0210267057 0.0091445907 0.0096835751
ABCA2 0.0296887290 -0.0027064174 0.0100097167 0.0076224764 -0.0224561247
ABCA7 0.0131463131 0.0048875505 0.0118058253 0.0198648456 -0.0326402302
ABCB8 -0.0139091809 0.0259562948 -0.0047861865 -0.0050800853 -0.0064526022
ABCC10 0.0220138583 0.0023235200 0.0062769418 -0.0001750348 0.0178198326
ABCC12 0.0101936846 -0.0007018871 0.0081293549 0.0028895290 -0.0161180293
ABHD14B 0.0025368873 -0.0111178522 -0.0122933586 -0.0054849412 0.0256141796
ABHD2 -0.0043373912 -0.0052438089 0.0116903649 -0.0232808325 0.0116003430
ABLIM1 0.0097392619 -0.0180788647 -0.0263126770 0.0176361009 0.0056645424
ACAT2 -0.0006405139 -0.0049319713 0.0269504961 -0.0227320346 0.0178072176
ACKR2 0.0154337278 0.0038119061 0.0053536391 0.0270810357 -0.0127872709
ACLY -0.0179387141 0.0002761814 -0.0009650902 -0.0284643875 -0.0076105308
ACP6 0.0008680205 0.0116181794 -0.0113021259 0.0058295288 -0.0358689309
ACPP -0.0197297993 -0.0170817212 0.0037807080 -0.0472590092 0.0264044473
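To make the second idea concrete, here is a minimal sketch of the loading-based ranking I have in mind. It runs on simulated data rather than my real matrix: `expr`, the `gene1`...`gene200` names, the matrix size, and the cutoff `k` are placeholders made up for illustration, and I am assuming the rotation matrix comes from `prcomp`. Note that this ranking never looks at the response variable, which is part of why I am unsure it is the right approach.

# Simulated stand-in for the real expression matrix (patients x genes).
set.seed(1)
n_patients <- 50
n_genes <- 200
expr <- matrix(rnorm(n_patients * n_genes), nrow = n_patients,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# PCA on centered, unit-variance genes.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Rank genes by absolute loading on PC1 and keep the top k.
k <- 10
loadings_pc1 <- pca$rotation[, "PC1"]
top_genes <- names(sort(abs(loadings_pc1), decreasing = TRUE))[seq_len(k)]

# Original expression values restricted to the selected genes.
expr_reduced <- expr[, top_genes, drop = FALSE]
dim(expr_reduced)

The result `expr_reduced` still contains original genes (not PCs), so it stays interpretable, but the genes were chosen only by how much they contribute to PC1.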
Can you comment on what your actual analysis goal is? Do you want to find patterns in the data? If so, what is the underlying question?
All I want to do is reduce the data to a small set of genes (50 or so) that are predictive of a response variable. In other words, I'd like to cut the large number of genes down to something more manageable that I can then feed into machine learning methods for training and testing. Thank you!