Hi all,
I need to run PCA on a microarray dataset to estimate which features are the most relevant. How can I do it? Is it possible?
I tried with scikit-learn, but I was unable to come up with the relevant genes. I did it like this:
from sklearn.decomposition import PCA
import numpy as np
# X is the matrix transposed (n samples on the rows, m features on the columns) - it is represented as a numpy array
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])
# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)
pca.fit(X)
X_proj = pca.transform(X) 
With X
array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])
it returns X_proj
array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])
How can I tell which features were selected? Is there another way to do it (in R, for example)?
Thanks
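A note on the question above: PCA does not select features. Each component is a weighted combination of all the original features, so X_proj contains new coordinates rather than a subset of columns. A rough sketch of one way to see which features weigh most on each component is to inspect the fitted model's components_ attribute. The feature_names labels below are hypothetical placeholders for real gene identifiers, and ranking by absolute weight is only one possible heuristic.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])
# Hypothetical labels standing in for real gene identifiers.
feature_names = np.array(["gene_a", "gene_b", "gene_c", "gene_d"])

pca = PCA(n_components=2)
pca.fit(X)

# components_ has shape (n_components, n_features): each row holds the
# weights of the original features on one principal component.
loadings = pca.components_

# For each component, rank the features by absolute weight, largest first.
ranked = {}
for i, weights in enumerate(loadings):
    order = np.argsort(-np.abs(weights))
    ranked[i] = feature_names[order].tolist()
    share = pca.explained_variance_ratio_[i]
    print(f"PC{i + 1} ({share:.1%} of variance): {ranked[i]}")
```

Features with weights near zero on the leading components contribute little to the main directions of variation. Whether that makes them irrelevant depends on the biological question.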
Hi Giovanni,
Your answer and Brent's are very useful. I did not understand PCA well before reading the PSU example.
Thanks
Just to add: what is the preferred way to correlate a feature (that is, a gene) with a principal component? Can I use Pearson correlation or cosine similarity? Or are there constraints for this data type? Thanks
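As a minimal sketch of both options, assuming the same toy matrix X as in the original post: the Pearson correlation between each feature and each component's scores can be computed with np.corrcoef. PCA scores have zero mean, so once the features are mean-centered the cosine similarity gives the same numbers. The variable names below are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # shape (n_samples, n_components)
n_features = X.shape[1]

# np.corrcoef treats each row as a variable, so pass features and scores
# transposed and keep the feature-by-component block of the result.
pearson = np.corrcoef(X.T, scores.T)[:n_features, n_features:]

# Cosine similarity between mean-centered features and component scores.
Xc = X - X.mean(axis=0)
norms = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(scores, axis=0))
cosine = (Xc.T @ scores) / norms

# pearson[j, k] relates feature j to component k + 1.
print(np.round(pearson, 3))
print(np.allclose(pearson, cosine))
```

Without centering, the cosine similarity would also reflect each feature's mean level, so the two measures would generally differ.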