Hi all,

I need to run PCA on a microarray dataset to estimate which features are the most relevant. How can I do it? Is it possible?

I tried with `scikit-learn`, but I was unable to come up with the relevant genes. I did it like this:

`from sklearn.decomposition import PCA`

`import numpy as np`

`# X is the matrix transposed (n samples on the rows, m features on the columns) - it is represented as a numpy array`

`X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])`

`# suppose we want the 2 most relevant features`

`nf = 2`

`pca = PCA(n_components=nf)`

`pca.fit(X)`

`X_proj = pca.transform(X)`

With `X`

`array([[-1, -2,  5,  1],`

`       [-3, -1,  1,  0],`

`       [-3, -2,  0,  2],`

`       [ 1,  1,  1,  3],`

`       [ 2,  1,  1,  4],`

`       [ 3,  2,  0,  5]])`

it returns `X_proj`

`array([[-2.9999967 ,  3.26498171],`

`       [-3.53939268, -1.18864266],`

`       [-2.77013188, -2.15637734],`

`       [ 1.67612209,  0.03059917],`

`       [ 2.87464655,  0.35674472],`

`       [ 4.75875261, -0.30730559]])`

How can I tell which features were selected? Is there another way to do it (in R, for example)?

Thanks
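For what it's worth, here is a minimal sketch of one way to see how the original features relate to the components, using the same `X` as above. It relies on the `components_` and `explained_variance_ratio_` attributes of a fitted `PCA` object. The names `loadings` and `ranking` are just illustrative, and ranking by absolute loading is only one possible heuristic, not a definitive feature selection method.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same toy matrix as in the question: samples on rows, features on columns
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

pca = PCA(n_components=2)
pca.fit(X)

# components_ has shape (n_components, n_features):
# one row per principal component, one weight (loading) per original feature
loadings = pca.components_

# For each component, order the feature indices by absolute loading, largest first
ranking = np.argsort(-np.abs(loadings), axis=1)

for i, order in enumerate(ranking):
    ratio = pca.explained_variance_ratio_[i]
    print(f"PC{i + 1} (explained variance ratio {ratio:.2f}): "
          f"features ordered by |loading| = {order.tolist()}")
```

Note that `X_proj` contains new axes built as weighted combinations of all the features, so PCA by itself does not pick a subset of genes; the loadings only indicate which features weigh most on each component.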

Hi Giovanni,

Your answer and Brent's are very useful. I did not understand PCA well before reading the PSU example.

Thanks

Just to add: what is the preferred way to correlate a feature (that is, a gene) with a principal component? Can I use Pearson correlation or cosine similarity? Or are there constraints for this data type? Thanks
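To make the question concrete, here is a rough sketch of both measures on the toy matrix from the first post. It computes the PCA scores with a plain SVD of the mean-centered data (which should match what `scikit-learn` does, up to the sign of each component). The names `scores`, `corr` and `cos` are only illustrative. One thing it shows: once the features are mean-centered, Pearson correlation and cosine similarity give the same numbers, so the choice between them only matters for uncentered data.

```python
import numpy as np

# Toy matrix from the question: samples on rows, features (genes) on columns
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# Center each feature, then obtain PCA scores from the SVD of the centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]  # sample coordinates on PC1 and PC2

# Pearson correlation between every feature and every component
corr = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1]
                  for k in range(scores.shape[1])]
                 for j in range(X.shape[1])])

# Cosine similarity between the centered features and the component scores
cos = (Xc.T @ scores) / (np.linalg.norm(Xc, axis=0)[:, None]
                         * np.linalg.norm(scores, axis=0)[None, :])

print(np.round(corr, 3))        # rows: features, columns: PC1, PC2
print(np.allclose(corr, cos))   # the two measures agree on centered data
```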