Hi all,

I need to run PCA on a microarray dataset to identify the most relevant features. How can I do it? Is it possible?

I tried with `scikit-learn`, but I was unable to come up with the relevant genes. This is what I did:

```
from sklearn.decomposition import PCA
import numpy as np
# X is the transposed matrix as a NumPy array: n samples in rows, m features (genes) in columns
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])
# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)
pca.fit(X)
X_proj = pca.transform(X)
```

With `X`

```
array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])
```

it returns `X_proj`

```
array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])
```

How can I tell which features were selected? Is there another way to do this (in R, for example)?

Thanks
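
For anyone reading along: PCA does not pick a subset of the original features. Each component is a weighted combination of all of them. A minimal sketch of how one might inspect those weights with the same toy `X`, assuming that ranking features by absolute weight is an acceptable proxy for "relevance" (the names `loadings` and `ranking` are just illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Same toy matrix as in the question: 6 samples (rows) x 4 features (columns)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

pca = PCA(n_components=2)
pca.fit(X)

# components_ has shape (n_components, n_features):
# one row of per-feature weights for each principal component
loadings = pca.components_

# For each component, order the feature (column) indices by absolute weight,
# largest first. This ranking criterion is an assumption, not a PCA output.
ranking = np.argsort(-np.abs(loadings), axis=1)

for i, order in enumerate(ranking):
    print(f"PC{i + 1}: feature indices by |weight| = {order.tolist()}")

# Fraction of the total variance captured by each component
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The feature indices at the front of each list contribute most to that component, and `explained_variance_ratio_` indicates how much weight to give each component when interpreting them.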

Hi Giovanni,

Your answer and Brent's are very useful. I did not understand PCA well before reading the PSU example.

Thanks

Just to add, what is the preferred way to correlate a feature (that is, a gene) with a principal component? Can I use Pearson or cosine similarity? Or are there constraints for this data type? Thanks
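
To make the question concrete, here is a rough sketch of the Pearson option, computed between each gene (column of `X`) and each component's scores on the toy data from the original post. The helper name `pearson_feature_vs_pc` is made up for illustration. Note that Pearson correlation is the same as cosine similarity once both vectors are mean-centered, so the two options only differ if the gene values are left uncentered.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data from the original post: 6 samples x 4 features (genes)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # shape (n_samples, n_components)


def pearson_feature_vs_pc(X, scores):
    """Pearson correlation of every feature with every component score.

    Returns an array of shape (n_features, n_components).
    Hypothetical helper; assumes no feature is constant across samples.
    """
    Xc = X - X.mean(axis=0)            # center each feature
    Sc = scores - scores.mean(axis=0)  # scores are already centered by PCA
    cov = Xc.T @ Sc                    # unnormalized covariances
    norms = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Sc, axis=0))
    return cov / norms                 # cosine of centered vectors = Pearson r


corr = pearson_feature_vs_pc(X, scores)
print(np.round(corr, 3))
```

Each row of `corr` corresponds to a gene and each column to a component. Whether Pearson is appropriate for a given microarray dataset (for example, with strong outliers) is a separate question that this sketch does not address.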