Question: Which tool to use to run a PCA on a microarray dataset to select the most relevant features (preferably in Python)
1
4.9 years ago by
fbrundu280
European Union
fbrundu280 wrote:

Hi all,

I need to run a PCA analysis on a microarray dataset to estimate which are the most relevant features. How can I do it? Is it possible?

I tried with `scikit-learn`, but I was unable to come up with the relevant genes. I did it like this:

```
from sklearn.decomposition import PCA
import numpy as np

# X is the transposed matrix (n samples on the rows, m features on the columns),
# represented as a numpy array
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)
pca.fit(X)

X_proj = pca.transform(X)
```

With `X`

```
array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])
```

it returns `X_proj`

```
array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])
```

How can I tell which features were selected? Is there another way to do it (in R, for example)?

Thanks

5
4.9 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

Hi,

Be careful: I think you are confusing the concept of "components" in PCA.

PCA doesn't return the most relevant "features" in the dataset. Rather, it returns two vectors (or more, depending on how many components you calculate) holding the coordinates of each sample along new axes, the principal components, which are the directions along which the data varies the most. Each component is a combination of all the original features, not a subset of them. For a quick explanation of how PCA works, you can read this document or this article in Nature Biotechnology.
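To make "coordinates along the components" concrete, here is a minimal sketch on the toy matrix from the question. It assumes scikit-learn's `PCA` with default settings (no whitening), in which case the projected coordinates should be the centered data multiplied by the direction vectors stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)   # one row of coordinates per sample

# Each component is a direction in feature space: one weight per original feature
print(pca.components_.shape)    # (2, 4)

# The coordinates are the centered data projected onto those directions
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, X_proj))

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

Note that every component has a weight for all four features, which is why no features are "selected" by the projection itself.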

If you want to determine how much each of your original variables contributes to the components, you can either look at the loadings of the PCA (I can't install scikit right now, but they may be stored in the pca object), or you can calculate the correlation of each variable with each of the components, as described here. This should give you an idea of the "relevant" features, that is, the features that contribute the most to separating the samples along those components.
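Both options can be sketched as follows on the toy matrix from the question. In scikit-learn the per-feature weights of each component are available as `pca.components_`; the gene labels below are hypothetical, added only to make the output readable. The sign of a component is arbitrary, so it is the absolute values that matter when ranking.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)
feature_names = ["gene_a", "gene_b", "gene_c", "gene_d"]  # hypothetical labels

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # sample coordinates on each component

# Option 1: weights (loadings) of each feature on each component
for k, weights in enumerate(pca.components_):
    order = np.argsort(-np.abs(weights))   # largest absolute weight first
    ranked = [(feature_names[j], round(float(weights[j]), 3)) for j in order]
    print(f"PC{k + 1} weights, largest first:", ranked)

# Option 2: Pearson correlation of each feature with each component
corr = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1]
                  for k in range(scores.shape[1])]
                 for j in range(X.shape[1])])
print(corr.round(3))   # rows: features, columns: components
```

With real microarray data you would probably want to decide first whether to scale the genes, since features with larger variance dominate the weights otherwise.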

Another tool for doing PCA in Python is Orange, which gives you both a graphical interface (if you want to use it) and a Python library.

Hi Giovanni,

Your answer and Brent's are very useful. I did not understand PCA well before reading the PSU example.

Thanks

Just to add, what is the preferred way to correlate a feature (that is a gene) with a principal component? May I use Pearson or cosine similarity? Or are there constraints for this data type? Thanks
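One observation that may help with the Pearson versus cosine question: Pearson correlation is the cosine similarity of the two vectors after each has been mean-centered. Since the component scores are already centered, the two measures differ only when the gene's own mean is not removed. A small sketch on the toy matrix from the question (the helper `cosine` is defined here for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

scores = PCA(n_components=2).fit_transform(X)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

gene = X[:, 0]       # first feature across all samples
pc1 = scores[:, 0]   # sample coordinates on the first component

pearson = float(np.corrcoef(gene, pc1)[0, 1])
cos_raw = cosine(gene, pc1)                                    # gene not centered
cos_centered = cosine(gene - gene.mean(), pc1 - pc1.mean())    # both centered

print(pearson, cos_raw, cos_centered)
print(np.isclose(pearson, cos_centered))
```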

4
4.9 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

You can do something like that with this tool:

You are asking which features were selected, but it doesn't work like that. You get the components with the most variance. You can then take those and correlate them with your clinical or lab data, which is what the above software does.
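As a rough illustration of that workflow (not the tool's actual implementation), the sketch below computes the components on the toy matrix from the question and correlates each one with a clinical measurement. The `clinical` values are made up purely for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns)
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# HYPOTHETICAL clinical measurement, one value per sample (invented for illustration)
clinical = np.array([0.2, 0.1, 0.4, 1.1, 1.3, 1.8])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # sample coordinates on each component

# Correlate each component's scores with the clinical variable
rs = []
for k in range(scores.shape[1]):
    r = float(np.corrcoef(scores[:, k], clinical)[0, 1])
    rs.append(r)
    ratio = pca.explained_variance_ratio_[k]
    print(f"PC{k + 1}: explained variance ratio = {ratio:.2f}, "
          f"r with clinical variable = {r:.2f}")
```

A component that explains a lot of variance is not necessarily the one that tracks the clinical variable, so it is worth checking both numbers.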