Question: Which tool to use to run a PCA on a microarray dataset to select the most relevant features (preferably in Python)
fbrundu (European Union) wrote 5.4 years ago:

Hi all,

I need to run a PCA analysis on a microarray dataset to estimate which are the most relevant features. How can I do it? Is it possible?

I tried scikit-learn, but I was unable to come up with the relevant genes. I did it like this:

from sklearn.decomposition import PCA
import numpy as np

# X is the matrix transposed (n samples on the rows, m features on the columns) - it is represented as a numpy array
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2], [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]])

# suppose we want the 2 most relevant features
nf = 2
pca = PCA(n_components=nf)

# the model must be fitted before it can transform; fit_transform does both
X_proj = pca.fit_transform(X)


With X

array([[-1, -2,  5,  1],
       [-3, -1,  1,  0],
       [-3, -2,  0,  2],
       [ 1,  1,  1,  3],
       [ 2,  1,  1,  4],
       [ 3,  2,  0,  5]])

it returns X_proj

array([[-2.9999967 ,  3.26498171],
       [-3.53939268, -1.18864266],
       [-2.77013188, -2.15637734],
       [ 1.67612209,  0.03059917],
       [ 2.87464655,  0.35674472],
       [ 4.75875261, -0.30730559]])


How can I tell which features were selected? Is there another way to do it (in R, for example)?


Giovanni M Dall'Olio (London, UK) wrote 5.4 years ago:


Be careful: I think you are confusing the "components" of a PCA with the features of your dataset.

PCA doesn't return the most relevant "features" in the dataset. Instead, it returns two vectors (or more, depending on how many components you calculate) that give the coordinates of each sample along new axes, the principal components, which are chosen to capture as much of the variance in the data as possible. For a quick explanation of how PCA works, you can read this document or this article in Nature Biotechnology.
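
To make "coordinates of each sample" concrete, here is a minimal NumPy sketch of a PCA on the toy matrix from the question. It assumes the features are centred but not rescaled, and it is only an illustration of the idea, not necessarily what any particular library does internally. The names `axes`, `scores` and `explained` are chosen here for illustration.

```python
import numpy as np

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# Centre each feature, then take the SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
axes = Vt[:n_components]     # shape (2, 4): each row is a direction in feature space
scores = Xc @ axes.T         # shape (6, 2): coordinates of each sample on those axes

# Fraction of the total variance captured by each component.
explained = S ** 2 / np.sum(S ** 2)

print(scores)
print(explained[:n_components])
```

Each column of `scores` mixes all four original features, which is why a component cannot be read as a "selected feature". Up to the sign of each column, `scores` should correspond to the X_proj array shown in the question.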

If you want to determine how much each of your original variables contributes to the components, you can either look at the loadings of the PCA (I can't install scikit-learn right now, but they may be stored in the pca object), or calculate the correlation of each variable with each of the components, as described here. This should give you an idea of the "relevant" features: those that contribute the most to separating the samples.
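
A minimal sketch of both options on the toy matrix from the question, assuming scikit-learn is installed. The fitted object exposes the component weights as `components_`; these are unit-length vectors, and some texts reserve the word "loadings" for the same weights scaled by the square root of the explained variance. The gene labels below are hypothetical placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)
# Hypothetical gene labels, for illustration only.
genes = np.array(["gene_a", "gene_b", "gene_c", "gene_d"])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)        # shape (6, 2): sample coordinates

# Option 1: weight of each original feature in each component.
weights = pca.components_            # shape (2, 4)

# Option 2: Pearson correlation of each feature with each component's scores.
corr = np.array([[np.corrcoef(X[:, j], scores[:, k])[0, 1]
                  for k in range(scores.shape[1])]
                 for j in range(X.shape[1])])   # shape (4, 2)

# Rank the features by the absolute size of their weight on the first component.
order = np.argsort(-np.abs(weights[0]))
print(genes[order])
print(corr)
```

The two views are related: a feature's correlation with a component is its weight rescaled by the component's standard deviation over the feature's standard deviation, so the rankings can differ when features have very different variances.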

Another tool for doing PCA in Python is Orange, which gives you both a graphical interface (if you want to use it) and a Python library.


Hi Giovanni,

Your answer and Brent's are both very useful. I did not understand PCA well before reading the PSU example.


Reply by fbrundu, 5.4 years ago

Just to add: what is the preferred way to correlate a feature (that is, a gene) with a principal component? May I use Pearson correlation or cosine similarity? Or are there constraints for this data type? Thanks.

Reply by fbrundu, 5.4 years ago
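
On the Pearson versus cosine point, one fact that may help: Pearson correlation is cosine similarity computed on mean-centred vectors. PCA scores are already centred, so the two measures differ only when the gene's expression vector is left uncentred. A small NumPy check on the toy matrix from the question (the helper names `cosine` and `pearson` are illustrative, not from any library):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation, written as cosine similarity of centred vectors."""
    return cosine(a - a.mean(), b - b.mean())

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# Scores of the samples on the first principal component (mean zero).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

gene = X[:, 3]   # raw values of the fourth feature, which has a non-zero mean

r = pearson(gene, pc1)
cos_raw = cosine(gene, pc1)                      # gene left uncentred
cos_centred = cosine(gene - gene.mean(), pc1)    # gene centred first

print(r, cos_raw, cos_centred)
```

Here `cos_centred` equals `r`, while `cos_raw` is smaller in magnitude because the gene's mean inflates its norm without changing the dot product with the centred scores.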
brentp (Salt Lake City, UT) wrote 5.4 years ago:

You can do something like that with this tool:


You are asking which features were selected, but PCA doesn't work like that. You get the components with the most variance. You can then take those and correlate them with your clinical or lab data, which is what the above software does.
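
A rough sketch of that idea, assuming scikit-learn is installed. It is not the code of the tool mentioned above. The `covariate` values are made up purely for illustration and stand in for one clinical or lab measurement per sample.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix from the question: 6 samples (rows) x 4 features (columns).
X = np.array([[-1, -2, 5, 1], [-3, -1, 1, 0], [-3, -2, 0, 2],
              [1, 1, 1, 3], [2, 1, 1, 4], [3, 2, 0, 5]], dtype=float)

# Hypothetical per-sample measurement (made-up values, one per row of X).
covariate = np.array([0.2, 0.1, 0.3, 1.1, 1.4, 1.9])

# Coordinates of each sample on the two components with the most variance.
scores = PCA(n_components=2).fit_transform(X)

# Pearson correlation of each component with the covariate.
correlations = [float(np.corrcoef(scores[:, k], covariate)[0, 1])
                for k in range(scores.shape[1])]

for k, r in enumerate(correlations, start=1):
    print(f"PC{k} vs covariate: r = {r:.2f}")
```

The sign of each correlation is arbitrary, because the sign of a principal component is arbitrary; only the magnitude is meaningful here.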


Thanks for your answer, very useful. I will try it. 

Reply by fbrundu, 5.4 years ago

I read your code, but unfortunately I need to integrate a smaller module into my own code, so I will have to write a custom routine, sorry. When you say that you get the components with the most variance, can the components be considered pseudo-samples? Thanks anyway for your effort.

Reply by fbrundu, 5.4 years ago