Question: Gene Lists Using Principal Component Analysis In Microarray Gene Expression
Tonig wrote (7.8 years ago):

Dear all,

I'm a total newbie to PCA, so here is my question: I'm working with a list of genes from a microarray gene expression analysis; let's say I have the genes in rows and the sample names in columns. I ran a PCA in R using princomp in order to reduce the dimensionality of the gene list (approx. 400 genes). I understand that I should keep the components that explain the largest share of the total variance, which here are the first two. The problem arises when I have to choose the genes that contribute most to the variance of each component: can I use the scores for each gene? Should I pick these genes from the first component only, or from both components?

Thanks

R pca microarray • 16k views
modified 7.3 years ago by Janne Marie Laursen • written 7.8 years ago by Tonig

Just a note that even though PC1 captures the largest share of the variance, it is not always the most interesting biologically. Sometimes PC1 captures non-biologically-interesting features like technical artifacts, batch effects, and the like. Some caution is required in interpretation....
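
One way to act on this caution is to compare the PC1 scores against any known technical covariate. The sketch below uses simulated data and a made-up `batch` factor (both are assumptions for illustration, not part of the original thread) to show what a batch-driven PC1 can look like:

```r
set.seed(3)
n <- 12
batch <- factor(rep(c("A", "B"), each = n / 2))   # hypothetical batch labels
expr <- matrix(rnorm(n * 50), nrow = n,
               dimnames = list(paste0("sample", 1:n), paste0("gene", 1:50)))

# Simulated artifact: every gene is shifted upward in batch B.
expr[batch == "B", ] <- expr[batch == "B", ] + 3

pca <- prcomp(expr, center = TRUE)
pc1 <- pca$x[, "PC1"]   # PC1 score of each sample

# Mean PC1 score per batch; a clear split suggests PC1 tracks the batch.
pc1_by_batch <- tapply(pc1, batch, mean)
pc1_by_batch
```

If the per-batch means sit far apart, as they do in this simulation, the genes with the largest PC1 loadings may say more about the artifact than about the biology.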

written 7.8 years ago by Sean Davis
Janne Marie Laursen (Copenhagen, Denmark) wrote (7.8 years ago):

As far as I understand, you want to find the genes (p) that account for most of the variance between your samples (n). For that, you will have to look at the entries of your loading vectors.

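pca.object <- princomp(data.matrix)  # data.matrix is an [n p] matrix: n samples in rows, p genes in columns
# Note: princomp stops with an error unless there are more samples than genes (n > p)
pca.object$loadings  # Your loadings are here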

Then you can look at which genes have the most extreme loadings.
Each loading lies between -1 and 1, and the larger the absolute value of a gene's loading, the more that gene contributes to the principal component in question.
The scores on e.g. PC1, on the other hand, tell you how the samples differ with respect to the genes that have high loadings on PC1.
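
A minimal sketch of how such a gene ranking could be done, using simulated data with samples in rows and genes in columns. The toy matrix, the variable names, and in particular the combined PC1/PC2 score (squared loadings weighted by each component's share of the variance) are illustrative assumptions, not something proposed in this thread:

```r
# Toy data: 10 samples (rows) x 6 genes (columns); all values are simulated.
set.seed(1)
expr <- matrix(rnorm(10 * 6), nrow = 10,
               dimnames = list(paste0("sample", 1:10), paste0("gene", 1:6)))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)
loadings <- pca$rotation   # one row per gene, one column per component

# Rank genes by the absolute value of their loading on PC1 and on PC2.
top_pc1 <- names(sort(abs(loadings[, "PC1"]), decreasing = TRUE))
top_pc2 <- names(sort(abs(loadings[, "PC2"]), decreasing = TRUE))

# One possible way to use both components at once (an assumption):
# squared loadings weighted by the proportion of variance of PC1 and PC2.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
combined <- loadings[, 1:2]^2 %*% var_explained[1:2]
top_combined <- rownames(combined)[order(combined[, 1], decreasing = TRUE)]

head(top_pc1, 3)   # the three genes with the most extreme PC1 loadings
```

Ranking on the absolute value keeps genes with strongly negative loadings, which are just as influential as strongly positive ones. How many genes to keep from the top of each list is a judgment call.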

Just a side-note: Have you considered scaling your data-matrix?

EDIT:

You want to see which genes matter most for the differences between the samples, and therefore your samples should be in the rows and your genes should be in the columns. As far as I can see, you should not transpose your data matrix.

And by the way, in R, use the prcomp function instead of princomp (for numerical stability). prcomp also has the input option of centering and scaling, which you would like if the magnitude of the numbers in your matrix are not of a comparable size.

pca.object <- prcomp(data.matrix, center=TRUE, scale.=TRUE)  # PCA with centering and scaling (the argument is spelled scale., with a trailing dot)
pca.object$rotation  # The loadings are here
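
To make the layout concrete, here is a small sketch on simulated data. It assumes a hypothetical matrix that happens to be stored with genes in rows, in which case `t()` brings it into the samples-in-rows layout that `prcomp` expects; a matrix that already has samples in rows would be passed as is.

```r
# Hypothetical toy matrix stored with genes in rows and samples in columns.
set.seed(2)
genes_by_samples <- matrix(rnorm(8 * 5, mean = 8), nrow = 8,
                           dimnames = list(paste0("gene", 1:8), paste0("s", 1:5)))

# prcomp treats rows as observations, so samples must be in the rows.
samples_by_genes <- t(genes_by_samples)

pca <- prcomp(samples_by_genes, center = TRUE, scale. = TRUE)

dim(pca$rotation)   # loadings: one row per gene
dim(pca$x)          # scores: one row per sample

# Proportion of the total variance captured by each component.
summary(pca)$importance["Proportion of Variance", ]
```

A quick check of the row names of `pca$rotation` (genes) and `pca$x` (samples) is an easy way to confirm that the matrix went in the right way round.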
modified 7.8 years ago • written 7.8 years ago by Janne Marie Laursen

Many thanks, Janne! I'll try it that way. However, I don't know if I'm doing it the right way: must I transpose the data (genes in columns and samples in rows), or can I keep the layout I have?

written 7.8 years ago by Tonig

See my edit above :)

written 7.8 years ago by Janne Marie Laursen

Janne Marie,

Can you please elaborate on what you mean by:

"prcomp also has the input option of centering and scaling, which you would like if the magnitude of the numbers in your matrix are not of a comparable size."

I don't understand when the magnitudes of the numbers would not be of a comparable size. I am working with gene expression counts (RNA-seq), and each sample's gene counts have been normalized to the library size/sequencing depth... should I still perform this "centering" and "scaling"?

Thank you.

modified 4.4 years ago • written 4.4 years ago by gaelgarcia
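
As an illustration of what "not of a comparable size" can mean in practice, the sketch below simulates three genes, one of which varies on a much larger scale than the others. The data and gene names are made up, and it only shows what `scale.` does, not whether scaling is the right choice for a given dataset:

```r
set.seed(4)
n <- 20
expr <- cbind(geneA = rnorm(n, sd = 100),   # varies on a much larger scale
              geneB = rnorm(n, sd = 1),
              geneC = rnorm(n, sd = 1))

unscaled <- prcomp(expr, center = TRUE, scale. = FALSE)
scaled   <- prcomp(expr, center = TRUE, scale. = TRUE)

# Without scaling, PC1 is almost entirely geneA, simply because of its scale.
round(abs(unscaled$rotation[, "PC1"]), 3)

# With scaling, each gene is divided by its standard deviation first,
# so no gene can dominate through magnitude alone.
round(abs(scaled$rotation[, "PC1"]), 3)
scaled$scale   # the per-gene standard deviations that were divided out
```

Note that `prcomp` centers by default (`center = TRUE`) but does not scale by default (`scale. = FALSE`), so scaling has to be requested explicitly.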

For RNA-seq applications, you may need to apply a variance-stabilizing transformation. A simple one is to log-transform the counts. However, more robust ones exist. See, for example, the Bioconductor DESeq2 package vignette, which has a section on visualizing RNA-seq data.
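
A minimal base-R sketch of the simple log approach, on simulated counts. The pseudocount of 1 and the removal of constant genes are assumptions added here for illustration; the DESeq2 transformations mentioned above are not shown:

```r
set.seed(5)
# Simulated counts: 6 samples (rows) x 100 genes (columns), negative binomial.
counts <- matrix(rnbinom(6 * 100, mu = 200, size = 2), nrow = 6,
                 dimnames = list(paste0("sample", 1:6), paste0("gene", 1:100)))

# A pseudocount of 1 (an assumption) avoids log2(0) for zero counts.
log_counts <- log2(counts + 1)

# Drop constant genes: they carry no information for PCA and
# would cause an error if scale. = TRUE were used.
keep <- apply(log_counts, 2, var) > 0
pca <- prcomp(log_counts[, keep], center = TRUE)

# Share of the variance captured by the first two components.
summary(pca)$importance["Proportion of Variance", 1:2]
```

On raw counts, highly expressed genes tend to have the largest variances and would dominate the leading components; the log transform reduces that dependence of the variance on the mean.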

written 4.4 years ago by Sean Davis

Thank you, Sean. I am variance-stabilizing my RNA-seq data with DESeq2's rlog function. Is it still necessary to center and/or scale the normalized, variance-stabilized counts?

modified 4.2 years ago • written 4.4 years ago by gaelgarcia