Question

Given microarray expressions, what is the best way to use PCA for reducing dimensionality in a dataset?

4.0 years ago

researcherfun • 0

I am currently working with the curatedOvarianData datasets in R. I have a matrix with the rows corresponding to patients and the columns corresponding to genes. The dimensions are 500 patients by 15,000 genes. The entries in the matrix are Affymetrix microarray gene expression values.

I would like to reduce the dimensionality of the 15,000 genes to something like 50 genes instead. My understanding is that PCA leaves us with principal components, which are no longer the original genes in question. Does this not destroy the notion of gene selection if we use PCA, since the PCs are no longer interpretable as genes?

Or is the dimensionality reduction in PCA for genes done by first getting the rotation matrix, then looking at which of the genes in PC1, PC2 (columns below) are the greatest in absolute value? I have attached a sample rotation matrix.

> pca$rotation[1:20,1:5]
                  PC1           PC2           PC3           PC4           PC5
A4GALT   0.0196827054  0.0219501821 -0.0119041508 -0.0125913799 -0.0300983052
AAAS    -0.0218551318 -0.0277876082 -0.0016994535 -0.0007223474  0.0041899425
AACS     0.0114888058 -0.0264566521  0.0135052707  0.0152150913 -0.0015691410
AAMP    -0.0311951167  0.0198613880  0.0073099544 -0.0050461228 -0.0051730502
AARS    -0.0340074461  0.0212069696  0.0099504761 -0.0033812802 -0.0009392291
AARS2    0.0139070432  0.0199618531  0.0134008680 -0.0273997456  0.0114918086
AATF     0.0213326559 -0.0098678266 -0.0210267057  0.0091445907  0.0096835751
ABCA2    0.0296887290 -0.0027064174  0.0100097167  0.0076224764 -0.0224561247
ABCA7    0.0131463131  0.0048875505  0.0118058253  0.0198648456 -0.0326402302
ABCB8   -0.0139091809  0.0259562948 -0.0047861865 -0.0050800853 -0.0064526022
ABCC10   0.0220138583  0.0023235200  0.0062769418 -0.0001750348  0.0178198326
ABCC12   0.0101936846 -0.0007018871  0.0081293549  0.0028895290 -0.0161180293
ABHD14B  0.0025368873 -0.0111178522 -0.0122933586 -0.0054849412  0.0256141796
ABHD2   -0.0043373912 -0.0052438089  0.0116903649 -0.0232808325  0.0116003430
ABLIM1   0.0097392619 -0.0180788647 -0.0263126770  0.0176361009  0.0056645424
ACAT2   -0.0006405139 -0.0049319713  0.0269504961 -0.0227320346  0.0178072176
ACKR2    0.0154337278  0.0038119061  0.0053536391  0.0270810357 -0.0127872709
ACLY    -0.0179387141  0.0002761814 -0.0009650902 -0.0284643875 -0.0076105308
ACP6     0.0008680205  0.0116181794 -0.0113021259  0.0058295288 -0.0358689309
ACPP    -0.0197297993 -0.0170817212  0.0037807080 -0.0472590092  0.0264044473

updated 4.0 years ago by Kevin Blighe 87k • written 4.0 years ago by researcherfun • 0

Can you comment on what your actual analysis goal is? Do you want to find patterns in the data? If so, what is the underlying question?

4.0 years ago by ATpoint 82k

All I want to do is to reduce the dimensionality of the data into a few (50 or so) genes that are predictive of a response variable. In other words, I'd like to reduce the large number of genes to something more manageable I can then put into machine learning methods for training and testing. Thank you!

4.0 years ago by researcherfun • 0

score 0 · Answer 1 · 2020-04-29

Hey,

You could start by reading some of my previous posts here on Biostars:

Also, we (more specifically, Aaron Lun) have addressed the issue of choosing an optimal number of PCs in our package, PCAtools:

4.1 Determine optimum number of PCs to retain

It looks like you have performed PCA via prcomp(), in which case rotation comprises the gene/variable loadings. You can technically go by those and select the ones with the highest absolute values over a certain number of PCs, if you wish. For example, if PCs 1-6 account for 90% of the explained variation in your dataset, then looking at the gene loadings across these 6 PCs would tell you which genes are most responsible for (i.e., are driving) this 90% explained variation.
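As a rough sketch of that idea, the code below picks the number of PCs needed to reach 90% cumulative explained variance and then ranks genes by their largest absolute loading across those PCs. The simulated matrix `expr`, the 90% cutoff, the choice of 50 genes, and the "maximum absolute loading" ranking rule are all assumptions made for illustration; other ways of combining loadings across PCs are equally plausible.

```r
# Simulated stand-in for the expression matrix (assumption for illustration):
# patients in rows, genes in columns, as in the question.
set.seed(1)
expr <- matrix(rnorm(50 * 200), nrow = 50,
               dimnames = list(paste0("patient", 1:50), paste0("gene", 1:200)))

pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of PCs whose cumulative explained variance reaches 90%.
n_pcs <- which(cum_var >= 0.90)[1]

# For each gene, take its largest absolute loading over the retained PCs,
# then keep the 50 genes with the highest values.
loadings <- pca$rotation[, seq_len(n_pcs), drop = FALSE]
max_abs_loading <- apply(abs(loadings), 1, max)
top_genes <- names(sort(max_abs_loading, decreasing = TRUE))[1:50]
```

Note that this ranks genes by how much they contribute to variation in the expression data, not by how well they predict a response, so the selected genes would still need to be checked against the outcome of interest.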

It is the loadings that I plot via PCAtools::plotloadings():

See 4.4 Determine the variables that drive variation among each PC.

In fact, if you go through the entire vignette for PCAtools, you'll see how I am also using microarray data and how I identify sample separation based on Oestrogen Receptor status along PC2. Then, via the loadings plot, I identify how the gene, ESR1, which encodes the alpha chain of the oestrogen receptor, is the primary driver of variation along PC2. A useful validation of the approach.

#####

Technically speaking, you can also transpose your input data to perform PCA 'the other way', whereby the PCs represent your genes, and samples comprise the loadings. In this case, you could simply take the first PCs that together pass, say, 90% cumulative explained variation and use these as input for downstream programs, including regression modeling.

This second approach may actually be more amenable to what you are aiming to do.
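A minimal sketch of this second approach, again on a simulated matrix: the data are transposed so that genes are in rows, and the per-patient loadings on the first PCs that reach 90% cumulative explained variance are kept as a reduced feature matrix. The object names (`expr`, `pca_t`, `features`) and the 90% cutoff are assumptions for illustration.

```r
# Simulated stand-in for the expression matrix (assumption for illustration):
# patients in rows, genes in columns, as in the question.
set.seed(1)
expr <- matrix(rnorm(50 * 200), nrow = 50,
               dimnames = list(paste0("patient", 1:50), paste0("gene", 1:200)))

# Transpose so that genes are rows and patients are columns.
pca_t <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Cumulative proportion of explained variance across PCs.
cum_var <- cumsum(pca_t$sdev^2) / sum(pca_t$sdev^2)

# Smallest number of PCs whose cumulative explained variance reaches 90%.
n_pcs <- which(cum_var >= 0.90)[1]

# After transposing, the rotation matrix has one row per patient, so its
# first n_pcs columns give one reduced feature vector per patient.
features <- pca_t$rotation[, seq_len(n_pcs), drop = FALSE]
dim(features)  # patients x retained PCs
```

The `features` matrix can then be combined with a response variable and passed to a regression or other model. With patients in rows, as in the question, the per-patient scores in `prcomp(expr)$x` can be used in much the same way without transposing.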

Kevin