Question

PCA on genes or samples?

2

Entering edit mode

21 months ago

ShowPow ▴ 20

Hi everybody ! I'm trying to use PCA over Rna-seq data just to understand PCA. I saw pcaExplorer and I did not quite understand what it actually does. I have a count matrix of 90 samples (healthy and cancer). I have then thousands of genes. I want to obtain the genes that differ my samples the most for further classification ( I know there are other techniques but I just wanted to exercise a bit, with PCA ). Thats what I have done :

Let z be the transpose of a count matrix, normalized with method TMM = samples as rows and genes as columns

z <- t(counts.tmm)
pca <- prcomp(z, center = TRUE,scale. = TRUE)
summary(pca)
pca.class <- sample.cond$class # here sample.cond$class will be HC or Pancreas (Tumor)
ggbiplot(pca,ellipse=TRUE,groups = pca.class,choices= c(1,2),var.axes=FALSE)
dev.off()

The result is the following :

Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6    PC7     PC8     PC9    PC10    PC11   PC12
Standard deviation     22.6979 8.01951 6.89380 5.85445 5.39476 5.30267 5.2139 5.00639 4.99198 4.96105 4.85406 4.7934
Proportion of Variance  0.2937 0.03667 0.02709 0.01954 0.01659 0.01603 0.0155 0.01429 0.01421 0.01403 0.01343 0.0131
Cumulative Proportion   0.2937 0.33039 0.35749 0.37703 0.39362 0.40965 0.4251 0.43944 0.45365 0.46768 0.48111 0.4942
                          PC13    PC14    PC15    PC16    PC17    PC18    PC19    PC20    PC21    PC22    PC23   PC24
Standard deviation     4.74638 4.69116 4.66121 4.62983 4.56994 4.55543 4.50351 4.46740 4.44247 4.36019 4.32935 4.3111
Proportion of Variance 0.01284 0.01255 0.01239 0.01222 0.01191 0.01183 0.01156 0.01138 0.01125 0.01084 0.01069 0.0106
Cumulative Proportion  0.50706 0.51960 0.53199 0.54421 0.55612 0.56795 0.57951 0.59089 0.60214 0.61298 0.62367 0.6343
                          PC25    PC26    PC27    PC28    PC29    PC30    PC31    PC32    PC33    PC34    PC35    PC36
Standard deviation     4.26250 4.21480 4.15619 4.14941 4.04881 4.01388 3.94854 3.93874 3.91634 3.86539 3.81318 3.79119
Proportion of Variance 0.01036 0.01013 0.00985 0.00982 0.00935 0.00919 0.00889 0.00884 0.00874 0.00852 0.00829 0.00819
Cumulative Proportion  0.64462 0.65475 0.66460 0.67441 0.68376 0.69295 0.70183 0.71068 0.71942 0.72794 0.73623 0.74443
                          PC37    PC38    PC39    PC40    PC41    PC42    PC43    PC44    PC45    PC46    PC47    PC48
Standard deviation     3.77294 3.73427 3.72646 3.62194 3.59297 3.56278 3.51904 3.42127 3.38418 3.37216 3.32199 3.30519
Proportion of Variance 0.00812 0.00795 0.00792 0.00748 0.00736 0.00724 0.00706 0.00667 0.00653 0.00648 0.00629 0.00623
Cumulative Proportion  0.75254 0.76049 0.76841 0.77589 0.78325 0.79049 0.79755 0.80422 0.81075 0.81723 0.82352 0.82975
                          PC49   PC50    PC51    PC52    PC53    PC54    PC55    PC56    PC57    PC58    PC59    PC60
Standard deviation     3.27842 3.2430 3.21268 3.18443 3.17152 3.09140 3.05199 3.04186 3.03361 2.99736 2.98039 2.94492
Proportion of Variance 0.00613 0.0060 0.00588 0.00578 0.00573 0.00545 0.00531 0.00528 0.00525 0.00512 0.00506 0.00494
Cumulative Proportion  0.83588 0.8419 0.84776 0.85354 0.85928 0.86472 0.87003 0.87531 0.88056 0.88568 0.89074 0.89569
                          PC61    PC62    PC63   PC64    PC65    PC66    PC67    PC68    PC69    PC70    PC71    PC72
Standard deviation     2.92100 2.90968 2.86032 2.8395 2.82278 2.79180 2.74172 2.73852 2.70586 2.68556 2.64009 2.60152
Proportion of Variance 0.00486 0.00483 0.00466 0.0046 0.00454 0.00444 0.00429 0.00428 0.00417 0.00411 0.00397 0.00386
Cumulative Proportion  0.90055 0.90538 0.91004 0.9146 0.91918 0.92363 0.92791 0.93219 0.93636 0.94047 0.94445 0.94831
                          PC73    PC74    PC75    PC76    PC77    PC78   PC79    PC80    PC81    PC82    PC83    PC84
Standard deviation     2.58597 2.57208 2.52419 2.49683 2.45577 2.43039 2.4041 2.36481 2.31208 2.28884 2.24954 2.20218
Proportion of Variance 0.00381 0.00377 0.00363 0.00355 0.00344 0.00337 0.0033 0.00319 0.00305 0.00299 0.00289 0.00276
Cumulative Proportion  0.95212 0.95589 0.95952 0.96308 0.96652 0.96988 0.9732 0.97637 0.97941 0.98240 0.98529 0.98805
                          PC85   PC86    PC87    PC88    PC89      PC90
Standard deviation     2.18536 2.1373 2.06398 1.94600 1.88876 7.458e-15
Proportion of Variance 0.00272 0.0026 0.00243 0.00216 0.00203 0.000e+00
Cumulative Proportion  0.99077 0.9934 0.99581 0.99797 1.00000 1.000e+00

And the image looks like this :

enter image description here

My Questions are :

1) Why the PC are the samples? Shouldn't they be the genes? If I do var.axes=TRUE it actually shows me the arrows that are the genes so is correct but I did not get this point.

2) How can I get the genes that make my samples differ by condition the best from the PCA that i have computed?

If I do not transpose the matrix :

z <- counts.tmm # samples are in the columns and genes as rows
pca <- prcomp(z, center = TRUE,scale. = FALSE)
summary(pca)
ggbiplot(pca,ellipse=TRUE,choices= c(1,2),var.axes=FALSE) 
dev.off()

thats what I get :

Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     13.0772 4.07904 2.49366 2.04542 1.82674 1.50391 1.46282 1.41308 1.39098 1.33582 1.32702 1.30732
Proportion of Variance  0.5983 0.05821 0.02175 0.01464 0.01167 0.00791 0.00749 0.00699 0.00677 0.00624 0.00616 0.00598
Cumulative Proportion   0.5983 0.65649 0.67825 0.69288 0.70456 0.71247 0.71996 0.72694 0.73371 0.73996 0.74612 0.75210
                          PC13    PC14    PC15    PC16    PC17    PC18    PC19    PC20    PC21    PC22    PC23    PC24
Standard deviation     1.29400 1.28199 1.26799 1.25235 1.24683 1.21349 1.19880 1.18708 1.18165 1.17765 1.16516 1.15687
Proportion of Variance 0.00586 0.00575 0.00562 0.00549 0.00544 0.00515 0.00503 0.00493 0.00488 0.00485 0.00475 0.00468
Cumulative Proportion  0.75795 0.76370 0.76933 0.77482 0.78025 0.78541 0.79043 0.79536 0.80025 0.80510 0.80985 0.81453
                          PC25    PC26    PC27   PC28    PC29    PC30    PC31    PC32    PC33    PC34    PC35    PC36
Standard deviation     1.15105 1.13993 1.12768 1.1218 1.11720 1.10388 1.09192 1.08421 1.07808 1.07271 1.06507 1.05828
Proportion of Variance 0.00464 0.00455 0.00445 0.0044 0.00437 0.00426 0.00417 0.00411 0.00407 0.00403 0.00397 0.00392
Cumulative Proportion  0.81917 0.82371 0.82816 0.8326 0.83693 0.84119 0.84537 0.84948 0.85354 0.85757 0.86154 0.86546
                          PC37   PC38    PC39    PC40   PC41    PC42    PC43    PC44    PC45    PC46   PC47    PC48    PC49
Standard deviation     1.05420 1.0415 1.03585 1.02542 1.0145 1.00792 0.99140 0.97666 0.97385 0.96267 0.9568 0.95232 0.94773
Proportion of Variance 0.00389 0.0038 0.00375 0.00368 0.0036 0.00355 0.00344 0.00334 0.00332 0.00324 0.0032 0.00317 0.00314
Cumulative Proportion  0.86935 0.8731 0.87689 0.88057 0.8842 0.88773 0.89117 0.89450 0.89782 0.90106 0.9043 0.90744 0.91058
                         PC50    PC51    PC52    PC53    PC54    PC55    PC56    PC57    PC58    PC59    PC60    PC61
Standard deviation     0.9416 0.93214 0.92504 0.91419 0.91151 0.89366 0.88537 0.88206 0.87722 0.86586 0.86438 0.85371
Proportion of Variance 0.0031 0.00304 0.00299 0.00292 0.00291 0.00279 0.00274 0.00272 0.00269 0.00262 0.00261 0.00255
Cumulative Proportion  0.9137 0.91672 0.91972 0.92264 0.92555 0.92834 0.93108 0.93381 0.93650 0.93912 0.94173 0.94428
                         PC62    PC63    PC64    PC65    PC66    PC67   PC68    PC69    PC70    PC71    PC72    PC73
Standard deviation     0.8449 0.84420 0.83856 0.83176 0.82325 0.80923 0.7924 0.78655 0.78025 0.77028 0.76573 0.75772
Proportion of Variance 0.0025 0.00249 0.00246 0.00242 0.00237 0.00229 0.0022 0.00216 0.00213 0.00208 0.00205 0.00201
Cumulative Proportion  0.9468 0.94928 0.95174 0.95416 0.95653 0.95882 0.9610 0.96318 0.96531 0.96738 0.96944 0.97144
                          PC74    PC75    PC76    PC77    PC78    PC79    PC80    PC81   PC82    PC83   PC84    PC85
Standard deviation     0.75455 0.74868 0.74169 0.73250 0.73053 0.72453 0.71426 0.70595 0.6973 0.68962 0.6761 0.66666
Proportion of Variance 0.00199 0.00196 0.00192 0.00188 0.00187 0.00184 0.00178 0.00174 0.0017 0.00166 0.0016 0.00155
Cumulative Proportion  0.97344 0.97540 0.97732 0.97920 0.98107 0.98290 0.98469 0.98643 0.9881 0.98980 0.9914 0.99295
                          PC86   PC87    PC88    PC89    PC90
Standard deviation     0.66129 0.6555 0.64137 0.61824 0.59554
Proportion of Variance 0.00153 0.0015 0.00144 0.00134 0.00124
Cumulative Proportion  0.99448 0.9960 0.99742 0.99876 1.00000

enter image description here

But the list of PC is always 90. Now here it makes sense that they are 90 (samples) but before I don't get why is 90 when the genes are thousands.

4) Last question , for my purpose (obtain genes that shows more difference between my samples), is it more suitable the first (PC as genes) or the second (PC as samples)?

From pcaExplorer I have seen that they relate this last PCA (samples as columns and genes as rows) with the study of the Heatmap so I assume that this last one is the one to use for gene view? But I don't get why.

PCA R plot edgeR • 849 views

ADD COMMENT • link updated 21 months ago by Jesse ▴ 740 • written 21 months ago by ShowPow ▴ 20

score 4 · Accepted Answer · 2022-07-28

Not an expert by any means, but I have a couple ideas for you.

About your data and the genes-vs-samples thing, I think you're right that you want counts per gene being the components, not counts per sample. But I think you already have that when your samples are on rows-- it's just hard to see in the output, especially because the number of components you see there is throwing you off. You can look at pca$x to see the actual transformed data.

Here's a stupid example with ten samples and three "measurements" where there's really only one dimension.

> x <- matrix(1:30, ncol=3)
> x
      [,1] [,2] [,3]
 [1,]    1   11   21
 [2,]    2   12   22
 [3,]    3   13   23
 [4,]    4   14   24
 [5,]    5   15   25
 [6,]    6   16   26
 [7,]    7   17   27
 [8,]    8   18   28
 [9,]    9   19   29
[10,]   10   20   30
> pca <- prcomp(x)
> summary(pca)
Importance of components:
                         PC1      PC2       PC3
Standard deviation     5.244 3.55e-16 6.418e-32
Proportion of Variance 1.000 0.00e+00 0.000e+00
Cumulative Proportion  1.000 1.00e+00 1.000e+00

It rotated it so PC1 has all the variation (and the others essentially none, since this only requires one dimension to fully explain). If you print the object you see that rotation:

> pca
Standard deviations (1, .., p=3):
[1] 5.244044e+00 3.549628e-16 6.417917e-32

Rotation (n x k) = (3 x 3):
           PC1        PC2        PC3
[1,] 0.5773503 -0.8164966  0.0000000
[2,] 0.5773503  0.4082483 -0.7071068
[3,] 0.5773503  0.4082483  0.7071068

And then pca$x contains the same ten points, with the rotation applied, where they all lie along that line:

> pca$x
             PC1           PC2           PC3
 [1,] -7.7942286  2.220446e-15 -4.440892e-16
 [2,] -6.0621778  1.776357e-15 -4.440892e-16
 [3,] -4.3301270  1.110223e-15 -2.220446e-16
 [4,] -2.5980762  7.771561e-16 -2.220446e-16
 [5,] -0.8660254  2.498002e-16 -5.551115e-17
 [6,]  0.8660254 -2.498002e-16  5.551115e-17
 [7,]  2.5980762 -7.771561e-16  2.220446e-16
 [8,]  4.3301270 -1.110223e-15  2.220446e-16
 [9,]  6.0621778 -1.776357e-15  4.440892e-16
[10,]  7.7942286 -2.220446e-15  4.440892e-16

If you give input that's wider than it is long (more genes than samples), you should see that the number of rows of the output is still what you expect, and it's just your columns that are fewer than you might expect.

> x2 <- matrix(1:50, ncol=10)
> x2
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    6   11   16   21   26   31   36   41    46
[2,]    2    7   12   17   22   27   32   37   42    47
[3,]    3    8   13   18   23   28   33   38   43    48
[4,]    4    9   14   19   24   29   34   39   44    49
[5,]    5   10   15   20   25   30   35   40   45    50
> pca2 <- prcomp(x2)
> pca2$x
           PC1           PC2           PC3           PC4           PC5
[1,] -6.324555 -2.498002e-15  4.440892e-16  4.440892e-16  4.440892e-16
[2,] -3.162278 -1.249001e-15  2.220446e-16  2.220446e-16  2.220446e-16
[3,]  0.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
[4,]  3.162278  1.249001e-15 -2.220446e-16 -2.220446e-16 -2.220446e-16
[5,]  6.324555  2.498002e-15 -4.440892e-16 -4.440892e-16 -4.440892e-16

Thinking it through with a very small number of samples and dimensions to make my brain hurt less, this makes sense. If you had just two points, you would never need more than two axes (actually, one, right?) to fully define the difference between them. So you'll never need more components in the output than you have samples going in.

So long story short, give your samples on rows and genes on columns.

P.S. Be careful juggling different matrices and data frames and such. ggbiplot would have complained about the transposed case, except that happens that the length of the groups happened to match up. (I usually aim for putting more into data frames wherever possible and then subsetting/slicing them as needed, just because of how often that sort of thing comes up in R.)