PCA with respect to observations and not to features in R

4

Entering edit mode

8.3 years ago

arronar ▴ 290

Hi.

First of all i don't know if the title of the post is quite right but i couldn't though how to express it.

So let's assume that i have this table with expression results from two experiments (Control (10x) and Cancer (13x))

| Ge/treat |   Control_1   | Control_2 |  Cancer_1 | Cancer_2 | Cancer_3 | 
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1    |       2.65    |    3.01   |   2.20    |  3.65    |   4.01   |
| gene2    |       1.54    |    1.27   |   2.01    |  2.65    |   3.11   |
| gene3    |       1.34    |    1.00   |   2.50    |  1.65    |   2.01   |

and i want to run a PCA analysis on them. TIll now every tutorial i read is having as first step the transposition of the array in order for columns to become rows and rows columns. By following them, I'm getting a plot where dots are having the column names (Control_1 , Control_2 , Cancer_1 etc.) for labels while the eigenvector are represented from gene names (gene1,gene2 gene3 etc..).

What i actually want to do is the opposite one. I want Control_1 , Control_2 , Cancer_1, Cancer_2 and Cancer_3 to be my eigenvectors and dots to be the expression values of the genes. In that way i want to see if for example expressions of some genes in Cancer mode are grouped together. After trying many different ideas finally i couldn't figure out how to achieve that.

Here I post also, the code I used to produce the first plot that i described

# transpose the data frame
pcaData = as.data.frame(t(pcaData))
# add new column with the type of experiment (Control ,Cancer)
pcaData["type"] = c(rep("Control",10),rep("Cancer",13))

autoplot(prcomp(pcaData[,1:23]), 
         data = pcaData, 
         colour = 'type', 
         label = TRUE, 
         label.size = 3,
         loadings.label = TRUE, 
         loadings.label.size = 3
        )

So how can i compute the opposite PCA ? Is the transposition of the matrix needed or not ? Any idea,hint or resource on how to approach such a target will be very helpful.

Thank you.

pca R plot • 8.0k views

ADD COMMENT • link updated 4.3 years ago by benoit.tessoulin ▴ 30 • written 8.3 years ago by arronar ▴ 290

0

Entering edit mode

Why don't you use row hierarchical clustering if you are only interested in seeing the cluster of genes that discriminate Control Vs Cancer? PC component analysis is usually done for reducing the dimensions of a multi-dimensional problem and then project it in 2 principal axes to see the effect. For example, if you have 1000 genes and 10 samples, it is difficult to visualize how the samples differ according to all 1000 genes, but if project along 2-PC-axes where the variation is maximum, you might clearly see how they vary. And you will have 10 points on the PC-place corresponding to 10-samples. Now if you do "opposite" PC, you will get 1000 points of genes; that I don't know solves or complicates your problem even further!

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

Yes you are right that if you have 18 thousand genes there must be a mess. But what if you have only 100 or 200 genes after a differential expression analysis ? I think that this would be a more clear plot. Anyway. The think is if such a plot is possible to be created.

ADD REPLY • link 8.3 years ago by arronar ▴ 290

0

Entering edit mode

I don't see any problem in creating such a plot. Conceptually, and following my earlier analogy, you are trying to plot 1000 points in a 10-dimensional space (instead of 10-points in 1000 dimensional space). Although how much PCA can resolve the difference among these 1000 points (ie. the difference explained by each principal component) has to be checked. My guess is that it will resolve very little difference. And I'll still suggest hierarchical clustering of genes.

Ok, now coming to your problem: You can follow exactly this https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

You don't need to transpose the matrix. Try to see the similarity between the iris data plotted there and your own data.

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

I promise to try the hierarchical clustering :-). As for this specific link i want to tell you that i have already saw it. The difference with that dataset and mine is that iris dataset doesn't seem to have any replication for its samples and also has a last column called 'species' that is used to distinguish the groups later with the color. I don't have such column. So the problem is how to create such a data frame (like iris) with my dataset.

ADD REPLY • link 8.3 years ago by arronar ▴ 290

0

Entering edit mode

Hmm, i see your problem now.. So the name of your Cancer / Control should be rownames.

rownames(pcaData) = c(rep("C",10),rep("Cr",13))

Obviously, the genes are in column and Cancer / Caontrol are in rows for above to work.

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

By saying

the genes are in column and Cancer / Caontrol are in rows for above to work

doesn't mean that a transposition of the initial data frame is needed ? Anyway. To be clear i got a little bit more confused now. Either you did what I have already done and posted at the initial post or you did something that i couldn't understand. So if you have time and you want please, post a more completed code. And thanks a lot for this conversation :-)

ADD REPLY • link 8.3 years ago by arronar ▴ 290

0

Entering edit mode

yes, you are right, I'm messing up everything :) I'll post complete code once my head is free a bit, as I think I am getting the idea what you want.

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

6

Entering edit mode

8.3 years ago

Santosh Anand 5.8k

As promised, here is the code

	set.seed(100)

	d = data.frame(matrix(runif(150), nrow=30, ncol=5)) # 30 x 5 data from uniform distribution
	colnames(d) = c("co1", "co2", "ca1", "ca2", "ca3") # 2 Controls / 3 Cancers
	rownames(d) = paste0("g", 1:30) # gene names

	## Case1: Principal componenets assuming samples (ca / co) as variables; The genes will get plotted on the PC-axes
	## I've removed the loading levels for a clear picture.
	head(d)
	autoplot(prcomp(d),
	data = d,
	label = TRUE,
	label.size = 4
	)

	## Case2: Principal componenets assuming genes as variables; The samples will get plotted on the PC-axes
	dt = as.data.frame(t(d))
	dt["type"] = c("co", "co", "ca", "ca", "ca")
	head(dt)

	autoplot(prcomp(dt[,1:30]),
	data = dt,
	colour = 'type',
	label = TRUE,
	label.size = 4
	)

view raw biostars_244570.R hosted with ❤ by GitHub

Case1: Principal componenets assuming samples (ca / co) as variables; The genes will get plotted on the PC-axes

> head(d)
          co1       co2       ca1       ca2        ca3
g1 0.30776611 0.4883060 0.4637012 0.9069544 0.39666235
g2 0.25767250 0.9285051 0.6471012 0.2098168 0.39270226
g3 0.55232243 0.3486920 0.9605731 0.3581380 0.47255689
g4 0.05638315 0.9541577 0.6763982 0.4482991 0.58365691
g5 0.46854928 0.6952741 0.4451480 0.9064264 0.35238717

enter image description here

In this case, you can't color them according to the samples (ca/co), as there is no natural grouping of genes according to samples : all genes belong to all samples. On other hand, you can define a different groupinng on genes (eg, different pathways), and color on that grouping.

Case2: Principal componenets assuming genes as variables; The samples will get plotted on the PC-axes

> head(dt)
           g1        g2        g3         g4        g5         g6
co1 0.3077661 0.2576725 0.5523224 0.05638315 0.4685493 0.48377074
co2 0.4883060 0.9285051 0.3486920 0.95415771 0.6952741 0.88945354
ca1 0.4637012 0.6471012 0.9605731 0.67639817 0.4451480 0.35777378
ca2 0.9069544 0.2098168 0.3581380 0.44829914 0.9064264 0.38943930
ca3 0.3966623 0.3927023 0.4725569 0.58365691 0.3523872 0.02847401
           g7        g8         g9       g10       g11       g12
co1 0.8124026 0.3703205 0.54655860 0.1702621 0.6249965 0.8821655
co2 0.1804072 0.6293909 0.98956414 0.1302889 0.3306605 0.8651205
ca1 0.4557315 0.4454140 0.24509259 0.6943507 0.4122370 0.3277259
ca2 0.5174597 0.1252391 0.03014575 0.7718055 0.3274151 0.3894787
ca3 0.9950879 0.9575585 0.55165118 0.1019540 0.2379151 0.8600977
           g13       g14       g15        g16       g17        g18
co1 0.28035384 0.3984879 0.7625511 0.66902171 0.2046122 0.35752485
co2 0.77758444 0.8273034 0.6033244 0.49123182 0.7803585 0.88422703
ca1 0.57256477 0.9669991 0.6617790 0.62469772 0.8566530 0.77477889
ca2 0.04105275 0.3613966 0.5709781 0.68488024 0.9711167 0.70195881
ca3 0.73826129 0.4973123 0.5799211 0.01618339 0.4721299 0.04238973
          g19        g20       g21        g22        g23       g24
co1 0.3594751 0.69029053 0.5358112 0.71080385 0.53834870 0.7489722
co2 0.2077139 0.30708590 0.3305298 0.19867907 0.23569430 0.2748867
ca1 0.8340271 0.09151028 0.4595255 0.59939816 0.91972191 0.9828241
ca2 0.0115455 0.53553213 0.8365715 0.80686674 0.08093264 0.2389093
ca3 0.4631544 0.62999477 0.6732493 0.08703467 0.14358380 0.9084785
           g25        g26       g27        g28       g29       g30 type
co1 0.42010145 0.17142021 0.7703016 0.88195359 0.5490967 0.2777238   co
co2 0.59132105 0.25339065 0.1234872 0.22990589 0.5975753 0.2114086   co
ca1 0.03780258 0.57793740 0.7333142 0.24874240 0.3007365 0.7334667   ca
ca2 0.96598986 0.03767233 0.9164064 0.72622658 0.2007762 0.8402007   ca
ca3 0.12253259 0.72886279 0.9503772 0.04335591 0.0195993 0.1990725   ca

2nd PCA

In this case, the samples belong to natural grouping of control / cancer. So you can color on them using the type variable.

Related Question: Why do we transpose the data when doing PCA?

Usually, the convention is that the columns are variables and rows are observations (I know that at least SPSS follows this convention very stricty). For each observation (row), we are measuring the value of all variables (cols). For example, let's see the mpg data

data(mpg)
> head(mpg[,1:8]) # looking only first 8 vars for better formatting
# A tibble: 6 × 8
  manufacturer model displ  year   cyl      trans   drv   cty
         <chr> <chr> <dbl> <int> <int>      <chr> <chr> <int>
1         audi    a4   1.8  1999     4   auto(l5)     f    18
2         audi    a4   1.8  1999     4 manual(m5)     f    21
3         audi    a4   2.0  2008     4 manual(m6)     f    20
4         audi    a4   2.0  2008     4   auto(av)     f    21
5         audi    a4   2.8  1999     6   auto(l5)     f    16

There are 5 observations (5 different autos) here, for which all the variables like model, year etc is being measured.

Now, the way biological data are presented is that the observations are usually in columns (different samples/patients), and the variables are in rows (genes). This is a convenient representation because we usually have much more variables (genes) than observations (samples). But this necessitates transposing the data, if we want the variables to be in natural form (i.e genes in columns / observations or samples in rows.)

ADD COMMENT • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

I'm unable to embed the 2nd PCA plot - please follow the link instead. Also reached the max charcters limit (didn't know that it exist !!!). So some of the rows have been removed from head output.

ADD REPLY • link 8.3 years ago by Santosh Anand 5.8k

0

Entering edit mode

@Santosh, thank you. This certainly among the best answers I've read on Biostars.

I'd like to add a tiny extension for transposition of tibbles originally from stackoverflow, assuming genenames being the name of the first column:

dt <- d %>% 
  gather(var, val, 2:ncol(d)) %>%
  spread(genenames, val)

ADD REPLY • link 5.1 years ago by Carambakaracho ★ 3.3k

3

Entering edit mode

8.3 years ago

benoit.tessoulin ▴ 30

Hi,

just to let you know, there is a very nice R package "FactoMineR" which produces all in one of what you need with axis statistics. You can either display "Individual Plot" : your samples, or "Variables": your genes, or both together. What's more it endles PCA, MCA, MFA, FAMD, hierarchical clustering, etc..

It also has extensive ways to impute missing data, I use it everyday, very robust. FactoMineR

ADD COMMENT • link 4.3 years ago by benoit.tessoulin ▴ 30

Login before adding your answer.