and I want to run a PCA on them. Every tutorial I have read so far starts by transposing the array so that columns become rows and rows become columns. Following them, I get a plot where the dots are labelled with the column names (Control_1, Control_2, Cancer_1, etc.) while the eigenvectors are labelled with the gene names (gene1, gene2, gene3, etc.).
What I actually want is the opposite. I want Control_1, Control_2, Cancer_1, Cancer_2 and Cancer_3 to be my eigenvectors and the dots to be the expression values of the genes. That way I could see whether, for example, the expression values of some genes group together in the Cancer samples. I have tried many different ideas but couldn't figure out how to achieve that.
Here is the code I used to produce the first plot I described:
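library(ggfortify)  # provides autoplot() for prcomp objects

# transpose the data frame so that samples are rows and genes are columns
pcaData = as.data.frame(t(pcaData))
# add new column with the type of experiment (Control, Cancer)
pcaData["type"] = c(rep("Control", 10), rep("Cancer", 13))
# PCA on columns 1:23; the "type" column is only used for colouring
autoplot(prcomp(pcaData[, 1:23]),
         data = pcaData,
         colour = 'type',
         label = TRUE,
         label.size = 3,
         loadings.label = TRUE,
         loadings.label.size = 3
)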
So how can I compute the opposite PCA? Is transposing the matrix needed or not? Any idea, hint or resource on how to approach this would be very helpful.
In this case, you can't color them according to the samples (cancer/control), as there is no natural grouping of genes according to samples: all genes belong to all samples. On the other hand, you can define a different grouping on the genes (e.g. different pathways) and color by that grouping.
Case 2: Principal components assuming genes as variables; the samples will get plotted on the PC axes
In this case, the samples belong to natural grouping of control / cancer. So you can color on them using the type variable.
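To make the two cases concrete, here is a minimal sketch in base R. The matrix `expr` is simulated, and its layout (genes in rows, samples in columns) is an assumption about how your data looks, so only the orientation matters here, not the numbers.

```r
# Simulated expression matrix (hypothetical data): genes in rows, samples in columns.
set.seed(1)
n_genes <- 50
samples <- c(paste0("Control_", 1:2), paste0("Cancer_", 1:3))
expr <- matrix(rnorm(n_genes * length(samples)), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), samples))
# Shift the first 10 genes upwards in the Cancer samples only.
expr[1:10, 3:5] <- expr[1:10, 3:5] + 3

# Case 1: samples are the variables, so NO transposition. Each point is a gene.
pca_genes <- prcomp(expr, center = TRUE, scale. = TRUE)
dim(pca_genes$x)         # 50 x 5: one row of scores per gene
dim(pca_genes$rotation)  # 5 x 5: one row of loadings per sample

# Case 2: genes are the variables, so transpose. Each point is a sample.
pca_samples <- prcomp(t(expr), center = TRUE)
dim(pca_samples$x)         # 5 x 5: one row of scores per sample
dim(pca_samples$rotation)  # 50 x 5: one row of loadings per gene

# Base-R biplot of Case 1: genes as points, samples as arrows.
biplot(pca_genes)
```

So the "opposite" PCA is simply `prcomp()` on the untransposed matrix. The same object should also work with `autoplot()` from ggfortify, as long as you don't ask it to color the genes by a sample-level column.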
Related Question: Why do we transpose the data when doing PCA?
Usually, the convention is that the columns are variables and the rows are observations (I know that at least SPSS follows this convention very strictly). For each observation (row), we measure the value of all variables (columns). For example, let's look at the mpg data:
> data(mpg)
> head(mpg[,1:8])  # looking only first 8 vars for better formatting
# A tibble: 6 × 8
  manufacturer model displ  year   cyl trans      drv     cty
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int>
1 audi         a4      1.8  1999     4 auto(l5)   f        18
2 audi         a4      1.8  1999     4 manual(m5) f        21
3 audi         a4      2.0  2008     4 manual(m6) f        20
4 audi         a4      2.0  2008     4 auto(av)   f        21
5 audi         a4      2.8  1999     6 auto(l5)   f        16
There are 5 observations (5 different cars) here, and for each of them all the variables (model, year, etc.) are measured.
Now, biological data are usually presented the other way round: the observations (different samples/patients) are in columns and the variables (genes) are in rows. This is a convenient representation because we usually have many more variables (genes) than observations (samples). But it means the data have to be transposed if we want them in the conventional form (i.e. genes in columns and observations or samples in rows).
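As a rough illustration of that transposition, the sketch below turns a genes-by-samples matrix into an iris-like data frame with a grouping column. The data are simulated, and deriving `type` from the sample names assumes they follow the `Control_1` / `Cancer_1` pattern from the question.

```r
# Hypothetical toy matrix: 4 genes (rows) x 5 samples (columns).
set.seed(2)
samples <- c(paste0("Control_", 1:2), paste0("Cancer_", 1:3))
expr <- matrix(rnorm(4 * length(samples)), nrow = 4,
               dimnames = list(paste0("gene", 1:4), samples))

# Transpose: samples become rows (observations), genes become columns (variables).
pcaData <- as.data.frame(t(expr))

# Build the grouping column from the sample names (plays the role of iris$Species).
pcaData$type <- sub("_.*$", "", rownames(pcaData))

# Select the gene columns by name instead of hard-coding a column range.
gene_cols <- setdiff(names(pcaData), "type")
pca <- prcomp(pcaData[, gene_cols])

dim(pcaData)        # 5 rows (samples), 5 columns (4 genes + type)
table(pcaData$type)
```

Replicates are not a problem here: each replicate is just another row that shares its `type` value with the other replicates of the same group.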
I'm unable to embed the 2nd PCA plot, so please follow the link instead. I also hit the maximum character limit (I didn't know one existed!), so some of the rows have been removed from the head output.
Just to let you know, there is a very nice R package, "FactoMineR", which produces everything you need in one go, including axis statistics.
You can display either the "Individual Plot" (your samples), the "Variables" plot (your genes), or both together.
What's more, it handles PCA, MCA, MFA, FAMD, hierarchical clustering, etc.
It also has extensive ways to impute missing data. I use it every day and it is very robust.
FactoMineR
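A minimal sketch of what that could look like, assuming the FactoMineR package is installed (the code is skipped otherwise). The data are simulated and the object names are only for illustration.

```r
# Hypothetical data: 6 samples (rows) x 20 genes (columns) plus a grouping factor.
set.seed(4)
expr_t <- matrix(rnorm(6 * 20), nrow = 6,
                 dimnames = list(c(paste0("Control_", 1:3), paste0("Cancer_", 1:3)),
                                 paste0("gene", 1:20)))
pcaData <- as.data.frame(expr_t)
pcaData$type <- factor(sub("_.*$", "", rownames(pcaData)))

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # 'type' is declared as a supplementary qualitative variable, so it is
  # not used to build the axes.
  res <- FactoMineR::PCA(pcaData, quali.sup = ncol(pcaData),
                         ncp = 2, graph = FALSE)
  ind_coord <- res$ind$coord  # sample coordinates (individuals)
  var_coord <- res$var$coord  # gene coordinates (variables)
  eig <- res$eig              # eigenvalues and percentage of variance per axis
}
```

From there, `plot(res, choix = "ind")` should give the samples plot and `plot(res, choix = "var")` the variables plot.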
Why don't you use row hierarchical clustering if you are only interested in seeing the clusters of genes that discriminate Control vs Cancer? Principal component analysis is usually done to reduce the dimensions of a multi-dimensional problem and then project it onto 2 principal axes to see the effect. For example, if you have 1000 genes and 10 samples, it is difficult to visualize how the samples differ according to all 1000 genes, but if you project along the 2 PC axes where the variation is maximal, you might clearly see how they vary. And you will have 10 points on the PC plane, corresponding to the 10 samples. Now if you do the "opposite" PCA, you will get 1000 points, one per gene; I don't know whether that solves your problem or complicates it even further!
Yes, you are right that with 18 thousand genes it would be a mess. But what if you have only 100 or 200 genes left after a differential expression analysis? I think that would give a clearer plot. Anyway, the question is whether such a plot can be created.
I don't see any problem in creating such a plot. Conceptually, and following my earlier analogy, you are trying to plot 1000 points in a 10-dimensional space (instead of 10 points in a 1000-dimensional space). However, how well PCA can resolve the differences among these 1000 points (i.e. the variance explained by each principal component) has to be checked. My guess is that it will resolve very little of the difference. And I'd still suggest hierarchical clustering of the genes.
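Both suggestions can be tried quickly in base R. The sketch below uses simulated data (100 hypothetical genes, 6 samples), and the choices of Euclidean distance, average linkage and k = 2 are illustrative assumptions, not recommendations.

```r
# Simulated data: 100 genes (rows) x 6 samples (columns); hypothetical values.
set.seed(3)
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c(paste0("Control_", 1:3), paste0("Cancer_", 1:3))))
expr[1:20, 4:6] <- expr[1:20, 4:6] + 4  # 20 genes that are higher in Cancer

# Row (gene) hierarchical clustering on per-gene z-scores, so that genes are
# grouped by their pattern across samples rather than by absolute level.
z <- t(scale(t(expr)))
hc <- hclust(dist(z), method = "average")
clusters <- cutree(hc, k = 2)
table(clusters)
plot(hc, labels = FALSE)  # dendrogram of the genes

# "Opposite" PCA (genes as points): share of variance explained by each PC.
pca_genes <- prcomp(expr)
var_explained <- pca_genes$sdev^2 / sum(pca_genes$sdev^2)
round(var_explained, 2)
```

If the first two entries of `var_explained` are small, the 2D plot of the genes will not tell you much, which is the check mentioned above.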
Ok, now coming to your problem: You can follow exactly this https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
You don't need to transpose the matrix. Try to see the similarity between the iris data plotted there and your own data.
I promise to try the hierarchical clustering :-). As for this specific link, I have already seen it. The difference between that dataset and mine is that the iris dataset doesn't seem to have any replicates for its samples, and it also has a last column called 'species' that is used later to distinguish the groups by color. I don't have such a column. So the problem is how to create such a data frame (like iris) from my dataset.
Hmm, I see your problem now. So the names of your Cancer / Control samples should be the rownames.
Obviously, the genes have to be in columns and the Cancer / Control samples in rows for the above to work.
By saying
"the genes are in column and Cancer / Control are in rows for above to work"
doesn't that mean that a transposition of the initial data frame is needed? Anyway. To be clear, I got a little more confused now. Either you did what I have already done and posted in the initial post, or you did something that I couldn't understand. So if you have the time and you want to, please post more complete code. And thanks a lot for this conversation :-)
Yes, you are right, I'm messing everything up :) I'll post complete code once my head is a bit freer, as I think I am getting the idea of what you want.