Question: Extracting information from Principal component analysis
Diana770 wrote:

Hello everyone!

I'm running PCA (principal component analysis) on a set of 1000 genes in 4 different samples to see whether the data split into groups. My data look like this:
   

id     sample1  sample2  sample3  sample4
gene1  2        0        1        1
gene2  1        2        0        3
gene3  2        2        4        2
gene4  3        1        7        0

My code is very simple:

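# read the table, using the first column (gene ids) as row names
data <- read.csv("exp.csv", row.names = 1)

# numeric matrix of genes (rows) by samples (columns)
mat <- data.matrix(data)

# PCA on all four sample columns, each scaled to unit variance
pca <- prcomp(mat, scale. = TRUE)

library(ggplot2)

# create data frame with scores (one row per gene)
scores <- as.data.frame(pca$x)

# plot of observations, labelled with the gene ids
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot")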

When I plot PC1 against PC2 I clearly see a separation: the genes are divided into 2 big groups. But how can I find out which genes make up each of these 2 clusters? Lots of genes overlap in the plot, so it's difficult to make out the gene names from the plot alone. How can I extract the cluster members from the PCA results and save them as a text file?

EDIT: For the above code, can someone tell me how to colour the dots in the plot according to the sample? I tried changing the colour parameter in ggplot but it's not working.

 

Thanks!!

modified 2.1 years ago by benoit.tessoulin30 • written 4.2 years ago by Diana770

FYI, no one receives a notice when you edit a post. So the likelihood of someone responding to the edit when there are already answers present is low.

Regarding the edit, you can specify colors by adding a new column to the scores data.frame that contains either sample names or even just factor(1:nrow(scores)). Then specify that column as the color (well, "colour", since it uses the British spelling).

written 4.2 years ago by Devon Ryan89k

Hi Devon...thanks for letting me know about the edit. I tried factor(c(1:nrow(scores))), but that colours every gene differently, whereas I want to colour each gene by the sample it contributes to most. In the final PCA plot I see 2 big clusters of genes, so I wanted to colour them and see which sample each gene was coming from...

written 4.2 years ago by Diana770

Ah, I see. The genes are coming from all of the samples at the same time, so it's unclear what you actually mean.

written 4.2 years ago by Devon Ryan89k

Sorry, maybe I didn't explain properly. Yes, the genes come from all the samples at the same time, but they have different values in each sample. So is there no way to colour them according to sample? For example, red for genes with the highest expression in sample1, green for those with the highest expression in sample2, and so on?

written 4.2 years ago by Diana770

Just create a vector with that information. You have a matrix of values, so just process it to determine which sample to assign each gene to.
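One way this could look, as a rough sketch rather than the only approach: for each gene, take the sample with the highest value and store it as a factor column to colour by. The toy matrix is copied from the question, and the names expr and top_sample are made up for this illustration. Ties are assigned to the first sample with the maximum, which is an arbitrary choice.

```r
library(ggplot2)

# Toy data copied from the question; the real input would be the full gene x sample matrix.
expr <- matrix(c(2, 0, 1, 1,
                 1, 2, 0, 3,
                 2, 2, 4, 2,
                 3, 1, 7, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)

# For each gene (row), find the sample (column) with the highest value.
# ties.method = "first" assigns ties to the first such sample (an arbitrary choice).
top_idx <- max.col(expr, ties.method = "first")
scores$top_sample <- factor(colnames(expr)[top_idx], levels = colnames(expr))

# Map the new column to colour; points overlap less than text labels do.
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = top_sample)) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_point(alpha = 0.8, size = 2) +
  ggtitle("PCA plot, coloured by highest-expressing sample")
print(p)
```

Note that the colour has to go inside aes() to vary per gene; setting colour = "tomato" inside the geom fixes a single colour for every point.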
 

written 4.2 years ago by Devon Ryan89k

OK. Thanks!

written 4.2 years ago by Diana770
Jeremy Leipzig18k wrote:

The keyword you might be searching for is "loadings":

http://stackoverflow.com/questions/12760108/principal-components-analysis-how-to-get-the-contribution-of-each-paramete
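For reference, a small sketch of where prcomp() keeps these, using the toy values from the question (expr is a made-up name). One caveat with this layout: the genes are the rows, so the loadings in pca$rotation describe how each sample contributes to each component, while the per-gene coordinates are the scores in pca$x.

```r
# Toy data copied from the question (genes in rows, samples in columns).
expr <- matrix(c(2, 0, 1, 1,
                 1, 2, 0, 3,
                 2, 2, 4, 2,
                 3, 1, 7, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

pca <- prcomp(expr, scale. = TRUE)

# Loadings: one row per input variable (here, per sample), one column per component.
loadings <- pca$rotation
print(round(loadings, 3))

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores: one row per gene, i.e. the coordinates that get plotted.
print(round(pca$x[, 1:2], 3))
```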

written 4.2 years ago by Jeremy Leipzig18k
Devon Ryan89k wrote:

If you have a clear separation, then you can simply threshold the scores data.frame according to that. I don't recall if prcomp() adds row names to its output, but if not then things should be in the same order as the input.
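Something along these lines might do it. This is only a sketch on the toy values from the question: the cutoff of 0 on PC1, the group labels and the output file name are placeholders, and the real cutoff should be read off the gap in your own plot. When the input matrix has row names, prcomp() carries them through to pca$x, so the gene ids are available directly.

```r
# Toy data copied from the question (genes in rows, samples in columns).
expr <- matrix(c(2, 0, 1, 1,
                 1, 2, 0, 3,
                 2, 2, 4, 2,
                 3, 1, 7, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)

# Placeholder cutoff: pick the value that falls in the gap between the two groups.
cutoff <- 0
scores$group <- ifelse(scores$PC1 > cutoff, "group_A", "group_B")

# Gene ids, first two components and the assigned group, one gene per line.
out <- data.frame(gene = rownames(scores),
                  scores[, c("PC1", "PC2", "group")],
                  row.names = NULL)

# Written to a temporary directory here; swap in your own path.
out_file <- file.path(tempdir(), "pca_groups.txt")
write.table(out, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The gene names of one group are then just out$gene[out$group == "group_A"].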

written 4.2 years ago by Devon Ryan89k

Thanks it worked!

written 4.2 years ago by Diana770
Jean-Karim Heriche18k wrote:

You could cluster the genes in PCA space, i.e. use the scores as input to the clustering algorithm.
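As an illustration only, here is what that could look like with k-means on the first two components. The choice of k-means and of 2 centers is an assumption based on the "2 big groups" in the question, not something prescribed above; any clustering method that takes a numeric matrix would fit in the same place. The toy values are copied from the question.

```r
# Toy data copied from the question (genes in rows, samples in columns).
expr <- matrix(c(2, 0, 1, 1,
                 1, 2, 0, 3,
                 2, 2, 4, 2,
                 3, 1, 7, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)

# k-means uses random starting centers, so fix the seed for reproducibility.
set.seed(1)

# Cluster the genes on their PC1/PC2 coordinates; centers = 2 is an assumption.
km <- kmeans(scores[, c("PC1", "PC2")], centers = 2, nstart = 10)

# One character vector of gene ids per cluster.
clusters <- split(rownames(scores), km$cluster)
print(clusters)
```

The km$cluster vector can also be added to scores as a factor and mapped to colour in the plot.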

written 4.2 years ago by Jean-Karim Heriche18k
The100 wrote:

Check if this helps:
http://stats.stackexchange.com/questions/115032/how-to-find-which-variables-are-most-correlated-with-the-first-principal-compone

written 4.2 years ago by The100
benoit.tessoulin30 wrote:

Hi, check out FactoMineR, a very useful package for PCA (and MCA, MFA, FAMD...) in R. It gives great output, both statistical and graphical.

written 2.1 years ago by benoit.tessoulin30