Extracting information from Principal component analysis
8.2 years ago
Diana ▴ 900

Hello everyone!

I'm doing PCA (principal component analysis) on a set of 1000 genes in 4 different samples to see if there's any split in the data. My data looks like this:

id       sample1     sample2     sample3     sample4
gene1    2           0           1           1
gene2    1           2           0           3
gene3    2           2           4           2
gene4    3           1           7           0


My code is very simple:

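data <- read.csv("exp.csv")

# numeric matrix of expression values: genes in rows, the 4 samples in columns 2 to 5
expr <- data.matrix(data[, 2:5])
rownames(expr) <- as.character(data$id)

# PCA on the genes, with each sample column scaled to unit variance
pca <- prcomp(expr, scale. = TRUE)

library(ggplot2)

# create data frame with scores (one row per gene)
scores <- as.data.frame(pca$x)

# plot of observations, labelled by gene id
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot")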


When I plot PC1 against PC2 I clearly see a separation: the genes are divided into 2 big groups. But how can I see which genes make up these 2 clusters? In the plot lots of genes overlap with each other, so it's difficult to make out the gene names from the plot alone. How can I extract the cluster members from the PCA results and save them as a text file?

EDIT: For the above code, can someone tell me how I can colour the dots in the plot according to the sample? I tried changing the colour parameter in ggplot but it's not working.

Thanks!!

PCA R RNA-Seq • 11k views

FYI, no one receives a notice when you edit a post. So the likelihood of someone responding to the edit when there are already answers present is low.

Regarding the edit, you can specify colors by adding a new column to the scores data.frame that contains either sample names or even just factor(c(1:nrow(scores))). Then map that column to the color aesthetic (well, "colour", since it uses the British spelling).


Hi Devon...thanks for letting me know about the edit. I tried the factor(c(1:nrow(scores))) but that colours all the genes differently, whereas I wanted to colour them based on the sample that each gene contributes to most. In the final PCA plot I do see 2 big clusters of genes, so I wanted to colour them and see which sample each gene was coming from...


Ah, I see. The genes are coming from all of the samples at the same time, so it's unclear what you actually mean.


Sorry, maybe I didn't explain properly. Yes, the genes are coming from all the samples at the same time, but they have different values in each sample. So is there no way to colour them according to samples? Like red for genes with the highest expression in sample1, green for those with the highest expression in sample2, and so on...?


Just create a vector with that information. You have a matrix of values, so just process it to determine which sample to assign each gene to.
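A minimal sketch of that idea, in case it helps. It assumes "the sample to assign a gene to" means the sample with the highest value for that gene; the simulated matrix `expr` and the column name `top_sample` are made up for illustration and stand in for the real exp.csv data.

```r
# Made-up expression matrix standing in for exp.csv (genes in rows, 4 samples)
set.seed(42)
expr <- matrix(rpois(200 * 4, lambda = 5), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:4)))

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)

# For each gene, the name of the sample with the highest value.
# ties.method = "first" picks the left-most sample when values are tied.
scores$top_sample <- factor(
  colnames(expr)[max.col(expr, ties.method = "first")],
  levels = colnames(expr)
)

# How many genes were assigned to each sample
print(table(scores$top_sample))

# Colour the points by that vector (only runs if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(scores, aes(x = PC1, y = PC2, colour = top_sample)) +
    geom_point(alpha = 0.8) +
    ggtitle("PCA plot, coloured by sample of highest expression")
  print(p)
}
```

With real data, ties are worth checking first, since a gene with the same value in two samples is assigned somewhat arbitrarily here.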


OK. Thanks!

8.2 years ago

If you have a clear separation, then you can simply threshold the scores data.frame according to that. I don't recall if prcomp() adds row names to its output, but if not then things should be in the same order as the input.
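As a rough sketch of the thresholding, assuming the two groups separate along PC1 and that 0 is a sensible cut-off (both are assumptions; read the real cut-off off your own plot). The simulated data, the `group` column, and the output file name are made up for illustration.

```r
# Made-up data with two groups of genes: the first 100 are high in
# samples 1-2, the last 100 are high in samples 3-4
set.seed(42)
expr <- rbind(
  matrix(rpois(100 * 4, lambda = rep(c(20, 20, 2, 2), each = 100)), nrow = 100),
  matrix(rpois(100 * 4, lambda = rep(c(2, 2, 20, 20), each = 100)), nrow = 100)
)
dimnames(expr) <- list(paste0("gene", 1:200), paste0("sample", 1:4))

pca <- prcomp(expr, scale. = TRUE)
scores <- as.data.frame(pca$x)   # row names are carried over from expr

# Hypothetical cut-off on PC1; adjust to wherever the gap is in your plot
cutoff <- 0
scores$group <- ifelse(scores$PC1 > cutoff, "group1", "group2")

# One row per gene: id, the first two PCs, and the assigned group
out <- data.frame(gene = rownames(scores), scores[, c("PC1", "PC2", "group")])

# Written to a temporary directory here; use any path you like
out_file <- file.path(tempdir(), "pca_gene_groups.txt")
write.table(out, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

If the input matrix has no row names, the rows of pca$x are in the same order as the input, so the gene ids can be taken from the original id column instead.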


Thanks it worked!

8.2 years ago

You could cluster the genes in PCA space, i.e. use the scores as input to the clustering algorithm.
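One possible way to do this, sketched with k-means on the first two PCs. The choice of k-means, of 2 clusters, and of using only PC1 and PC2 are assumptions for illustration, as is the simulated data; any clustering method that takes a numeric matrix would fit here.

```r
# Made-up data with two groups of genes (stand-in for the real matrix)
set.seed(42)
expr <- rbind(
  matrix(rpois(100 * 4, lambda = rep(c(20, 20, 2, 2), each = 100)), nrow = 100),
  matrix(rpois(100 * 4, lambda = rep(c(2, 2, 20, 20), each = 100)), nrow = 100)
)
dimnames(expr) <- list(paste0("gene", 1:200), paste0("sample", 1:4))

pca <- prcomp(expr, scale. = TRUE)

# Cluster the genes using their scores on the first two PCs
km <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)

# km$cluster is a vector of cluster labels named by gene;
# split it into one character vector of gene ids per cluster
genes_by_cluster <- split(names(km$cluster), km$cluster)
print(lengths(genes_by_cluster))
```

The cluster labels (1 and 2) are arbitrary, so which group gets which number can change between runs.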

8.2 years ago
The ▴ 180
6.0 years ago

Hi, check FactoMineR, a very useful package for PCA (and MCA, MFA, FAMD...) in R. It gives great outputs, both statistical and graphical.
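For a feel of what that looks like, here is a small sketch using FactoMineR's PCA() function on made-up data. The simulated matrix is an assumption standing in for the real expression table, and the block only runs the PCA if the package is installed.

```r
# Made-up expression table (genes in rows, 4 samples in columns)
set.seed(1)
expr <- as.data.frame(
  matrix(rpois(50 * 4, lambda = 5), nrow = 50,
         dimnames = list(paste0("gene", 1:50), paste0("sample", 1:4)))
)

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # scale.unit = TRUE scales each column to unit variance;
  # graph = FALSE suppresses the automatic plots
  res <- FactoMineR::PCA(expr, scale.unit = TRUE, graph = FALSE)

  # Coordinates of each gene (row) on the principal components
  coords <- res$ind$coord
  print(head(coords[, 1:2]))

  # Eigenvalues and percentage of variance per component
  print(res$eig)
}
```

The gene coordinates in `res$ind$coord` play the same role as `pca$x` from prcomp(), so they can be thresholded or clustered in the same way.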