Question

performing a PCA on a gene set in R

3

Entering edit mode

6.4 years ago

salvatore.raieli2 ▴ 90

Hello everyone,

I want to perform a PCA on a dataset, I used this example to try:

Dataset example

the col names are representing my samples (3 for the controls, 3 for the drug treatment). the rows are the genes I tried to use this:

 pca = prcomp(test)
 pca
 ggbiplot(pca)

with this result:

PCA obtained

what I would like to do is a PCA where each dot represent a sample and to color them according CTLR/DRUG.

thank you for your help.

R • 15k views

ADD COMMENT • link updated 6.4 years ago by cpad0112 21k • written 6.4 years ago by salvatore.raieli2 ▴ 90

1

Entering edit mode

Follow @Kevin's guides in:
PCA plot from read count matrix from RNA-Seq
PCA in a RNA seq analysis

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

Thanks!

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

We cannot see the resulting plot. could you edit your question.

Anyway if you have a dataframe as :

group  PCA1   PCA2

where group is either "CTLR" or "DRUG". You can do :

ggplot(df,aes(x=PCA1,y=PCA2,col=group))+geom_point()

ADD REPLY • link 6.4 years ago by Nicolas Rosewick 11k

0

Entering edit mode

salvatore.raieli2 : You appear to have linked the same image in both locations.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

I edited, do you visualized the two images correctly? the first image is a link for the dataset image.

ADD REPLY • link 6.4 years ago by salvatore.raieli2 ▴ 90

0

Entering edit mode

Looks good now.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

I tryed to do this:

data(iris)
# log transform 
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly 
# advisable, but default is FALSE. 
ir.pca <- prcomplog.ir,
                 center = TRUE,
                 scale. = TRUE) 

library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

I have applied this in my case:

# I transposed
test <- t(test)

#then I created a vector called x, since transposing is putting as matrix row names the col names:
x <- rownames(test)

then I followed as before:
g <- ggbiplot(pca, obs.scale = 1, var.scale = 1, 
              groups = x, ellipse = TRUE, 
              circle = F, var.axes = F)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
g

if I do not put groups = x I have this PCA results, but they are not colored as different samples:

PCA

otherwise I get this error:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric.

How I can solve this?

ADD REPLY • link 6.4 years ago by salvatore.raieli2 ▴ 90

0

Entering edit mode

Please post your example data as data (not as image). colmeans(as.numeric(rownames(x)) should work assuming that row. names are numbers stored as chars.

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

salvatore.raieli2 : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

Submit Answer should only be used for new answers for the original question.

ADD REPLY • link 6.4 years ago by GenoMax 141k

score 9 · Answer 1 · 2017-12-13

9

Entering edit mode

6.4 years ago

cpad0112 21k

Simulate some data for drug and ctrl:

normal_data=sapply(seq(1:3), function(x) x=rnorm(10,6,1))
tumor_data=sapply(seq(1:3), function(x) x=rnorm(10,10,2))
data=cbind(normal_data, tumor_data)
genes=paste("gene",seq(1:10), sep="")
row.names(data)=genes
colnames(data)=c(paste0("ctrl_",seq(1:3)), paste0("drug_",seq(1:3)))
condition=rep(c("ctrl","drug"), each=3)

Do PCA on dataframe:

pca_data=prcomp(t(data))

Do some calculation (% variance covered by component to show on X and Y axis for PC1 and PC2)

pca_data_perc=round(100*pca_data$sdev^2/sum(pca_data$sdev^2),1)

Create dataframe with PC1 and PC2 with some metadata:

df_pca_data=data.frame(PC1 = pca_data$x[,1], PC2 = pca_data$x[,2], sample = colnames(data), condition=condition)

Plot using ggplot2 (color by each sample):

library(ggplot2)
    ggplot(df_pca_data, aes(PC1,PC2, color = sample))+
  geom_point(size=8)+
  labs(x=paste0("PC1 (",pca_data_perc[1],")"), y=paste0("PC2 (",pca_data_perc[2],")"))

Plot using ggplot2 (color by each group):

 ggplot(df_pca_data, aes(PC1,PC2, color = condition))+
      geom_point(size=8)+
      labs(x=paste0("PC1 (",pca_data_perc[1],")"), y=paste0("PC2 (",pca_data_perc[2],")"))

ADD COMMENT • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

Im trying to make this condition data frame these are my sample names

 "H1"    "H2"    "H3"    "H4"    "M2"    "M3"    "M4"    "C1"    "C2"    "C3"    "C4"   
     "G2"    "G3"    "G4"    "Gran1" "Gran2"

This is the order how columns are arranged where HSC is [H1-H4} , Monocyte is [M2-M4] , CMP is [C2-C4] ,GMP is [G2-G4], and Granulocyte is [Gran1- Gran2]

I want to do this

 condition=rep(c("ctrl","drug"), each=3)

but my sample are variable in number , as you can see I want to make them as test vs control or healthy vs control as i would be including AML sample as well

Any help would be appreciated .

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Please furnish grouping information for AML,drug, normal.

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

This is the order how columns are arranged where HSC is [H1-H4} , Monocyte is [M2-M4] , CMP is [C2-C4] ,GMP is [G2-G4], and Granulocyte is [Gran1- Gran2]

Rest I have AML samples like LSC,Blast, each 8 sample each .

So HSC is [H1-H4} , Monocyte is [M2-M4] , CMP is [C2-C4] ,GMP is [G2-G4], and Granulocyte is [Gran1- Gran2] this would be my control

AML samples like LSC,Blast, each 8 sample each this would be my disease or AML sample

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

samples=c("H1", "H2","H3","H4","M2","M3","M4","C1","C2","C3","C4" ,"G2","G3","G4","Gran1","Gran2", rep(c("LSC","Blast"), each=8))
condition=rep(c("normal","disease/AML"), each=16)
df_final=data.frame(samples, condition)

> df_final
   samples   condition
1       H1      normal
2       H2      normal
3       H3      normal
4       H4      normal
5       M2      normal
6       M3      normal
7       M4      normal
8       C1      normal
9       C2      normal
10      C3      normal
11      C4      normal
12      G2      normal
13      G3      normal
14      G4      normal
15   Gran1      normal
16   Gran2      normal
17     LSC disease/AML
18     LSC disease/AML
19     LSC disease/AML
20     LSC disease/AML
21     LSC disease/AML
22     LSC disease/AML
23     LSC disease/AML
24     LSC disease/AML
25   Blast disease/AML
26   Blast disease/AML
27   Blast disease/AML
28   Blast disease/AML
29   Blast disease/AML
30   Blast disease/AML
31   Blast disease/AML
32   Blast disease/AML

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

I ran into this error because I have different sample number as you showed ,may be im doing something wrong which I can;t get it

samples=c("H1","H2","H3","H4", rep(c("Blast",10),rep("LSC",8)))


Error in rep(c("Blast", 10), rep("LSC", 8)) : invalid 'times' argument
In addition: Warning message:
NAs introduced by coercion

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Dear kind Sir, please try this instead:

c("H1","H2","H3","H4", rep("Blast",10),rep("LSC",8))

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

that was really a stupid thing to ask ..for..i see what i was doing wrong

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

No problem Sire

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

just out of curiosity i used this factorminer library for the same data set i get results such as Dim 1 and DIm 2 ,is it the same what comes in the prcomp function when we plot the Principal components , if that is the case then when I using the above code given in this thread I get PC1 30 where as in case of factorminer then i get 83 %....am bit confused now...

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

'Dimension' is a term usually used in relation to k-means clustering and multidimensional scaling. How are you running the command?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

just the way they have given in their vignette im posting the code..

library(FactoMineR)
library(factoextra)
X<- read.csv('UNIQUE_EPI.txt',header = T,row.names = 1)

res.pca <- PCA(X, graph = FALSE)
print(res.pca)

eig.val <- get_eigenvalue(res.pca)
eig.val
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
var <- get_pca_var(res.pca)
var

fviz_pca_var(res.pca, col.var = "black")

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Okay, I do not see any reason to suggest that 'dimension' and 'principal component' are not the same thing here, i.e., do not worry about the name 'dimension'.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

okay but the PC1 and DIM one are varying considerably is that normal? I mean it could change the biological meaning as well...PC1 i see 30 % but DIM 1 i see 84% ..

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Can you show the source exactly of these figures?

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

image from factorminer pca factorminer

image from pca prcomp prcomp

ADD REPLY • link 6.2 years ago by 1769mkc ★ 1.2k