Question

Filtering gene expression correlation matrix in R

1

Entering edit mode

9.5 years ago

mjoyraj ▴ 80

I have a 120 by 120 correlation matrix (Correlation_mat) obtained by comparing FPKM gene expression data of 120 genes at different time-series with cor test of R. I want to visualize the correlation results in histogram and then want to filter the top 1% correlations from the matrix. Then I want to divide the top 1% correlations into cluster based on mutual correlations. Can anybody help me with the R script?

EDIT: Moved additional information provided by OP from a comment

1. I have generated correlation matrix with the below scripts

#Data import
Rdata <- read.table(file.choose(), header=TRUE, sep=",")

#Generating correlation matrix with the 'Rdata'
cor(Rdata, method="pearson")
Correlation_mat <- cor(Rdata)

2. Then I have filtered the correlations based on a cutoff correlation coefficient (r) value

#Filtering based on Correlation value
n <- ncol(Correlation_mat)
cmat <- col(Correlation_mat)
ind <- order(-cmat, Correlation_mat, decreasing = TRUE) - (n * cmat - n)
dim(ind) <- dim(Correlation_mat)
colnames(ind) <- colnames(Correlation_mat)
out <- cbind(ID = c(col(ind)), ID2 = c(ind))
as.data.frame(cbind(out, cor = Correlation_mat[out]))
Final<- as.data.frame(cbind(out, cor = Correlation_mat[out]))
f=Final

for (i in 1:81) {
  for (j in 1:2){f[i,j]=row.names(Correlation_mat)[Final[i,j]] }
}

f[f[,3]>.8,]
a=f[f[,3]>.8,]

Therefore, in the file 'a' I have filtered correlation value like this

row.names    ID                    ID2                   cor
1            ENSGALG00000001744    ENSGALG00000001744    1
10           ENSGALG00000028599    ENSGALG00000028599    1
19           ENSGALG00000024138    ENSGALG00000024138    1
20           ENSGALG00000024138    Chr25_Ktn4            0.9668402
21           ENSGALG00000024138    ENSGALG00000016507    0.8607992
28           ENSGALG00000024029    ENSGALG00000024029    1
37           ENSGALG00000016507    ENSGALG00000016507    1
38           ENSGALG00000016507    ENSGALG00000024138    0.8607992
46           ENSGALG00000026609    ENSGALG00000026609    1
55           ENSGALG00000022277    ENSGALG00000022277    1
64           ENSGALG00000027059    ENSGALG00000027059    1
73           Chr25_Ktn4            Chr25_Ktn4            1
74           Chr25_Ktn4            ENSGALG00000024138    0.96684

I want to generate cluster with the above correlations. Actually my idea is pairs which are mutually correlated will be in the same cluster. Next I want to do GO enrichment analysis. Is there any suggestion for cluster generalization (visualization) and then GO enrichment analysis on the clusters

R RNA-Seq • 12k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

2

Entering edit mode

What have you tried? It unlikely that we'll just supply you with the R code if you can't demonstrate having actually made an attempt yourself.

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

Ram · Answer 1 · 2014-11-12

2

Entering edit mode

9.5 years ago

Sam ★ 4.7k

It might be a good idea if you just edit your original post.

To do what you want, the simplest way to do will be

hclust(as.dist(1-abs(Correlation_mat)))

which will perform hierarchical clustering for you. The hclust is the R function for the clustering where as.dist means that your matrix is the distance matrix (therefore the 1-)

Some other posts of yours suggests that you are trying to do something along the line of WGCNA, you can always follow their tutorial

However, with only 120 genes, I am not sure if you can find any cluster

P.S For your original method, you will very likely only obtain the diagonal of the correlation matrix, which should always be 1 (as you can see from your output).

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Sam ★ 4.7k

0

Entering edit mode

Thanks. How to visualize the 'hclust' results...?? It gives values (list of 7)

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

Please use the add comment feature

You can use heatmap.2 from the gplots library.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Sam ★ 4.7k

0

Entering edit mode

I have used the below script to generate the plots and heatmap

hclust(as.dist(1-abs(Correlation_mat)))
hc <- hclust(as.dist(1-abs(Correlation_mat)))
plot(hc)

IN GPLOTS

install.packages("gplots")
library(gplots)
heatmap.2(Correlation_mat, main="Hierarchical Cluster", dendrogram="column",trace="none",col=greenred(10))

plot(hc) is definitely based on hierarchical clustering whether heatmap.2 also made hierarchical clustering as default..??

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

1

Entering edit mode

You'll want to use something more like heatmap.2(1-abs(Correlation_mat), distfun=as.dist, trace="none"). That function calls hclust internally and this should produce the same clustering that you got with hc.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you very much Devon. It worked like a charm.

Well the correlation result produced is not having P-values. Is there any way to incorporate P-value. discard insignificant results from the correlation matrix and then go for clustering with the significant correlations..??

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

You'd need to define significant correlation first and then filter rows and columns accordingly to keep the matrix symmetric (it's not like you can just filter values out). With only a 120x120 matrix, this seems like more trouble than it's worth.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Yes you are right but I am not expert enough in R to do this. I started R two months back and this task is really tricky. I tried the following script below to define significant correlations and made a plot however, I could not do the same for filtering the correlation results in matrix

library("corrplot")
Correlation_mat <- cor(Rdata, method="pearson")

cor.mtest <- function(mat, conf.level = 0.95) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}
res1 <- cor.mtest(Rdata, 0.95)

corrplot(Correlation_mat, p.mat = res1[[1]], sig.level = 0.2)

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

The general idea is to ask, "are any of the p-values in a i row and column i significant?". If not, you can remove that row and column. Since the matrix is symmetric, you can just test the rows with apply(p.mat, 1, function(x) any(x > threshold)), where threshold is your p-value cutoff.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

What is x here?

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

It's like mat in your cor.mtest function. In this context, it'll hold the vector of values for a given row. BTW, try to avoid for loops in functional languages like R (I realize this will seem strange if you're used to things like perl/python/C/etc.). They're really slow.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Devon, thanks for your suggestion. Can you elaborate a little more in the form of script, because when I tried it showed p.mat not found. So can you write the full script or else can you say where in my script it should be added?

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

I'll see if I have time to slap something together today. I'm a bit busy at the moment, so no promises.

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

If I read the script correctly, it should be:

res1$p.mat instead of just p.mat because it should be the element within the object

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Sam ★ 4.7k

0

Entering edit mode

It's showing

Error in apply(res1$p.mat, 1, function(x) any(x > 0.2)) : 
  dim(X) must have a positive length

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

Well I found a probable solution of screening significant correlations with p-values. res[[1]] gives the p-values of the correlations in a symmetric manner. I used 'cellnote=Pval' (where Pval is the p-value) function of heatmap.2 to add the p-values in the cells of heatmap. So in the heatmap cluster if any cell corresponding the genes do not have significant p-value I will not consider it in the cluster.

ADD REPLY • link 9.5 years ago by mjoyraj ▴ 80

Ram · Answer 2 · 2014-11-12

0

Entering edit mode

9.5 years ago

newlife.well.2014 ▴ 10

Please check caret package in R and use the following function:

findCorrelation(x, cutoff = .90, verbose = FALSE)

Hope this helps.

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.5 years ago by newlife.well.2014 ▴ 10

0

Entering edit mode

Thanks for your suggestion, whether it will filter correlations considering only significant P-values..??

ADD REPLY • link 9.5 years ago by mjoyraj ▴ 80

0

Entering edit mode

For this function, x is a correlation matrix. Maybe you can construct your correlation matrix based on P-values. In addition, you can check the source code of caret, which is very helpful.

ADD REPLY • link 9.5 years ago by newlife.well.2014 ▴ 10