Question: Filtering gene expression correlation matrix in R
0
4.2 years ago by
mjoyraj50
Taiwan
mjoyraj50 wrote:

I have a 120 by 120 correlation matrix (Correlation_mat) obtained by comparing FPKM gene expression data of 120 genes at different time-series with cor test of R. I want to visualize the correlation results in histogram and then want to filter the top 1% correlations from the matrix. Then I want to divide the top 1% correlations into cluster based on mutual correlations. Can anybody help me with the R script?

rna-seq R • 6.7k views
modified 4.2 years ago by newlife.well.201410 • written 4.2 years ago by mjoyraj50
2

What have you tried? It unlikely that we'll just supply you with the R code if you can't demonstrate having actually made an attempt yourself.

1
4.2 years ago by
Sam2.2k
London
Sam2.2k wrote:

It might be a good idea if you just edit your original post.

To do what you want, the simplest way to do will be

`hclust(as.dist(1-abs(Correlation_mat))) `

which will perform hierarchical clustering for you. The hclust is the R function for the clustering where as.dist means that your matrix is the distance matrix (therefore the 1-)

Some other posts of yours suggests that you are trying to do something along the line of WGCNA, you can always follow their tutorial

However, with only 120 genes, I am not sure if you can find any cluster

P.S For your original method, you will very likely only obtain the diagonal of the correlation matrix, which should always be 1 (as you can see from your output).

0
4.2 years ago by
mjoyraj50
Taiwan
mjoyraj50 wrote:

1) I have generated correlation matrix with the below scripts

#Data import

#Generating correlation matrix with the 'Rdata'

cor(Rdata, method="pearson")

Correlation_mat <- cor(Rdata)

2) Then I have filtered the correlations based on a cutoff correlation coefficient (r) value

#Filtering based on Correlation value

n <- ncol(Correlation_mat)

cmat <- col(Correlation_mat)

ind <- order(-cmat, Correlation_mat, decreasing = TRUE) - (n * cmat - n)

dim(ind) <- dim(Correlation_mat)

colnames(ind) <- colnames(Correlation_mat)

out <- cbind(ID = c(col(ind)), ID2 = c(ind))

as.data.frame(cbind(out, cor = Correlation_mat[out]))

Final<- as.data.frame(cbind(out, cor = Correlation_mat[out]))

f=Final

for (i in 1:81) {

for (j in 1:2){f[i,j]=row.names(Correlation_mat)[Final[i,j]]  }

}

f[f[,3]>.8,]

a=f[f[,3]>.8,]

Therefore, in the file ‘a’ I have filtered correlation value like this

row.names

ID

ID2

cor

1

ENSGALG00000001744

ENSGALG00000001744

1.0000000

10

ENSGALG00000028599

ENSGALG00000028599

1.0000000

19

ENSGALG00000024138

ENSGALG00000024138

1.0000000

20

ENSGALG00000024138

Chr25_Ktn4

0.9668402

21

ENSGALG00000024138

ENSGALG00000016507

0.8607992

28

ENSGALG00000024029

ENSGALG00000024029

1.0000000

37

ENSGALG00000016507

ENSGALG00000016507

1.0000000

38

ENSGALG00000016507

ENSGALG00000024138

0.8607992

46

ENSGALG00000026609

ENSGALG00000026609

1.0000000

55

ENSGALG00000022277

ENSGALG00000022277

1.0000000

64

ENSGALG00000027059

ENSGALG00000027059

1.0000000

73

Chr25_Ktn4

Chr25_Ktn4

1.0000000

74

Chr25_Ktn4

ENSGALG00000024138

0.96684

I want to generate cluster with the above correlations. Actually my idea is pairs which are mutually correlated will be in the same cluster. Next I want to do GO enrichment analysis. Is there any suggestion for cluster generalization (visualization) and then GO enrichment analysis on the clusters

0
4.2 years ago by
mjoyraj50
Taiwan
mjoyraj50 wrote:

Thanks. How to visualize the 'hclust' results...?? It gives values (list of 7)

You can use heatmap.2 from library gplots

0
4.2 years ago by
mjoyraj50
Taiwan
mjoyraj50 wrote:

I have used the below script to generate the plots and heatmap

hclust(as.dist(1-abs(Correlation_mat)))
hc <- hclust(as.dist(1-abs(Correlation_mat)))
plot(hc)

IN GPLOTS

install.packages("gplots")
library(gplots)
heatmap.2(Correlation_mat, main="Hierarchical Cluster", dendrogram="column",trace="none",col=greenred(10))

plot(hc) is definitely based on hierarchical clustering whether heatmap.2 also made hierarchical clustering as default..??

1

You'll want to use something more like `heatmap.2(1-abs(Correlation_mat), distfun=as.dist, trace="none")`. That function calls `hclust` internally and this should produce the same clustering that you got with `hc`.

Thank you very much Devon. It worked like a charm.

Well the correlation result produced is not having P-values. Is there any way to incorporate P-value. discard insignificant results from the correlation matrix and then go for clustering with the significant correlations..??

You'd need to define significant correlation first and then filter rows and columns accordingly to keep the matrix symmetric (it's not like you can just filter values out). With only a 120x120 matrix, this seems like more trouble than it's worth.

Yes you are right but I am not expert enough in R to do this. I started R two months back and this task is really tricky. I tried the following script below to define significant correlations and made a plot however, I could not do the same for filtering the correlation results in matrix

library("corrplot")
Correlation_mat <- cor(Rdata, method="pearson")

cor.mtest <- function(mat, conf.level = 0.95) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
diag(p.mat) <- 0
diag(lowCI.mat) <- diag(uppCI.mat) <- 1
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
p.mat[i, j] <- p.mat[j, i] <- tmp\$p.value
lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp\$conf.int[1]
uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp\$conf.int[2]
}
}
return(list(p.mat, lowCI.mat, uppCI.mat))
}
res1 <- cor.mtest(Rdata, 0.95)

corrplot(Correlation_mat, p.mat = res1[[1]], sig.level = 0.2)

The general idea is to ask, "are any of the p-values in a i row and column i significant?". If not, you can remove that row and column. Since the matrix is symmetric, you can just test the rows with `apply(p.mat, 1, function(x) any(x > threshold))`, where `threshold` is your p-value cutoff.

What is X here...??

It's like `mat` in your `cor.mtest` function. In this context, it'll hold the vector of values for a given row. BTW, try to avoid `for` loops in functional languages like R (I realize this will seem strange if you're used to things like perl/python/C/etc.). They're really slow.

Devon, thanks for your suggestion. Can you elaborate a little more in the form of script, because when I tried it showed p.mat not found. So can you write the full script or else can you say where in my script it should be added...??

I'll see if I have time to slap something together today. I'm a bit busy at the moment, so no promises.

If I read the script correctly, it should be:

`res1\$p.mat `instead of just `p.mat` because it should be the element within the object

Its showing..

Error in apply(res1\$p.mat, 1, function(x) any(x > 0.2)) :
dim(X) must have a positive length

Well I found a probable solution of screening significant correlations with p-values. res[[1]] gives the p-values of the correlations in a symmetric manner. I used 'cellnote=Pval' (where Pval is the p-value) function of heatmap.2 to add the p-values in the cells of heatmap. So in the heatmap cluster if any cell corresponding the genes do not have significant p-value I will not consider it in the cluster.

0
4.2 years ago by
United States
newlife.well.201410 wrote:

Please check ''caret" package in R and use the following function:

findCorrelation(x, cutoff = .90, verbose = FALSE)

Hop this help.

Thanks for your suggestion, whether it will filter correlations considering only significant P-values..??