I'm trying to derive a measure of tumour heterogeneity in scRNA-seq data. For a given individual's scRNA-seq gene expression matrix, I would like to calculate the Pearson correlations within and between clusters (compare the average cell-to-cell correlation within clusters and the average cell-to-cell correlation between clusters). I am using the cor() function from the R 'stats' package and using the log-normalized gene expression matrix as input:

c2.dat <- data.frame(c2@assays$RNA@data)   # gene expression matrix for a subject's cells in "cluster 2"
c2.cor <- cor(c2.dat, method = "pearson")  # correlation analysis on log-normalized gene expression matrix

I am stuck though once I have a correlation matrix. How do I calculate the average cell-to-cell correlation within this cluster?

Here is a small code sample that you can run to plot the distribution of intra-cluster correlations for each cluster.
However, I suggest running PCA on the expression matrix first and then calculating the cell-to-cell correlation on the projection of the cells into PC space (e.g. the first 20 PCs). This removes redundant features and reduces noise.
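
As a rough illustration of the PCA suggestion, here is a minimal sketch using base R's prcomp(). The toy matrix `expr`, its dimensions and the choice of 20 PCs are assumptions made for the example only; with real data you would pass your own log-normalized genes-by-cells matrix.

```r
set.seed(1)
# Toy genes-by-cells matrix (500 genes, 40 cells); stands in for the log-normalized data
expr <- matrix(rnorm(500 * 40), nrow = 500, ncol = 40)
colnames(expr) <- paste0("cell_", 1:40)

# prcomp() expects observations in rows, so transpose to cells-by-genes
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE, rank. = 20)

# pca$x holds the cell coordinates (cells x PCs);
# correlate cells across their 20 PC scores instead of across all genes
pc_scores <- pca$x
cor_pca <- cor(t(pc_scores))
dim(cor_pca)
```

The resulting `cor_pca` is a cell-by-cell matrix and can be used in place of the gene-based correlation matrix in the steps that follow.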

You will need a few libraries:

library(ggplot2)
library(tidyr)

This simulates an expression matrix with two distinct cell populations of differing heterogeneity:

# Mimic two cell clusters with different variability (heterogeneity)
# 125 cells sharing a gene-wise shift (rows 301-500 are raised by 1), so they correlate with each other
homogenous = matrix(rnorm(n = 125*500), nrow = 500, ncol = 125) +
    matrix(c(rep(0, 500*75), rep(1, 500*50)), byrow = TRUE, nrow = 500, ncol = 125)
# 75 cells of pure noise, with no shared signal
heterogenous = matrix(rnorm(n = 75*500), nrow = 500, ncol = 75)

mat <- cbind(homogenous, heterogenous)

colnames(mat) = paste0("cell_",1:200)

Calculate correlation, as you did:

cor_mat <- cor(mat)

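Generate a metadata data.frame with 2 clusters

# Mimic two cell clusters
# Cluster labels are an assumption here: the 125 homogeneous cells form cluster_1,
# the 75 heterogeneous cells form cluster_2
cluster_df <- data.frame(cell_id = paste0("cell_",1:200),
                         cluster = rep(c("cluster_1", "cluster_2"), times = c(125, 75)))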

Create a cell-to-cell correlation data.frame with the tidyr::pivot_longer function, which gives the correlation score of any given "cell of origin" with any "other cell"

# Long-format table: one row per (cell_of_origin, other_cell) pair
cor_df <- as.data.frame(cor_mat)
cor_df$cell_of_origin <- rownames(cor_mat)
cor_df <- tidyr::pivot_longer(cor_df, cols = seq_len(ncol(cor_mat)),
                              names_to = "other_cell", values_to = "correlation")

Remove self-correlations (e.g. cell_1 with cell_1), as they are always 1

# Logical indexing is safer than -which(...), which would drop every row if nothing matched
cor_df <- cor_df[cor_df$cell_of_origin != cor_df$other_cell,]

Add cluster information (cluster of the cell of origin & cluster of the other cell)

cor_df$cell_of_origin_cluster <- cluster_df$cluster[match(cor_df$cell_of_origin,cluster_df$cell_id)]
cor_df$other_cell_cluster <- cluster_df$cluster[match(cor_df$other_cell,cluster_df$cell_id)]

Keep only the pairs of cells that belong to the same cluster

intra_corr <- cor_df[cor_df$cell_of_origin_cluster==cor_df$other_cell_cluster,]
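
To get the single number per cluster that the question asks for (the average cell-to-cell correlation within a cluster), one option is to average the off-diagonal entries of the correlation matrix restricted to that cluster. The following is a minimal base-R sketch on toy data; `mean_intra_cor`, `toy` and `clusters` are hypothetical names introduced for illustration only.

```r
set.seed(1)
# Toy data: 100 genes x 30 cells; cluster "a" shares a gene-wise signal, cluster "b" is pure noise
signal <- rnorm(100)
toy <- cbind(matrix(rnorm(100 * 15), nrow = 100) + signal,
             matrix(rnorm(100 * 15), nrow = 100))
colnames(toy) <- paste0("cell_", 1:30)
clusters <- rep(c("a", "b"), each = 15)

toy_cor <- cor(toy)

# Hypothetical helper: mean of the pairwise correlations among the cells of one cluster
mean_intra_cor <- function(cor_mat, in_cluster) {
  sub <- cor_mat[in_cluster, in_cluster, drop = FALSE]
  mean(sub[upper.tri(sub)])  # upper triangle: each pair once, diagonal excluded
}

intra_means <- sapply(unique(clusters),
                      function(cl) mean_intra_cor(toy_cor, clusters == cl))
intra_means
```

With the long-format table built above, aggregate(correlation ~ cell_of_origin_cluster, data = intra_corr, FUN = mean) should give the same per-cluster averages, since each pair simply appears twice there.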

Violin plot of intra correlation distribution by cluster

ggplot(intra_corr,aes(x = cell_of_origin_cluster,y=correlation, fill = cell_of_origin_cluster)) + 
    geom_violin() + theme_classic() + geom_jitter(size=0.2)

For inter-cluster correlation, you can do the same, selecting only cells that don't belong to the same clusters, but this makes sense only if you have more than 2 clusters:

inter_corr <- cor_df[cor_df$cell_of_origin_cluster!=cor_df$other_cell_cluster,]

# Violin plot of inter-cluster correlation distribution by cluster of the cell of origin
ggplot(inter_corr) + geom_violin(aes(x = cell_of_origin_cluster,y=correlation,
                                     fill = cell_of_origin_cluster)) + theme_classic()
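
If you want to compare the two averages directly (within-cluster versus between-cluster), the same masking idea can be applied to the correlation matrix itself. This is only a sketch on simulated data; `toy`, `clusters`, `same_cluster` and the other names are assumptions made for the example.

```r
set.seed(2)
# Toy data: 100 genes x 30 cells; cluster "a" shares a gene-wise signal, cluster "b" is pure noise
signal <- rnorm(100)
toy <- cbind(matrix(rnorm(100 * 15), nrow = 100) + signal,
             matrix(rnorm(100 * 15), nrow = 100))
clusters <- rep(c("a", "b"), each = 15)
toy_cor <- cor(toy)

# TRUE where the two cells of a pair carry the same cluster label
same_cluster <- outer(clusters, clusters, "==")
# Upper triangle only: each pair counted once, self-correlations excluded
off_diag <- upper.tri(toy_cor)

mean_within  <- mean(toy_cor[off_diag & same_cluster])
mean_between <- mean(toy_cor[off_diag & !same_cluster])
c(within = mean_within, between = mean_between)
```

How far the within-cluster average sits above the between-cluster average can then be compared across individuals, though whether that gap is a suitable heterogeneity measure for your data is a judgment call.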

Wow thank you so much! This is exactly what I was looking to do! Very much appreciated :)

