Is there a way to use multiple processors (parallelize) to create a heatmap for a large dataset?
1
0
Entering edit mode
24 months ago
Pratik ▴ 850

I have a gene-gene correlation matrix of about 20,000 genes by 20,000 genes. I am trying to generate a heatmap similar to how JMP can create plots like this:

Image from Sunil Archak (https://sscnars.icar.gov.in/Genetics/11-%20jmp_exp.pdf)

I want to see how the resulting data will look. Unfortunately, as I was typing this, RStudio aborted the session while trying to generate the heatmap. Too much to handle?

Any ideas?

Alternatively, I was advised to plot genes by principal components. I may be leaning towards this approach now... unless someone knows a solution to this. I do have a decent number of cores (16) and a decent amount of RAM (128 GB).

Any help would be appreciated.

Very Respectfully, Pratik

parallel processing heatmap scRNA-seq • 921 views
1
Entering edit mode

If you have access to a cluster, it might help to use that and generate a file output instead of interactive graphic output.
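One way to sketch the "file output instead of interactive graphics" idea, with a small toy matrix standing in for the real 20,000 x 20,000 one:

```r
# Sketch: render the heatmap to a file device rather than RStudio's
# interactive device, which is much heavier for large plots
set.seed(1)
m <- cor(matrix(rnorm(40 * 10), nrow = 10))   # toy 40 x 40 correlation matrix

png("heatmap.png", width = 2000, height = 2000, res = 150)
heatmap(m, symm = TRUE)   # symm = TRUE: the matrix is symmetric
dev.off()                 # close the device so the file is written
```

On a cluster this can run non-interactively via Rscript, and only the finished PNG needs to be pulled back.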

0
Entering edit mode

Thank you for responding, sir. Let me try this...

0
Entering edit mode

Sir, do you know if there is a way to parallelize the process, though? It is not an issue of computing power: the problem is that R and the R packages I am using are not using the full computing power, and nearly the whole server is sitting idle... If there were some way to use all of the processors, it would speed up the process substantially for me.

Very Respectfully, Pratik

1
Entering edit mode

Unfortunately, no. If the package is not designed to use multiple cores, there's not much you can do about it.

1
Entering edit mode

Maybe try standard R on the command line? RStudio uses extra resources unnecessarily.

pvclust can do clustering in a parallelised fashion, but it does not generate heatmaps.
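A minimal sketch of the parallel pvclust route, assuming pvclust >= 2.0 (which added the `parallel` argument) and using a small random matrix as a stand-in for the real expression data:

```r
# Sketch: bootstrapped hierarchical clustering across multiple cores.
# pvclust clusters the COLUMNS of its input, so columns here play the role of genes.
set.seed(1)
expr <- matrix(rnorm(100 * 20), nrow = 100)   # toy: 100 observations x 20 "genes"

if (requireNamespace("pvclust", quietly = TRUE)) {
  fit <- pvclust::pvclust(expr,
                          method.dist   = "correlation",
                          method.hclust = "average",
                          nboot = 100,          # bootstrap replicates
                          parallel = TRUE)      # spread bootstraps over cores
  plot(fit)   # dendrogram with bootstrap p-values -- but no heatmap
}
```

Only the bootstrap resampling is parallelized; the final heatmap rendering would still be single-threaded.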

If you want, start a new session on the command line and show the output of sessionInfo()?

1
Entering edit mode

Thank you for your response, sir. I understand now why @_r_am suggested accessing a cluster for this. This is a huge job even for my personal server, I think? I'm asking the machine to plot a 20,000 x 20,000 gene-gene matrix with colors for the different levels of correlation. The matrix alone is 2.7 GB.

Do you think hierarchical clustering with pvclust, and then using the generated clusters to plot a heatmap/dendrogram, will speed up the computation? Or will hierarchical clustering just be yet another layer of computation on top of the gene-gene correlations?

EDIT: I'm pretty sure this is what I wanted to do from the beginning, I just wasn't sure how... This was also the suggestion I got from a mentor: cluster first and then plot as a dendrogram-heatmap. But I'm wondering how computationally tractable this will be.

I guess I could set it up either way, maybe? Genes by clusters, or genes by genes with cluster labeling? The latter would probably be more computationally heavy. (I think these dendrogram-heatmaps only work on a square matrix, so the latter may be required.)
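The cluster-first-then-plot idea can be sketched in base R: compute the clustering once, then hand the finished dendrogram to the plotting step, so the expensive part is not redone at render time (toy matrix below stands in for the real one):

```r
# Sketch: precompute hierarchical clustering, then reuse it for the heatmap
set.seed(1)
cmat <- cor(matrix(rnorm(60 * 30), nrow = 60))       # toy 30 x 30 gene-gene correlations

hc   <- hclust(as.dist(1 - cmat), method = "average") # distance = 1 - correlation; done once
dend <- as.dendrogram(hc)

heatmap(cmat, Rowv = dend, Colv = dend, symm = TRUE)  # rendering reuses the precomputed order
clusters <- cutree(hc, k = 4)                         # cluster labels for annotating an axis
```

Clustering is the O(n^2 log n)-ish step here, so caching `hc` and reusing it across plots is where the savings come from.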

The next challenge, I think, would be figuring out how to take the cluster assignments from pvclust, group one of the 20,000 x 20,000 axes by those cluster assignments, and then use a package like gplots' heatmap.2 to create a beautiful plot like this:

Image obtained from: https://doi.org/10.3389/fnhum.2015.00440

Please correct me if I'm wrong on this! Or if you have any pointers on this process, please.

Here is my sessionInfo():

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.2


I know it's not a purely command-line server (it's running Ubuntu desktop), but it does have some decent computing power.

Thank you again @_r_am and @Kevin Blighe.

Very Respectfully, Pratik

1
Entering edit mode

See if you can do that with ComplexHeatmap. It's quite extensible and might allow the clustering and dendrograms to be precomputed - or you should at least be able to pre-compute the row and column orders and separate the clustering computation from the graphical rendering time.
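A hedged sketch of that separation with ComplexHeatmap (assuming it is installed from Bioconductor; toy matrix as stand-in): `Heatmap()` accepts a precomputed dendrogram for `cluster_rows`/`cluster_columns`, so the clustering cost is paid once, outside the rendering call.

```r
# Sketch: precompute the clustering, then pass it to ComplexHeatmap so drawing
# the plot does not redo the expensive step
set.seed(1)
cmat <- cor(matrix(rnorm(60 * 30), nrow = 60))        # toy 30 x 30 correlation matrix
hc   <- hclust(as.dist(1 - cmat), method = "average") # computed once, reusable

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  ht <- ComplexHeatmap::Heatmap(cmat,
          cluster_rows    = as.dendrogram(hc),  # precomputed dendrograms
          cluster_columns = as.dendrogram(hc),
          show_row_names = FALSE, show_column_names = FALSE)
  ComplexHeatmap::draw(ht)
}
```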

1
Entering edit mode

Thank you very much sir! I will try this!

Very Respectfully, Pratik

3
Entering edit mode
24 months ago
Pratik ▴ 850

Just thought I'd post this as my solution, in case anyone is stuck in a similar situation.

How do I subset a Seurat object using variable features?

I know it's common practice when analyzing data (i.e., taking a sample of the larger dataset to get a gist of the data). Some sort of heatmap-dendrogram is better than no heatmap-dendrogram!!

So rather than using all 20,000 x 20,000 genes, I used Seurat to take the top 1,000 variable features, made the corresponding gene-gene correlation matrix with cor() after transposing the table, and then plotted a "rough draft" using the base-R heatmap() function:

## Heatmap of top 1000 variable features by top 1000 variable features
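A base-R sketch of the cor-then-heatmap step described above, with a small toy matrix standing in for the top 1,000 variable features (the Seurat subsetting itself follows the linked post):

```r
# Toy stand-in: rows = variable genes, columns = cells; the real matrix would
# come from subsetting the Seurat object to its top 1000 VariableFeatures
set.seed(42)
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20)))

gene_cor <- cor(t(expr))        # transpose so cor() correlates genes, not cells
heatmap(gene_cor, symm = TRUE)  # rough-draft base-R dendrogram-heatmap
```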

I do plan to use ComplexHeatmap as suggested by @_r_am to generate a prettier dendrogram-heatmap.

Hope someone finds this useful!

Very Respectfully, Pratik