I'm doing RNA-seq analysis. I have an RPKM matrix [genes * samples], where the samples are several time points.
Now I want to cluster genes by their expression pattern across the time series. To determine the optimal number of clusters for k-means, I tried NbClust (an R package), but it failed when I ran the code below:
System is computationally singular on solve()
    set.seed(123)
    RPKM_clust_db <- NbClust(RPKM_log2_de0_scale, diss = dist_pe, distance = NULL,
                             min.nc = 2, max.nc = 100, method = "kmeans",
                             index = "all", alphaBeale = 0.1)
- dist_pe is the Pearson correlation distance I calculated beforehand, which is why the "distance" argument is set to NULL.
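For reference, a Pearson correlation distance of this kind could be built roughly as follows. This is only a sketch: the post does not show how dist_pe was actually computed, and `expr` and `dist_pe_sketch` are made-up names on toy data.

```r
# Toy stand-in for the scaled genes x samples matrix (hypothetical data)
set.seed(1)
expr <- matrix(rnorm(6 * 5), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("t", 1:5)))

# cor() correlates columns, so transpose to correlate genes (rows)
gene_cor <- cor(t(expr), method = "pearson")

# 1 - r maps r = 1 to distance 0 and r = -1 to distance 2
dist_pe_sketch <- as.dist(1 - gene_cor)
```

The result is a "dist" object with one entry per gene pair, which is the form the `diss` argument expects.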
An error, along with several warnings, stopped the run:
    Error in solve.default(W) :
      system is computationally singular: reciprocal condition number = 2.65874e-17
    In addition: There were 15 warnings (use warnings() to see them)
According to the answers to a similar question (https://stackoverflow.com/questions/36403293/system-is-computationally-singular-error-when-i-use-winsorize), the problem may come from the function solve():
The error probably occurs because you included some variables/columns that are very highly correlated, or rather, they are linear combinations of each other. You may want to check if you have duplicated variables or variables that are transformations of each other.
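The check suggested in that answer can be sketched on toy data. One possible cause, which is a guess and not confirmed by the post, is that the `_scale` suffix means each gene was z-scored across samples. In that case every row sums to zero, so the sample columns are linearly dependent and any scatter matrix built from them is singular. The names `expr`, `expr_scaled` and `W` below are illustrative, and `W` is the total scatter matrix used as a stand-in for the within-cluster matrix that NbClust passes to solve().

```r
set.seed(1)
expr <- matrix(rnorm(200 * 6), nrow = 200)   # hypothetical: 200 genes x 6 time points

# Per-gene z-scoring: scale() works on columns, hence the double transpose
expr_scaled <- t(scale(t(expr)))

# Column rank before and after scaling
rank_raw    <- qr(expr)$rank         # 6: full column rank
rank_scaled <- qr(expr_scaled)$rank  # 5: one column is a combination of the others

# Scatter matrix of the scaled data and its reciprocal condition number
W <- crossprod(scale(expr_scaled, center = TRUE, scale = FALSE))
rc <- rcond(W)                       # close to zero, so solve(W) would fail

# Pairwise correlation between samples, to spot near-duplicate columns
sample_cor <- cor(expr_scaled)
```

If `qr(RPKM_log2_de0_scale)$rank` is smaller than the number of samples, the dependence is between sample columns, not between genes.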
- In that case, if highly correlated columns are the problem, how can this package be used on RNA-seq data at all? Surely many genes keep a similar expression level across some of the time points.
Could you give me some advice on how to modify my code, or on how to use NbClust properly? Is NbClust even a good choice here, or is there another way to estimate the optimal number of clusters for time-series RNA-seq data like this?
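As one example of an alternative, the average silhouette width can be computed for a range of k without inverting any matrix. The sketch below is not a recommendation from any answer: it uses made-up toy data, hypothetical names (`expr`, `avg_sil`, `best_k`), and assumes the cluster package is installed.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Hypothetical toy data: 3 groups of 40 "genes" with distinct profiles over 6 time points
centers <- rbind(c(-1, -1,  0,  1,  1,  0),
                 c( 1,  0, -1, -1,  0,  1),
                 c( 0,  1,  1,  0, -1, -1))
expr <- centers[rep(1:3, each = 40), ] + matrix(rnorm(120 * 6, sd = 0.3), nrow = 120)

# Pearson correlation distance between genes
d <- as.dist(1 - cor(t(expr)))

# Run k-means for each candidate k and score it by average silhouette width
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(expr, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
```

NbClust's `index` argument also accepts a single index name such as "silhouette" in place of "all", which might be worth trying. Whether that avoids the solve() error here is untested.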