Hi everyone,
I am performing WGCNA on bulk RNA sequencing data obtained from a cohort of ~150 human brain samples. I am having trouble interpreting some of the large clusters I obtain from the analysis, and I hope someone can help me with that.
I start the analysis with raw count data, and the first thing I do is remove lowly expressed genes; I keep only genes with CPM > 4 in at least 1/3 of the samples. This leaves me with ~13000 genes, at which point I do TMM normalisation. Then I look for outliers using hierarchical clustering and PCA: I remove 13 samples that cluster away from the others and also show low RIN values.
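For reference, here is a minimal sketch of these filtering/normalisation/outlier steps using edgeR, assuming the raw counts are in a genes-by-samples matrix called `counts` (object names are placeholders, not my actual code):

```r
library(edgeR)

dge <- DGEList(counts = counts)

## keep genes with CPM > 4 in at least one third of the samples
keep <- rowSums(cpm(dge) > 4) >= ceiling(ncol(dge) / 3)
dge  <- dge[keep, , keep.lib.sizes = FALSE]

## TMM normalisation and log2-CPM values
dge    <- calcNormFactors(dge, method = "TMM")
logCPM <- cpm(dge, log = TRUE, prior.count = 1)

## sample outlier check: hierarchical clustering and PCA on samples
sampleTree <- hclust(dist(t(logCPM)), method = "average")
plot(sampleTree)
pca <- prcomp(t(logCPM), scale. = TRUE)
```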
After removing the outliers I perform TMM normalisation again and use the resulting logCPM values as input for the pickSoftThreshold() function (I get a power of around 10 for a signed network type). I then use the function blockwiseModules() with the following arguments (see the sketch after the list):
- maxBlockSize = 14000 (higher than the number of input genes)
- TOMType = 'signed'
- deepSplit = 2
- pamStage = TRUE
- randomSeed = 1234
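Put together, the network construction step looks roughly like this (a sketch assuming `logCPM` is the post-outlier-removal expression matrix with genes in rows, transposed because WGCNA expects samples in rows; power = 10 as reported above):

```r
library(WGCNA)

datExpr <- t(logCPM)   # samples in rows, genes in columns

sft <- pickSoftThreshold(datExpr, networkType = "signed",
                         powerVector = c(1:10, seq(12, 20, 2)))

net <- blockwiseModules(datExpr,
                        power        = 10,
                        networkType  = "signed",
                        TOMType      = "signed",
                        maxBlockSize = 14000,
                        deepSplit    = 2,
                        pamStage     = TRUE,
                        randomSeed   = 1234)

table(net$colors)   # module sizes; "grey" = unassigned genes
```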
I get ~14 good-size modules (100 genes on average) and 5 very large modules. From what I understand, the expected situation is to have one large grey module that encompasses all the unassigned genes, plus smaller modules that separate from it. For this reason I thought that these other large modules in my dataset might be due to some sort of batch effect. I tried to batch correct for a bunch of different variables (RIN, library size, postmortem interval, pH, mapped reads), but none of them 'removes' the high-level correlations that involve more than half of my gene dataset (~7500 genes). I have definitely seen dendrograms that look worse, but this situation still doesn't seem ideal to me, and I would like to know how to interpret these modules.
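For context, the kind of correction I tried is along these lines (a hedged sketch only, assuming a sample annotation data frame `pheno` with columns named RIN, libSize, PMI, pH and mappedReads, which are placeholders; this uses limma::removeBatchEffect to regress out continuous covariates before re-running WGCNA):

```r
library(limma)

## design matrix of continuous covariates (drop the intercept column)
covars <- model.matrix(~ RIN + libSize + PMI + pH + mappedReads,
                       data = pheno)[, -1]

## regress covariates out of the genes-by-samples logCPM matrix
logCPM_adj <- removeBatchEffect(logCPM, covariates = covars)
```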
Thanks in advance for any input!