Hi there, I have two data sets of genes (Exprdata_H and Exprdata_P), which I used to construct a consensus WGCNA following the tutorial by P. Langfelder and S. Horvath. Prior to generate the multiExpr data set, I separately filtered for both data sets the highly variable genes using the following code (for dataset Exprdata_H only):
log2 transform (x+1 to avoid zeroes)
log2Exprdata_H = log2(Exprdata_H+1)
Extract the genes with the highest variance (upper 75 percentile), use package genefilter.
library(genefilter)
rv = rowVars(log2Exprdata_H)
q75 = quantile(rowVars(log2Exprdata_H), .75)
q75log2Exprdata_H = log2Exprdata_H[rv > q75, ]
This resulted in 7100 genes for both data sets, which I then combine in one multiExpr set:
exprSize = checkSets(multiExpr)
$nSets
[1] 2
$nGenes
[1] 7100
$nSamples
[1] 108 72
$structureOK
[1] TRUE
This looks good. However, when I checked the gene IDs which are assigned to the consensus modules, it seems that the genes of only one data set are used but not the ones from the second data set in the MultiExpr set. (If I check the overlap between the two sets of 7100 highly variable gene sets, I obtain an overlap of ~5000 genes and 2100 genes specific to one of each data set. The 2100 genes of the original dataset H are not present in any module, while some of the 2100 P-specific genes are present in some modules). Could this be possible or do I have a mistake somewhere in my R script? Any suggestions or hints are very welcome! Thanks, Jutta