Is it possible to run a consensus analysis in WGCNA for 3 different datasets?
I have a large RNA-seq dataset that I am trying to run WGCNA on. The data set has many variables including (including 3 brain regions, 3 ages, sex, 2 genotypes.) When running the analysis on the entire data set as a whole pretty much everything was driven by the major differences between brain region. I am more interested in parameters such as sex and genotype so I decided to take a different approach and separate the data into 3 groups based on region and run a consensus analysis. Though I am not sure if this is mathematically possible.
Does this mean I cannot use the regular suggested network construction?
I have successfully loaded all three sets. Or would this be utilized during visualization instead of
1-TOMsimilarityFromExpr
? I have been following tutorial #2 for consesus analysis (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/)If you have limited RAM, you can use
blockwiseConsensusModules
instead ofconsensusTOM
. The output ofconsensusTOM
is just a consensusTOM
that must be converted into a dissimilarity matrix:disTOM = 1 - consensusTOM
in order to proceed with the clustering and module identificationI have 32 gigs of RAM for 19,000 genes and it took a few hours for the
blockwiseConsensusModules
function to run. Eventually it seemed to work but I wasn't sure if this function was appropriate to use for 3 data sets.32 is plenty of RAM for 19k genes. This function can be used on three or more data sets. Always check the description:
I suppose I am slightly confused the difference between
blockwiseConsensusModules
andconsensusTOM
and how they are used in context. Are they interchangeable? I assumed I needed to useblockwiseConsensusModules
to actually construct the network?Do I just use
consensusTOM
as I would for network construction instead ofblockwiseConsensusModules
and then proceed to usedisTOM = 1 - consensusTOM
as follows?net = consensusTOM( multiExpr, power = 6, maxBlockSize = 19000, minModuleSize = 30, deepSplit = 2, pamRespectsDendro = FALSE, mergeCutHeight = 0.25, numericLabels = TRUE, minKMEtoStay = 0, saveTOMs = TRUE, verbose = 5)
dissTOM = 1-consensusTOM(multiExpr, power = 6);
I know it is confusing but keep in mind that in WGCNA the TOM is the network. The functions
consensusTOM
andblockwiseConsensusModules
give you back a TOM but onlyblockwiseConsensusModules
perform the clustering and the module detection.This chunck of code is wrong:
because
minModuleSize, pamRespectsDendro, mergeCutHeight
are not arguments of the functionconsensusTOM
Please, read the description of consensusTOM and blockwiseConsensusModules
Calculation of consensus Topological Overlap with consensusTOM
Clustering and module identification
From here you can proceed with 2.a.7
So what would be the benefit of using
consensusTOM
vsblockwiseConsensusModules
when essentially it seems that they are doing the same thing exceptconsensusTOM
seems to take the additional steps to cluster and identify the modules? IfblockwiseConsensusModules
can be used to achieve the same thing in one step and use less memory why is this not the preferred method?Additionally, following the step-wise network construction- it looks like there is a whole part of the tutorial where there is "Scaling of Topological Overlap Matrices to make them comparable across sets"- is this something that can also be adapted to be used for three data-sets?
I went to the step-wise network construction tutorial and followed each step until the
consensusTOM
and my computer crashed at the following line because it ran out of ram (it crashed while at 30 gigs)This is strange because you managed to get a consensusTOM with
blockwiseConsensusModules
without any problem. Is that right?Can you post the command line of the functions
consensusTOM
withblockwiseConsensusModules
?I started trying to follow the step-wise network construction tutorial and ran:
But did not continue with this strategy because R-studio kept aborting and after checking my computer's memory usage it used 30 gigs of RAM before aborting. So I totally removed the function
consensusTOM
because my computer couldn't get past that line without crashing. After loading all of my data I determined the the soft thresholding I ran theblockwiseConsensusModules
Then it loops and does this for all three sets
In the the first approach
step-wise network construction
R-studio run out of memory because you have 6 matrices (3 adjacencies and 3 TOMs).It doesn't matter at this point because
blockwiseConsensusModules
withmaxBlockSize = 19000
should give you back the same consensus TOM of:Thank you so much for all of your help. I just have one follow up question: One thing that seemed interesting about the step-wise network construction is the fact that there was the opportunity to scale the TOMs. When calculating the soft thresholding I get the following graphs. Clearly the Cerebellum and the S1 have similar "fits" but it looks like the hippocampus has lower power. I chose a power of 6 because it is where they all seem to meet. Do you think this data needs be scaled to account for any bias from that third data set? There doesn't seem to be any outliers in that data when looking at the dendrogram
You have two scaling options in
blockwiseConsensusModules
:single quantile
andfull quantile
. You select the scaling with the argumentnetworkCalibration
. Read the detail to understand what each scaling does.About the power… you said you do not have any evident outlier, then you have 2 huge problems here.
First, the mean connectivity in S1 and C is too high no matter which power you pick. For each of these datasets, you will get one huge cluster of genes up and down-regulated between males and females. The results of a consensus TOM between S1 and C is a single cluster where everything is connected (not even close to a scale-free topology) -> hairball -> not interesting. Second, the hippocampus is rather flat and probably you do not have any expression pattern. So you might have 2 problems 1) no variability in the hippocampus sample; 2) you have a strong driver of variation, probably sex, in S1 and C samples
May I ask you where did you get these data and how were normalized?
My lab ran a time course RNA seq on bulk tissue from our novel Knock out mouse model and we collected 4 samples from each group (including 3 brain regions, 3 ages, sex, 2 genotypes.) The RNA seq was analyzed by a data science core facility using limma voom. The input I used is the log transformed normalized data
First of all, ask for the raw expression data.
Second, I am not familiar with the limma-voom normalization but if it try to model the variance according to the experimental design this would explain why you get these huge single clusters showing opposite expression trends between brain regions (first network approach) or sex (consensus network). You could try DESeq2 varianceStabilizingTransformation with
blind = TRUE
for normalization.Finally, not every data set can analyzed with WGCNA. Given the limited information I have, maybe there is nothing wrong with your data. This is how the data behave and a differential expression analysis is perhaps the best strategy
Cheers
Limma voom does a number of things for normalization that could be problematic. 1) It creates a general linear model for the data which we liked because it helped to take into account the many variables. 2) since samples from different regions often come from the same mouse (i.e. replicate 1 from the cerebellum will often come from the same mouse as the replicate 1 from the hippocampus) so limma voom correlates samples from the same mouse. 3) The "voom" function adds weights to the samples based on how closely they cluster within a PCA
Perhaps if I use the raw expression data and use
fpkm
and log transform it?Yes, you can log transform fpkm data and then proceed with WGCNA or better, use DESeq2 normalization.
One last thing, this post could really help you. See Peter Langfelder answers.
Thank you so much for all of your help! I will try out all of these suggestions!
hello, sorry to piggy back on this discussion but I'm also in the same situation with having three data sets. I want to see whether there are any networks preserved between A in a combination of BC. Is this a possible since the conventional way is to set two setLabels. How does your multiExpr data look like? Thanks
This is just an example