Help selecting WGCNA soft thresholding power
1
0
Entering edit mode
26 days ago
jfaberha ▴ 40

Hi, I'm attempting to construct a consensus WGCNA network for 4 related expression datasets. For a signed network, two of the datasets achieve a high scale-free topology very early, but for the other two they don't reach 0.8 until a very high power.

Here, the last of four datasets achieves a scale-free topology at 28, although there appears to be a slight peak in this metric at 17.

A power of 28 seems high to me (although I may be wrong), especially when the WGCNA FAQ recommends choosing a maximum power of 18 for signed datasets when presented with a lack of scale free topology. Is there any argument to be made for 17 since the graph shows that slight peak? https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html

If it makes more sense to consider our multi-expression data to lack scale-free topology, suggesting we choose a power from the table below, how should we determine our sample size from column 1?

Should we consider our total number of samples across all 4 datasets (>70), or should we choose the average, minimum, or maximum number of samples per dataset (<20)?

In a related question, does the scale-free topology look better if we use an unsigned network (please ignore the graphical numbering glitch)?

Based on these results, it seems we should choose a power of 11 since it is the lowest power where all 4 datasets have a scale-free topology >0.8.

If we were to choose an unsigned network instead, is it valid to run downstream statistics on eigengene values from an unsigned network even though modules would contain sets of genes with both positive and negative correlations? Since the eigengene values represent principal component 1 for a given module, I assume they represent the strongest expression signal within a module and would not combine signals from both positively and negatively correlated gene sets. Am I wrong in this assumption? Are the genes that contribute to PC1 in each module held consistent when producing eigengene values across all samples and datasets?

Thank you for all your help!

rna-seq R WGCNA transcriptome • 1.2k views
0
Entering edit mode

Hi,

Can you post the PCA plot?

0
Entering edit mode

Sure, here it is. By the way, this is based on a VST normalized dataset from DESeq2. Is it possible a r-log transformation could yield a better scale-free topology? The PCA for an r-log transformed dataset looks nearly identical.

Here, PC1 explains differences in tissue and PC2 explains genetic lineage, so the results capture the expected drivers in variation between datasets. There is more variation within intestines than brains, but perhaps that is also expected as well since intestine would be more impacted by external stimuli via diet.

0
Entering edit mode

First though, If the original plan was to build a network for each group of samples I am afraid you can't do that because after removing some of the samples that look like outliers in the LS.INT and WS.INT you will not have enough samples for network analysis. Second thought, there seems to be a lot of within-group variability in the LS.INT and WS.INT groups which could negatively affect the dispersion estimates. The only solution to these problems is to

1. Subset the dataset in two groups: INTESTINE and BRAIN
2. Run the VST for each group
3. Check for the presence of outliers in each group
4. Build a consensus network of the INTESTINE and BRAIN samples by using the function consensusTOM

From the consensusTOM, you can still get modules of co-expressed genes between the 4 groups of samples

0
Entering edit mode

Thank you for your thoughts. We are perhaps more interested in the differences between LS and WS, while differences between intestine and brain are interesting but more serve as a sort of reciprocal control to put differences between LS and WS in context. For example, if we see the same module variance between WS and LS in both tissues we assume these could be system-wide expression patterns or if we see an interaction between both variables we assume expression differences between LS and WS could be tissue-specific.

Fortunately, we weren't intending on making 4 separate networks. We planned to make a single consensus network using all four individual datasets, then compare module preservation between each of the four pairs of datasets and run DE-like statistics on the resulting module eigengenes. Will building a single consensus network from intestine and brain datasets differ from a consensus network from all 4 datasets? If we build a consensus network from just 2 datasets (intestine and brain) instead of 4, would our above experimental question and proposed analyses still be valid/possible?

I should also mention this is a filtered dataset. The VST was first run using all genes for all samples, but we then removed genes that didn't have raw counts of at least 10 copies in every single sample. This was done to ensure that any significant results and observed interactions were not primarily driven by lack of expression in one of the two tissue types. Is this a reasonable way to proceed?

*We also already remove some outlier samples based on the sample trees. The remaining "outliers" did not seem so different from others within their dataset based on these trees.

0
Entering edit mode

Fortunately, we weren't intending on making 4 separate networks. We planned to make a single consensus network using all four individual datasets, then compare module preservation between each of the four pairs of datasets and run DE-like statistics on the resulting module eigengenes. Will building a single consensus network from intestine and brain datasets differ from a consensus network from all 4 datasets? If we build a consensus network from just 2 datasets (intestine and brain) instead of 4, would our above experimental question and proposed analyses still be valid/possible?

You could build a consensus network from the 2 datasets (intestine and brain) (be sure that the intestine and brain dataset shares the same genes) and then a tissue-specific network only with the brain samples. Then, for the preservation analysis, you can use the brain-specific network as a reference to check if the brain.WS or brain.LS modules are also preserved in the consensus network.

I should also mention this is a filtered dataset. The VST was first run using all genes for all samples, but we then removed genes that didn't have raw counts of at least 10 copies in every single sample. This was done to ensure that any significant results and observed interactions were not primarily driven by lack of expression in one of the two tissue types. Is this a reasonable way to proceed?

Sounds good to me. You should first remove the low count genes and then proceed with the VST normalization.

0
Entering edit mode

You could build a consensus network from the 2 datasets (intestine and brain) (be sure that the intestine and brain dataset shares the same genes) and then a tissue-specific network only with the brain samples. Then, for the preservation analysis, you can use the brain-specific network as a reference to check if the brain.WS or brain.LS modules are also preserved in the consensus network.

Thank you for your response, and sorry for my delayed reply, but may I ask for some clarifications on this point? Do you mean to say that we could only do an WS vs. LS preservation rank comparison for the brain in this way, but not for intestine, and that the intestine samples must be grouped for the creation of a valid network? WGCNA is still a little confusing for me, but how might a consensus network between two datasets differ from a consensus network using four datasets if the exact same samples are used each time?

0
Entering edit mode

Could you share the normalized expression matrix of all the genes shared between the datasets so I can show you all the possible comparison you can do in the analysis? You can replace the gene names with a fake gene id.

0
Entering edit mode

I've uploaded the subsetted datasets for each of the conditions, split up by tissue only and alternatively by both tissue and stock (LS vs. WS). This folder has both the VST expression matrices and colData for each subset, but if you prefer all samples together you can do an rbind/cbind with the "vst.BRN" and "vst.INT" datasets. Hope this format works for you, and thanks again for you help.

0
Entering edit mode

I went ahead and subsetted the dataset by tissue type and to see if there's as much within group variability and overlap without the overwhelming signal of PC1 (from the full dataset). It maybe looks a little better for intestine samples this way, but looking at the unfiltered dataset gives us a much better separation of LS and WS, except for possibly one remaining outlier.

The resulting scale-free analysis from the unfiltered dataset also looks slightly better, with all 4 samples crossing the 0.8 threshold at a power of 12 for a signed network (again, ignore the weird numbering glitch).

Unsigned scale free analysis is a little messier, with both intestine datasets approaching 0.8 at a power of 9 and finally surpassing it at 15 or 16.

If we ran this analysis on all genes we expect it would give us false positive expression differences for eigengenes due to the fact many genes are likely tissue-specific, since we are mainly targeting interactions. Therefore, analysis on the filtered dataset is preferred, but if it's statistically dubious we can run your proposed pipeline or deal with false positives in a different way. Thoughts?

0
Entering edit mode

If we ran this analysis on all genes we expect it would give us false positive expression differences for eigengenes due to the fact many genes are likely tissue-specific, since we are mainly targeting interactions. Therefore, analysis on the filtered dataset is preferred, but if it's statistically dubious we can run your proposed pipeline or deal with false positives in a different way. Thoughts?

As far as I can tell you, all the dataset included in the WGCNA analysis must share the same genes.

1
Entering edit mode
16 days ago

Here we go.

A word of caution... you seems to have a lot of variation especially in the INT dataset that is not explained by the WS-LS traits. This is why I get a lot of modules. If you have other traits to include in the analysis please, do so.

library(WGCNA)
options(stringsAsFactors = FALSE)


Prepare multiExpr

vstINT <- read.csv("gene_count_matrix_filt_vst.INT.csv", row.names=1)
nSets = 2;
multiExpr = list()
multiExpr[[1]] = list(data = t(vstBRN))
multiExpr[[2]] = list(data = t(vstINT))

setLabels = c("BRN", "INT")
names(multiExpr) = setLabels
lapply(multiExpr, lapply, dim)

# Check that the data has the correct format for many functions operating on multiple sets:
exprSize = checkSets(multiExpr)


Calculate the power for signed network

powers = c(seq(4,10,by=1), seq(12,20, by=2))
# Initialize a list to hold the results of scale-free analysis
powerTables = vector(mode = "list", length = nSets)
# Call the network topology analysis function for each set in turn
for (set in 1:nSets)
powerTables[[set]] = list(data = pickSoftThreshold(multiExpr[[set]]$data, powerVector=powers, networkType = "signed", verbose = 2)[[2]])  Use your code to plot the results of 'pickSoftThreshold' (powerTables): Calculate the consensusTOM consensusTOM has a lot of arguments. I used the default ones but please have a look at the usage consTOM <- consensusTOM(multiExpr, checkMissingData = TRUE, maxBlockSize = Inf, nPreclusteringCenters = NULL, randomSeed = 12345, corType = "pearson", maxPOutliers = 0.05, quickCor = 0, pearsonFallback = "individual", power = 9, networkType = "signed", checkPower = TRUE, TOMType = "signed", TOMDenom = "min", saveIndividualTOMs = TRUE, individualTOMFileNames = "individualTOM-Set%s-Block%b.RData", networkCalibration = "full quantile", saveCalibratedIndividualTOMs = TRUE, calibratedIndividualTOMFilePattern = "calibratedIndividualTOM-Set%s-Block%b.RData", calibrationQuantile = 0.95, sampleForCalibration = TRUE, sampleForCalibrationFactor = 5000, getNetworkCalibrationSamples = FALSE, consensusQuantile = 0, useMean = FALSE, setWeights = NULL, saveConsensusTOMs = TRUE, consensusTOMFilePattern = "consensusTOM-Block%b.RData", returnTOMs = TRUE, nThreads = 10, verbose = 5) TOM<-as.matrix(consTOM[["consensusTOM"]][[1]]) # get the consensus TOM rownames(TOM)<-colnames(multiExpr$BRN$data) colnames(TOM)<-colnames(multiExpr$BRN$data)  Get the consensus modules consTree = hclust(as.dist(1-TOM), method = "average"); minModuleSize = 30; Labels = cutreeDynamic(dendro = consTree, distM = 1-TOM, deepSplit = 2, cutHeight = 0.995, minClusterSize = minModuleSize, pamRespectsDendro = FALSE ); Colors = labels2colors(Labels) plotDendroAndColors(consTree, Colors, "Dynamic Tree Cut", dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)  Follows part 2.. ADD COMMENT 1 Entering edit mode Check which consensus modules correlate with the brain and intestine traits (WS and LS) data traitBRN <- read.csv("colData_filt_vst.BRN.csv", row.names=1, sep=";") #this is a binary table (1,0) of the colData_filt_vst.BRN head(traitBRN) WS LS BRN_LSB14_02 0 1 BRN_LSB14_03 0 1 BRN_LSB14_04 0 1 BRN_LSB14_05 0 1 BRN_LSB14_06 0 1 BRN_LSB14_08 0 1 traitINT <- read.csv("colData_filt_vst.INT.csv", row.names=1, sep=";") #this is a binary table (1,0) of the colData_filt_vst.INT MEs_BRN = moduleEigengenes(multiExpr$BRN$data, Colors)$eigengenes
MEs_INT = moduleEigengenes(multiExpr$INT$data, Colors)$eigengenes moduleTraitCor_BRN = cor(MEs_BRN, traitBRN, use = "p") moduleTraitCor_INT = cor(MEs_INT, traitINT, use = "p") BRN_nSamples = nrow(multiExpr$BRN$data) INT_nSamples = nrow(multiExpr$INT$data) moduleTraitPvalue_BRN = corPvalueStudent(moduleTraitCor_BRN, BRN_nSamples) moduleTraitPvalue_INT = corPvalueStudent(moduleTraitCor_INT, INT_nSamples) # Initialize matrices to hold the consensus correlation and p-value consensusCor = matrix(NA, nrow(moduleTraitCor_BRN), ncol(moduleTraitCor_BRN)); consensusPvalue = matrix(NA, nrow(moduleTraitCor_BRN), ncol(moduleTraitCor_BRN)); # Find consensus negative correlations negative = moduleTraitCor_BRN < 0 & moduleTraitCor_INT < 0; consensusCor[negative] = pmax(moduleTraitCor_BRN[negative], moduleTraitCor_INT[negative]); consensusPvalue[negative] = pmax(moduleTraitPvalue_BRN[negative], moduleTraitPvalue_INT[negative]); # Find consensus positive correlations positive = moduleTraitCor_BRN > 0 & moduleTraitCor_INT > 0; consensusCor[positive] = pmin(moduleTraitCor_BRN[positive], moduleTraitCor_INT[positive]); consensusPvalue[positive] = pmax(moduleTraitPvalue_BRN[positive], moduleTraitPvalue_INT[positive]); textMatrix = paste(signif(consensusCor, 2), "\n(", signif(consensusPvalue, 1), ")", sep = ""); dim(textMatrix) = dim(moduleTraitCor_BRN) sizeGrWindow(10,7) par(mar = c(6, 8.8, 3, 2.2)); labeledHeatmap(Matrix = consensusCor, xLabels = names(traitBRN), yLabels = names(MEs_INT), ySymbols = names(MEs_INT), colorLabels = FALSE, colors = blueWhiteRed(50), textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim = c(-1,1), main = paste("Consensus module--trait relationships across\n", paste(setLabels, collapse = " and ")))  you can see that the light-green and light-cyan consensus modules includes genes that are highly related to LS-WS in both sets (BRN and INT). See this tutorial for a full explanation of the analysis Heatmap of the consensus light-cyan and light-green modules in the BRN and INT expression dataset library(gplots) col_scale = colorpanel(100, 'purple','black','yellow') heatmap.2(t(multiExpr$INT$data[,Colors=="lightcyan"]), scale = "row", col=col_scale, density.info ="none", trace="none", cexCol=0.7, cexRow=0.8, margin=c(19,11), main = "INT lightcyan module", Colv = FALSE) heatmap.2(t(multiExpr$BRN$data[,Colors=="lightcyan"]), scale = "row", col=col_scale, density.info ="none", trace="none", cexCol=0.7, cexRow=0.1, margin=c(19,11), main = "BRN lightcyan module", Colv = FALSE) heatmap.2(t(multiExpr$INT$data[,Colors=="lightgreen"]), scale = "row", col=col_scale, density.info ="none", trace="none", cexCol=0.7, cexRow=0.8, margin=c(19,11), main = "INT lightgreen module", Colv = FALSE) heatmap.2(t(multiExpr$BRN$data[,Colors=="lightgreen"]), scale = "row", col=col_scale, density.info ="none", trace="none", cexCol=0.7, cexRow=0.8, margin=c(19,11), main = "BRN lightgreen module", Colv = FALSE)  Calculate and overlap contingency table to check if the WS and LS modules in the consensus network overlap with the WS-LS modules of the tissue-specific network (e.g BRAIN (BRN) network). See the full tutorial here (section 5d) load("individualTOM-Set1-Block1.RData") # upload the BRN TOM TOMbrn<-as.matrix(tomDS) # TOM rownames(TOMbrn)<-colnames(multiExpr$BRN$data) colnames(TOMbrn)<-colnames(multiExpr$BRN$data) # get the modules for the BRN network Tree_BRN = hclust(as.dist(1-TOMbrn), method = "average"); minModuleSize = 30; Labels_BRN = cutreeDynamic(dendro = Tree_BRN, distM = 1-TOMbrn, deepSplit = 2, cutHeight = 0.995, minClusterSize = minModuleSize, pamRespectsDendro = FALSE ); Colors_BRN = labels2colors(Labels_BRN) # create a contingency table overlap = overlapTable(Colors, Colors_BRN); numMat = -log10(overlap$pTable);
numMat[numMat >50] = 50;
textMat = paste(overlap$countTable, "\n", signif(overlap$pTable, 2));
dim(textMat) = dim(numMat)
xLabels = paste("M", sort(unique(Colors_BRN)));
yLabels = paste("M", sort(unique(Colors)));
xSymbols = paste(sort(unique(Colors_BRN)), ": ", table(Colors_BRN), sep = "")
ySymbols = paste(sort(unique(Colors)), ": ", table(Colors), sep = "")

# plot the contingency table
fp = FALSE;
tiff("contingency.tiff", res = 100,  width = 27, height = 19, units = 'in')
fcex = 1.00;
pcex = 1.0
fcexl = 1.00;
pcexl = 1.00;
par(mar = c(15, 15, 2, 1.0));
labeledHeatmap(Matrix = numMat,
xLabels = xLabels, xSymbols = xSymbols,
yLabels = yLabels, ySymbols = ySymbols,
colorLabels = TRUE,
colors = blueWhiteRed(100)[50:100],
textMatrix = textMat, cex.text = if (fp) fcex else pcex, setStdMargins = FALSE,
cex.lab = if (fp) fcexl else pcexl,
xColorWidth = 0.08,
main = "C. BRN (rows) vs. CONS modules (columns)", cex.main = 1.2);
dev.off()


Sorry for the bad quality of the image but you can see that 102 out 196 genes of the light-green module in the consensus network overlap with the red module (667 genes) in the BRN-network. Same for the light-cyan module. 52 and 47 genes overlap respectively with the pink and yellow modules of the BRN-net.

Lets focus on the red module and have a look at its expression profile

heatmap.2(t(multiExpr$BRN$data[,Colors_BRN=="red"]), scale = "row",
col=col_scale, density.info ="none", trace="none",
cexCol=0.7, cexRow=0.8, margin=c(19,11), main = "BRN-net red module", Colv = FALSE)


It is pretty clear from the heatmap that the red module of BRN-net will correlate significantly with the LS-WS traits. Try to calculate the module-trait relationship for the BRN-net by yourself.

0
Entering edit mode

Wow, thank you for all your work on this response. You went above and beyond. This is extremely helpful and introduced some new analyses that we can do. This is a good starting point, and beyond this we are most interested in module preservation between all four datasets. For example, we want to look at module preservation between:

WS.BRN vs WS.INT, LS.BRN vs LS.INT, WS.BRN vs LS.BRN, WS.INT vs LS.INT

From here, we wanted to look at the differences in ranks when plotted against each other, and look at the size of residuals. We were hoping that the size of the residuals could highlight tissue-specific network differences, stock specific differences, and modules that differ in both tissue and stock.

I've already conducted this preservation analysis with the 4-dataset multiExpr dataset, although I wasn't sure if the resulting consensus network was valid given the power analyses as shown above (these are the results of a signed network, with a majority of genes in the grey "module"). Is such a thing possible using the consensus network you suggest, built from just two datasets (BRN & INT)?

Furthermore, we do have another variable (time series). This variable explains very little variation relative to tissue (BRN/INT) and stock (LS/WS) in PCAs, but we will be using this to find modules that respond to a three way interaction of time-point:tissue:stock via an LRT {lmtest} with module eigengenes. Perhaps this will have to be the primary analysis if the above module preservation analysis isn't possible given our variance in the INT samples. If you can think of anything else we should do with time-point earlier in the WGCNA pipeline beyond module-trait correlations, we would be happy to hear it. Thanks again.

1
Entering edit mode

For example, we want to look at module preservation between:

WS.BRN vs WS.INT, LS.BRN vs LS.INT, WS.BRN vs LS.BRN, WS.INT vs LS.INT

I've already conducted this preservation analysis with the 4-dataset multiExpr dataset, although I wasn't sure if the resulting consensus network was valid given the power analyses as shown above (these are the results of a signed network, with a majority of genes in the grey "module")

You can't build a WS.BRN network or any network for each of the four datasets because the modules you will get from these are the results of an unwanted/unknown source of variations, or just random noise (that would explain why most genes cluster in the grey module). Why don't you look at the heatmap of the modules of the WS.INT network and tell me if their expression profiles make sense to you.

I think that the only possible comparison in the module preservation analysis are the following:

1. BRN_network vs INT_network: check if the BRN WS.LS modules are reproducible in the INT dataset and viceversa
2. BRN_network vs BRN.INT_consNet: check if the BRN WS.LS modules are reproducible in the BRN.INT_consNet and viceversa
3. INT_network vs BRN.INT_consNet: check if the INT WS.LS modules are reproducible in the BRN.INT_conNet and viceversa
0
Entering edit mode

Ok, thank you for the clarification. I'll play around with the 2-dataset consensus network then and see what kind of interesting patterns we find. Again, all your input is greatly appreciated.