Question

Issues when scaling the TOMs for consensus network construction with WGCNA


7.4 years ago

correlationmatrix ▴ 20

Hi!

I am attempting to perform a consensus co-expression network analysis with WGCNA based on two datasets (same experiment). Everything seems fine until I reach the step where the topological overlap matrix (TOM) of dataset 2 is scaled to be comparable to the one derived from dataset 1.

Specifically, in the associated tutorial for the WGCNA package (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/Consensus-NetworkConstruction-man.pdf), there is a reference plot which shows a nice linear relationship between the quantiles of the TOMs derived from each group:

[Figure: Q-Q plot of the TOMs, tutorial]

In my data, however, the corresponding plot looks far from ideal: the quantiles of the two TOMs deviate quite a lot even after scaling:

[Figure: Q-Q plot of the TOMs, my data]
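
For concreteness, the scaling step can be sketched as follows. This is a minimal Python illustration on synthetic values, not the tutorial's R code. It assumes, as I understand the tutorial, that the TOM of dataset 2 is raised to a power chosen so that its 95th percentile matches that of dataset 1. The helper names `quantile` and `scale_power` and the beta-distributed stand-in values are hypothetical.

```python
import math
import random

def quantile(values, p):
    """Linear-interpolation quantile of a list (p in [0, 1])."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def scale_power(ref_values, other_values, p=0.95):
    """Exponent that maps the p-quantile of other_values onto that of ref_values.

    Assumes all values lie strictly between 0 and 1, as TOM entries do.
    """
    return math.log(quantile(ref_values, p)) / math.log(quantile(other_values, p))

random.seed(1)
# Synthetic stand-ins for sampled TOM entries; NOT real data.
tom1 = [random.betavariate(2, 20) for _ in range(5000)]
tom2 = [random.betavariate(2, 8) for _ in range(5000)]

power = scale_power(tom1, tom2)
tom2_scaled = [x ** power for x in tom2]

# The reference quantile matches (up to interpolation error) by construction ...
print("p=0.95", quantile(tom1, 0.95), quantile(tom2_scaled, 0.95))
# ... but the other quantiles are free to differ, which is what the Q-Q plot shows.
for p in (0.25, 0.5, 0.75, 0.99):
    print("p=%.2f" % p, quantile(tom1, p), quantile(tom2_scaled, p))
```

A single exponent only pins one quantile, so a Q-Q plot that stays curved after scaling means the two TOM distributions differ in shape, not just in scale.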

My questions are therefore the following:

1) Is this a problem I can ignore? I have not found a single publication that shows a Q-Q plot of the TOMs, nor any forum question about this issue. That suggests that people either ignore it, never check in the first place, or have ideal data and never run into it, or that some trivial mistake in my preprocessing causes it.

2) How can I solve the issue? (Assuming that I can)

Additional details:

The experiment consists of two groups of samples (48x2) sequenced using RNA-seq. The goal is to find similarities and differences between these two groups in relation to a set of clinical traits. If it is relevant, the variance in one of the groups is much smaller than in the other (for biological reasons).
The gene expression data is originally count-based, and is transformed using the "rlog" function of the DESeq2 package (blind=TRUE). Attempts with "varianceStabilizingTransformation" give similar, if not worse, results.
Genes with expression below 1 RPM in more than 90% of the samples, and with a CV below the first quartile (Q1) of the CV distribution, are filtered out, leaving ~13000 genes for analysis. (The RPM calculation is performed separately from the rlog transformation, which is applied to the raw counts.)
Both datasets pass the scale-free topology criterion fine. A soft power of 5 gives a fit of over 0.9 for signed networks derived from each dataset.
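
To make the network settings explicit: a signed WGCNA network turns each pairwise correlation into an adjacency via ((1 + cor) / 2)^power. The snippet below is a small Python sketch of that formula with the soft power of 5 mentioned above. The function name `signed_adjacency` and the example correlations are hypothetical, and this is an illustration, not the package's own code.

```python
def signed_adjacency(cor, beta=5):
    """Signed-network adjacency: ((1 + cor) / 2) ** beta."""
    return ((1.0 + cor) / 2.0) ** beta

# Example correlations (made up) showing how the power suppresses weak
# and negative correlations while keeping strong positive ones.
for r in (-0.8, 0.0, 0.5, 0.9):
    print(r, signed_adjacency(r))
```

Because the TOM is built from these adjacencies, a systematic difference in correlation strength between the two groups (for example, from the lower variance in one group) would carry over into the TOM distributions.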

WGCNA • 2.3k views
