Question

WGCNA - GTEx RNASeq - Help choosing a soft power

2.8 years ago

Branden • 0

Hello all,

I have looked through the many existing posts on soft power selection in WGCNA, but unfortunately wasn't able to find a solution to my problem. In brief, I cannot reach a signed scale-free topology R^2 of 0.8 or higher without a very high soft power. I am conducting an exploratory analysis of the GTEx gene expression data for the skeletal muscle samples. To summarize, this is what I have done:

1. Imported the public GTEx TPM data, selected just the skeletal muscle data, and normalized via log2(TPM+1); total genes = 56,200, samples = 803.
2. Excluded all genes with near 0 variance and those with mean log2(TPM+1) <= 0.5, on the basis of this histogram, leaving 16,089 genes, 803 samples:
3. Computed the estimated soft power (signed network) on the remaining genes and plotted as usual:
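For readers who want to see the transform and filtering steps spelled out, here is a minimal sketch in plain Python rather than R. The function names, the toy values, and the `min_var` cutoff are assumptions made for illustration; the post only says "near 0 variance" and does not give a threshold.

```python
import math
import statistics

def log_transform(tpm_rows):
    """Apply log2(TPM + 1) to every value; keys are genes, values are per-sample TPMs."""
    return {gene: [math.log2(v + 1.0) for v in values]
            for gene, values in tpm_rows.items()}

def filter_genes(log_rows, min_mean=0.5, min_var=1e-6):
    """Keep genes with mean log2(TPM+1) > min_mean and variance > min_var.

    min_mean = 0.5 follows the post; min_var is a hypothetical stand-in
    for "near 0 variance".
    """
    kept = {}
    for gene, values in log_rows.items():
        mean_ok = statistics.fmean(values) > min_mean
        var_ok = statistics.pvariance(values) > min_var
        if mean_ok and var_ok:
            kept[gene] = values
    return kept

# Toy input (made-up values, not GTEx data).
tpm = {
    "geneA": [0.0, 0.1, 0.0, 0.2],     # barely expressed: fails the mean filter
    "geneB": [5.0, 5.0, 5.0, 5.0],     # constant: fails the variance filter
    "geneC": [10.0, 30.0, 3.0, 80.0],  # expressed and variable: kept
}
filtered = filter_genes(log_transform(tpm))
print(sorted(filtered))
```

In a real analysis the same two filters would be applied to the full gene-by-sample matrix before any network construction.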
At this point you can see that I need a power of 26 to even hit 0.8 on the measure of scale free topology, and the connectivity has dropped off a fair bit by then. So I started wondering what global drivers of gene expression might exist (as discussed in the WGCNA FAQ and elsewhere), and how to deal with them. I plotted the dendrogram along with a trait heatmap for any trait info I thought might be relevant. Sample clustering is by average Euclidean distance after the log2(TPM+1) transform:

As you can see, there are some definite clusters and it looks like they may be related to the terminal phase duration (Hardy score) and the tissue ischemia time, which each overlap quite a bit. The turquoise bands in the Hardy score represent the ventilated subgroup, so it's sadly not surprising that they have the lowest ischemia time. Having said all that, this kind of analysis is new to me, so I'm not sure how to adjust for these factors, which likely(?) are responsible for the high soft power. I tried re-running the soft-power calculations for just the ventilated subgroup, but didn't get significantly different results.

Thanks to anyone who read this far. I'm not averse to creating multiple networks, but I'd like to have confidence in selecting my soft power(s). I am considering a soft power of 12-16, as these are near the power of 12 recommended for this sample size, and while they have low signed scale-free topology R^2 values, the mean and median connectivity values look OK. Alternatively, I could use a soft power of 26, which gets the signed scale-free topology R^2 up to ~0.8, but lowers the connectivity considerably.
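To make the trade-off above concrete, here is a rough sketch of the two quantities being balanced: mean connectivity and the signed scale-free fit R^2, computed over a range of powers for a signed network. It is written in plain Python on synthetic data with a single shared "global driver", and it is only a simplified stand-in for what WGCNA's `pickSoftThreshold` reports, not its actual implementation. All names, the binning scheme, and the simulated data are assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def signed_connectivity(cor, power):
    """k_i = sum over j != i of ((1 + cor_ij) / 2) ** power (signed adjacency)."""
    n = len(cor)
    return [sum(((1.0 + cor[i][j]) / 2.0) ** power for j in range(n) if j != i)
            for i in range(n)]

def scale_free_fit(k, n_bins=10):
    """Signed R^2 of log10 p(k) against log10 k over equal-width bins.

    Positive when the slope is negative, as expected for a scale-free network.
    """
    lo, hi = min(k), max(k)
    width = (hi - lo) / n_bins
    if width <= 0:
        return 0.0
    counts, sums = [0] * n_bins, [0.0] * n_bins
    for v in k:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
        sums[idx] += v
    xs, ys = [], []
    for c, s in zip(counts, sums):
        if c > 0 and s > 0:
            xs.append(math.log10(s / c))       # mean connectivity in the bin
            ys.append(math.log10(c / len(k)))  # fraction of genes in the bin
    if len(xs) < 3 or len(set(xs)) < 2 or len(set(ys)) < 2:
        return 0.0
    r = max(-1.0, min(1.0, pearson(xs, ys)))
    return -math.copysign(r * r, r)

# Synthetic data: every gene loads on one shared factor (a "global driver").
random.seed(0)
n_samples, n_genes = 60, 80
global_factor = [random.gauss(0, 1) for _ in range(n_samples)]
genes = []
for _ in range(n_genes):
    loading = random.uniform(0.0, 1.5)
    genes.append([loading * g + random.gauss(0, 1) for g in global_factor])

cor = [[1.0] * n_genes for _ in range(n_genes)]
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        cor[i][j] = cor[j][i] = pearson(genes[i], genes[j])

results = []
for power in [1, 2, 4, 6, 8, 12, 16, 20, 26]:
    k = signed_connectivity(cor, power)
    results.append((power, sum(k) / len(k), scale_free_fit(k)))

for power, mean_k, r2 in results:
    print(f"power={power:2d}  mean_k={mean_k:8.3f}  signed_R2={r2:6.3f}")
```

Because every signed adjacency lies between 0 and 1, mean connectivity can only fall as the power rises, which is why pushing the power up to reach a given R^2 costs connectivity.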

I'd appreciate any input as far as a specific power to select, or other things to explore as far as correcting for covariates, etc.

Thank you!

WGCNA RNA-Seq GTEx • 1.2k views

score 2 · Accepted Answer · 2021-07-02

2.8 years ago

peter.langfelder ▴ 80

I don't have much experience with muscle data, but in analyses of other tissues I tend to judge the data less by the scale-free topology fit and more by the mean or median connectivity. I like the mean connectivity to be between, say, 30 and 100, and you need a large power to get there (especially for several hundred samples). The high connectivity means a lot of global variation - batch or other technical effects or perhaps some strong biological driver that would probably be better adjusted away.
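One possible way to turn that rule of thumb into a selection step is sketched below in plain Python. Taking the smallest power whose mean connectivity falls inside the range is an assumption about how to apply the heuristic, and the connectivity values are made up for illustration; they are not the poster's results.

```python
def pick_power(mean_connectivity, low=30.0, high=100.0):
    """Return the smallest power whose mean connectivity lies in [low, high].

    mean_connectivity maps power -> mean connectivity. Returns None when no
    power falls in the range.
    """
    for power in sorted(mean_connectivity):
        if low <= mean_connectivity[power] <= high:
            return power
    return None

# Made-up table of power -> mean connectivity, for illustration only.
example = {2: 400.0, 4: 150.0, 6: 60.0, 8: 20.0}
chosen = pick_power(example)
print(chosen)  # 6
```

If no power lands in the range, the function returns `None`, which would be a cue to look at the data again (for example for global effects) rather than to pick a power anyway.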

I would remove outliers (the small cluster on the right) and then adjust for one or two leading principal components or, if you want to preserve the effect of ischemic time/Hardy score, use Surrogate Variable Analysis (R package sva) and adjust for one or two leading surrogate variables.
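The idea of adjusting for a leading principal component can be illustrated with a small sketch. This is plain Python on simulated data, not the R workflow one would actually use with WGCNA or sva; the function names are hypothetical, and power iteration is used only to keep the example dependency-free.

```python
import math
import random

def center_columns(X):
    """Subtract each gene's mean across samples (X is samples x genes)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def leading_sample_pc(X, n_iter=200, seed=0):
    """Unit-length leading left singular vector of centered X, by power iteration on X X^T."""
    n = len(X)
    gram = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    # Random start: a constant vector lies in the null space of centered data.
    u = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        v = [sum(g * ui for g, ui in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u

def remove_component(X, u):
    """Return X - u (u^T X): every gene with its projection onto u removed."""
    n, p = len(X), len(X[0])
    proj = [sum(u[i] * X[i][j] for i in range(n)) for j in range(p)]
    return [[X[i][j] - u[i] * proj[j] for j in range(p)] for i in range(n)]

# Simulated data: 30 samples x 20 genes sharing one strong global factor.
random.seed(1)
n_samples, n_genes = 30, 20
factor = [random.gauss(0, 1) for _ in range(n_samples)]
loadings = [random.uniform(0.5, 2.0) for _ in range(n_genes)]
raw = [[loadings[j] * factor[i] + random.gauss(0, 0.5) for j in range(n_genes)]
       for i in range(n_samples)]

X = center_columns(raw)
u = leading_sample_pc(X)
resid = remove_component(X, u)

ss_before = sum(v * v for row in X for v in row)
ss_after = sum(v * v for row in resid for v in row)
leftover = [sum(u[i] * resid[i][j] for i in range(n_samples))
            for j in range(n_genes)]
print(f"fraction of variance removed: {1 - ss_after / ss_before:.3f}")
```

The residual matrix would then be the input to the soft-threshold analysis. Note that this removes whatever the leading component captures, including any biological signal aligned with it, which is the reason the answer suggests SVA when a known effect should be preserved.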

Hi Peter,

Thanks very much for your advice, it's really helpful! I was thinking that prioritizing mean / median connectivity might make sense but didn't have enough confidence in my theoretical understanding of the method to justify it, so putting some actual numbers to it is great. I will likely go with a power around 16 then, as it appears to be the sweet spot.

By the cluster on the right, I assume you mean the one just up and right slightly of the large middle cluster, not the small group of a handful of samples in the upper right?

I'm not familiar with adjusting for the leading principal components or SVA so I'll have to do some reading on those. If I proceed with analyzing the entire set at once, I think adjusting via the PCs may be best, if SVA preserves the impact of the Hardy score / ischemic time. I'll have to play around with it and see. I was also thinking that I may try to construct networks separately for the ventilated / non-ventilated subjects and do a consensus analysis, because it would be interesting, biologically, to see if there are differences in the modules and module trait relationships.

Anyway, thanks again for your help; it's given me some direction and things to explore.

Reply • 2.8 years ago by Branden