WGCNA soft thresholding problem
2
0
Entering edit mode
4.3 years ago
catagui ▴ 40

Hi I am having troubles with the WGCNA soft thresholding analysis. I am getting negative values in the Scale independence plot and power 1 is =0 in the type=signed. I have attached both plots using networkType="signed" and "unsigned".

For normalizing my counts I had to add geoMeans as below because I have no rows without a 0. Also when I look at the "plotDendroAndColors" they don't group by treatments. However I am still interested in this data because are the transcriptome of an organism that live in symbiosis within my species of interest.

Below what I input and the graphs. Any ideas of what could be happening here will be helpful.

Thanks,

Catalina

 #reading counts table
counts=round(counts)
counts$low = apply(counts[,1:47],1,function(x){sum(x<=10)}) counts2<-counts[-which(counts$low>42),]
ddsCOUNTS<-DESeqDataSetFromMatrix(countData=counts2[,1:47],colData=conditions,design=~geno*cond)

# estimateSizeFactors
cts <- counts(ddsCOUNTS)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(ddsCOUNTS, geoMeans=geoMeans)
rlogCOUNTS<-rlog(dds,blind=F, fitType = "local") #use blind=TRUE to not account for experimental design
dat=as.data.frame(assay(rlogCOUNTS))
datExpr = as.data.frame(t(dat[,1:47]))

# soft thresholding powers
powers = c(seq(1,20,by=1));
sft = pickSoftThreshold(datExpr, powerVector = powers, networkType="signed", verbose = 2) #want smallest value, to plateau closest to 0.9 (but still under)

par(mfrow = c(1,2));
cex1 = 0.9;
#  Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main = paste("Scale independence")); text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")


WGCNA RNA-Seq soft thresholding data normalization • 8.0k views
0
Entering edit mode

Please use the code formatting button (10101) to highlight your code for improved readability.

0
Entering edit mode

What exactly is your input data to WGCNA?

I see that you have:

rlogCOUNTS <- rlog(dds,blind=F, fitType = "local")


...and then:

sft = pickSoftThreshold(datExpr, powerVector = powers, networkType="signed", verbose = 2)


What is in datExpr? You should use rlogCOUNTS for WGCNA.

0
Entering edit mode

Hi Kevin,

Sorry I omitted a couple of lines. I input the counts table, filtered it for low counts, then turned it into a deseq2 object. After this I rlog transformed the data which is the datExpr matrix. I have added those lines above. Thanks

0
Entering edit mode

It looks like your rlog data is bimodal, or at least that is what I am guessing. If you plot it as a histogram, you should see 2 'humps' in the distribution.

Did you try the variance-stabilised transformation?

0
Entering edit mode

Yes I also tried vst and got another weird graph. I know that my samples have a "biased" expression, since when I do the differential gene expression analysis, I get a high response for one of my genotypes at one time-1, while the other 3 have almost 0 genes differentially expressed. However this is interesting result. But not sure if this means I can not use this data for WGCNA.

0
Entering edit mode

You are right it is bimodal.

0
Entering edit mode

I see. There is usually a bimodal distribution on the rlog counts but not this accentuated. Also, in this case, your distribution is heavily shifted toward 0 - this final point is likely the explanation of what is happening.

Was the read count low over these samples, do you know? You had mentioned that there is a 0 in each row. I see that you also filtered out those with sum <=10.

How did you process the data before DESeq2? It is RSEM isoform data obviously? I am not sure that DESeq2 was designed for isoform-level analysis, however, you could summarise these to gene-level via tximport (they go over this in the DESeq2 vignette: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html ).

0
Entering edit mode

When I tried to run the "estimateSizeFactors" I was getting the error below then follow the geoMeans solution I have above.

“Error in estimateSizeFactorsForMatrix(counts(object), locfunc, geoMeans = geoMeans) : every gene contains at least one zero, cannot compute log geometric means”.

But I am not really sure how to check the read count over the samples. The filtering was just the number I used before to filter low counts.

Yes it is RSEM, I followed the "abundance_estimates_to_matrix.pl" script in Trinity. I had this inquiry a while ago, weather I should use the isoform.counts or the gene.counts output, since my input transcriptome only has the longest isoform, meaning I have the same number of sequences in both files. The Trinity script has incorporated the tximport function to estimate the gene.counts, which uses a "gene_trans_map" file to get the counts. Then the conclusion was that I could just use the isoform.counts since I don't have multiple isoforms (there are differences in the counts between both files).

I just checked the gene.counts matrix rlog and I get the same 0 biased bimodal distribution.

0
Entering edit mode

It is worth a try (the gene-level counts). I do not know in which way this would differ from your isoform-level counts in the context of the analysis that you are conducting, though. A quick check on number of rows and the histogram should tell you that fairly quickly.

0
Entering edit mode

I get the same 0 biased. Not sure if I am explaining properly, but they both have the same number of gene/isoform Ids since I decided leave the longer isoform per gene in my transcriptome.

0
Entering edit mode

Thanks, I get that now how the gene-level is the same as isoform-level in this case. WGCNA just doesn't like those low / sparse counts though. Also, the rlog transformation in DESeq2 was made to deal with the issues relating to logging of low count data.

The only other thing I'd try is to use the normalised, unlogged counts as input to WGCNA, which is valid due to the fact that WGCNA is based on correlation. You could also apply more aggressive filters on the raw counts. I tend to filter based on mean raw count, eliminating genes that have mean <15 or even 20 in some cases.

Out of curiosity, how does a boxplot of these rlog counts look, and also the dispersion plot?

I have asked others for help too.

0
Entering edit mode

You may be interested in this: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2442-7

They indicate that EdgeR is better for low count data; however, I cannot say that they have done a great benchmark. They may look at another dataset and find that DESeq2 performs 'better' (whatever 'better' may mean...).

0
Entering edit mode

Hi Kevin,

Thanks for the help with this. Below are the rlog boxplot and dispersion plot. I just tried to input the TMM normalized matrix (got it from RSEM), also filtered as you suggested but still get those negative values (plot below).

I will have a look at the paper, and see if edgeR can help.

0
Entering edit mode

Ouch! - that data looks difficult. Have you looked at a PCA bi-plot? - see HERE. Judging by the boxplot, a few of your samples are likely outliers and affecting the data normalisation step. Keep in mind that 'outliers' may be genuinely related to biology and not technical artifacts; however, this may still affect the fitting of statistical models to such data.

0
Entering edit mode

I agree it is difficult. I can see how the pca shows what I get with differential gene expr. analyses. Were all the response happens at T2 except for geno C that also has a response at T1. This is a heat stress treatment, two time points.

Maybe I should try to get rid of the ones whose mean is above the rest? What is the minimum number of samples that I can work with for wgcna?

0
Entering edit mode

The last time that I saw 90% explained variation on PC1 was when I compared tumour genotypes to those of healthy blood cells. So, this is reflective of quite a large difference. In saying that, there is no evidence of what would be regarded as an outlier. I can only now appreciate how complex the experimental set-up is, too!

Why not run WGCNA on subsets of the cohort? What do you hope to achieve by using WGCNA?

One other thing that you could try is to convert the rlog counts to Z-scores, and then run WGCNA on those. To convert them to Z-scores (scaled by row), in R, just do:

t(scale(t(rlog)))


A histogram after that should show a very nice distribution. Logging and then transforming to Z-scale, while not common, is used in, for example, metabolomics data-analysis, and also for gene expression data when plotting in heatmaps.

0
Entering edit mode

The reason why I want to use wgcna is to compare it with the host data. These reads are from the algae that lives in symbiosis with the coral. When I analyzed the coral data I got really nice results, since I was able to get the gene modules that response to treatment and ones that are specific for the genotype independent of the treatment. Also I really liked that I could relate the phenotypic traits with the gene modules. Then comparing both host and symbiont would have been nice.

I have tried the analyses to getting only 1 timepoint but have the same trouble with the soft thresholding (boxplot below looks better?, also attached the pca). Also tried the Z-scores as the figure below (ignore the title and axis it the z-score) but also no change.

I will try your simple network, already left you a question in the protocol.

Or at the end, I must just have to compare the differential gene expression without the network..

Thanks again

0
Entering edit mode

hmm, seems like this data is just always going to prove a bit difficult.

In the plot that you show, here: C: WGCNA soft thresholding problem Could you plot the powers that appear after soft threshold of 20? Based on the trend, your first soft threshold value that passes 0.9 should be ~25? You could feasibly choose that and proceed? It is much higher than usual; however, we know that your data is of low counts.

0
Entering edit mode

I tried a couple of numbers but even 100 would not reach the 0.9 as the first figure below. Then I tried edgeR with the TMM matrix and was able to get to 0.9, however I am not sure if this is correct since I didn't transform the data and the boxplot shows all the samples at 0.

counts <-read.table("RSEM.isoform.TMM.EXPR.matrix",header=TRUE,row.names=1)
y<-DGEList(counts=counts,group=conditions$geno,remove.zeros=T) keep <- rowSums(cpm(y)>10) >= 2 y <- y[keep, , keep.lib.sizes=FALSE] y <- calcNormFactors(y) dat=as.data.frame(y$counts)


After obtaining 'dat' just did as per the first message. Also noted that if I use a stronger filter (as below), then again it does not reach the 0.9 but stays in 0.5..

keep <- rowMeans(y\$counts) >= 15
y <- y[keep,]


0
Entering edit mode

If I am honest, I would take the edgeR data and go with it. Look at it this way: if you state in your methods that you used EdgeR normalised counts as input to WGCNA, I am doubtful that many would disagree.

For the boxplot, it is typical to have outliers like that with RNA-seq data. You could set outline=FALSE when running boxplot in order to get a more clear image.

EdgeR log CPM normalised counts did not give the same for WGCNA? As you have low counts, you could add a prior count before logging like this:

cpm(y, log=TRUE, prior.count = 1)


If not 1, then add 2, etc.

0
Entering edit mode

When I to CPM the counts in edgeR then again it doesn't reach 0.9, tried different prior.count as below.

So by using the 'calcNormFactors' I am normalizing the libraries, but I also need to normalize all the counts to e.g. log? Then if I just use the 'calcNormFactors' and input these values to WGCNA it would not be appropriate?

0
Entering edit mode

Strange, and unfortunate! By all means you can still use the unlogged, normalised EdgeR counts; however, just clearly state this in the methods. As WGCNA is based on correlation, logged or unlogged counts should be fine (and this is stated by the developer(s) of the program).

1
Entering edit mode

Hi Kevin thanks for all the advice. I learned a lot of different ways to input data into wgcna. I will make sure to explain this in the methods and hopefully I will get some useful data.

You don't like WGCNA?

0
Entering edit mode

It is okay for exploratory analyses.

0
Entering edit mode

Hello catagui, I reach this 0.9 at soft threshold 43 is this okey?

1
Entering edit mode
4.1 years ago
leaodel ▴ 170

Hi catagui, I was having the same issue and setting dataIsExpr = T on pickSoftThreshold() solved things for me. Hope it helps!

0
Entering edit mode

Hi leaodel thanks for your input, I just tried your suggestion, but for my data it doesn't solve the problem. The only way for getting it to reach the threshold is as I describe above. Thanks!

1
Entering edit mode
13 months ago
ChengZ ▴ 10

Another possibility. Your expression matrix is transposed. I encountered this before.