Question: WGCNA soft thresholding problem
0
gravatar for catagui
11 days ago by
catagui30
Miami, FL, USA
catagui30 wrote:

Hi I am having troubles with the WGCNA soft thresholding analysis. I am getting negative values in the Scale independence plot and power 1 is =0 in the type=signed. I have attached both plots using networkType="signed" and "unsigned".

For normalizing my counts I had to add geoMeans as below because I have no rows without a 0. Also when I look at the "plotDendroAndColors" they don't group by treatments. However I am still interested in this data because are the transcriptome of an organism that live in symbiosis within my species of interest.

Below what I input and the graphs. Any ideas of what could be happening here will be helpful.

Thanks,

Catalina

 #reading counts table
counts <-read.table("RSEM.isoform.counts.matrix",header=TRUE,row.names=1)
counts=round(counts)
counts$low = apply(counts[,1:47],1,function(x){sum(x<=10)}) 
counts2<-counts[-which(counts$low>42),]
ddsCOUNTS<-DESeqDataSetFromMatrix(countData=counts2[,1:47],colData=conditions,design=~geno*cond)

# estimateSizeFactors
cts <- counts(ddsCOUNTS)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(ddsCOUNTS, geoMeans=geoMeans)
rlogCOUNTS<-rlog(dds,blind=F, fitType = "local") #use blind=TRUE to not account for experimental design
dat=as.data.frame(assay(rlogCOUNTS)) 
datExpr = as.data.frame(t(dat[,1:47]))

# soft thresholding powers
powers = c(seq(1,20,by=1)); 
sft = pickSoftThreshold(datExpr, powerVector = powers, networkType="signed", verbose = 2) #want smallest value, to plateau closest to 0.9 (but still under)

par(mfrow = c(1,2));
cex1 = 0.9;
#  Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

networkType=unsigned networkType=signed

ADD COMMENTlink modified 10 days ago • written 11 days ago by catagui30

Please use the code formatting button (10101) to highlight your code for improved readability.

ADD REPLYlink written 11 days ago by ATpoint11k

What exactly is your input data to WGCNA?

I see that you have:

rlogCOUNTS <- rlog(dds,blind=F, fitType = "local")

...and then:

sft = pickSoftThreshold(datExpr, powerVector = powers, networkType="signed", verbose = 2)

What is in datExpr? You should use rlogCOUNTS for WGCNA.

ADD REPLYlink written 10 days ago by Kevin Blighe33k

Hi Kevin,

Sorry I omitted a couple of lines. I input the counts table, filtered it for low counts, then turned it into a deseq2 object. After this I rlog transformed the data which is the datExpr matrix. I have added those lines above. Thanks

ADD REPLYlink written 10 days ago by catagui30

It looks like your rlog data is bimodal, or at least that is what I am guessing. If you plot it as a histogram, you should see 2 'humps' in the distribution.

Did you try the variance-stabilised transformation?

ADD REPLYlink modified 10 days ago • written 10 days ago by Kevin Blighe33k

Yes I also tried vst and got another weird graph. I know that my samples have a "biased" expression, since when I do the differential gene expression analysis, I get a high response for one of my genotypes at one time-1, while the other 3 have almost 0 genes differentially expressed. However this is interesting result. But not sure if this means I can not use this data for WGCNA.

VST soft thresholding

ADD REPLYlink modified 10 days ago • written 10 days ago by catagui30

You are right it is bimodal. rlog histogram

ADD REPLYlink modified 10 days ago • written 10 days ago by catagui30

I see. There is usually a bimodal distribution on the rlog counts but not this accentuated. Also, in this case, your distribution is heavily shifted toward 0 - this final point is likely the explanation of what is happening.

Was the read count low over these samples, do you know? You had mentioned that there is a 0 in each row. I see that you also filtered out those with sum <=10.

How did you process the data before DESeq2? It is RSEM isoform data obviously? I am not sure that DESeq2 was designed for isoform-level analysis, however, you could summarise these to gene-level via tximport (they go over this in the DESeq2 vignette: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html ).

ADD REPLYlink written 9 days ago by Kevin Blighe33k

When I tried to run the "estimateSizeFactors" I was getting the error below then follow the geoMeans solution I have above.

“Error in estimateSizeFactorsForMatrix(counts(object), locfunc, geoMeans = geoMeans) : every gene contains at least one zero, cannot compute log geometric means”.

But I am not really sure how to check the read count over the samples. The filtering was just the number I used before to filter low counts.

Yes it is RSEM, I followed the "abundance_estimates_to_matrix.pl" script in Trinity. I had this inquiry a while ago, weather I should use the isoform.counts or the gene.counts output, since my input transcriptome only has the longest isoform, meaning I have the same number of sequences in both files. The Trinity script has incorporated the tximport function to estimate the gene.counts, which uses a "gene_trans_map" file to get the counts. Then the conclusion was that I could just use the isoform.counts since I don't have multiple isoforms (there are differences in the counts between both files).

I just checked the gene.counts matrix rlog and I get the same 0 biased bimodal distribution.

ADD REPLYlink modified 9 days ago • written 9 days ago by catagui30

It is worth a try (the gene-level counts). I do not know in which way this would differ from your isoform-level counts in the context of the analysis that you are conducting, though. A quick check on number of rows and the histogram should tell you that fairly quickly.

ADD REPLYlink written 9 days ago by Kevin Blighe33k

I get the same 0 biased. Not sure if I am explaining properly, but they both have the same number of gene/isoform Ids since I decided leave the longer isoform per gene in my transcriptome.

gene_counts histogram

ADD REPLYlink written 9 days ago by catagui30

Thanks, I get that now how the gene-level is the same as isoform-level in this case. WGCNA just doesn't like those low / sparse counts though. Also, the rlog transformation in DESeq2 was made to deal with the issues relating to logging of low count data.

The only other thing I'd try is to use the normalised, unlogged counts as input to WGCNA, which is valid due to the fact that WGCNA is based on correlation. You could also apply more aggressive filters on the raw counts. I tend to filter based on mean raw count, eliminating genes that have mean <15 or even 20 in some cases.

Out of curiosity, how does a boxplot of these rlog counts look, and also the dispersion plot?

I have asked others for help too.

ADD REPLYlink modified 8 days ago • written 8 days ago by Kevin Blighe33k

You may be interested in this: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2442-7

They indicate that EdgeR is better for low count data; however, I cannot say that they have done a great benchmark. They may look at another dataset and find that DESeq2 performs 'better' (whatever 'better' may mean...).

ADD REPLYlink written 8 days ago by Kevin Blighe33k

Hi Kevin,

Thanks for the help with this. Below are the rlog boxplot and dispersion plot. I just tried to input the TMM normalized matrix (got it from RSEM), also filtered as you suggested but still get those negative values (plot below).

I will have a look at the paper, and see if edgeR can help.

rlog dispersion rlog boxplot TMM normalized data

ADD REPLYlink modified 8 days ago • written 8 days ago by catagui30

Ouch! - that data looks difficult. Have you looked at a PCA bi-plot? - see HERE. Judging by the boxplot, a few of your samples are likely outliers and affecting the data normalisation step. Keep in mind that 'outliers' may be genuinely related to biology and not technical artifacts; however, this may still affect the fitting of statistical models to such data.

ADD REPLYlink written 8 days ago by Kevin Blighe33k

I agree it is difficult. I can see how the pca shows what I get with differential gene expr. analyses. Were all the response happens at T2 except for geno C that also has a response at T1. This is a heat stress treatment, two time points.

Maybe I should try to get rid of the ones whose mean is above the rest? What is the minimum number of samples that I can work with for wgcna?

PCA_rlog

ADD REPLYlink modified 7 days ago • written 8 days ago by catagui30

The last time that I saw 90% explained variation on PC1 was when I compared tumour genotypes to those of healthy blood cells. So, this is reflective of quite a large difference. In saying that, there is no evidence of what would be regarded as an outlier. I can only now appreciate how complex the experimental set-up is, too!

Why not run WGCNA on subsets of the cohort? What do you hope to achieve by using WGCNA in any case? Many don't like WGCNA and see it as a sort of waste of time because it rarely shows anything concrete. Did you have any idea about what you wanted to see from running WGCNA?

I, of course, have a much simple network analysis protocol here: Network plot from expression data in R using igraph

One other thing that you could try is to convert the rlog counts to Z-scores, and then run WGCNA on those. To convert them to Z-scores, in R, just do:

t(scale(t(rlog)))

A histogram after that should show a very nice distribution. Logging and then transforming to Z-scale, while not common, is used in, for example, metabolomics data-analysis, and also for gene expression data when plotting in heatmaps.

ADD REPLYlink written 7 days ago by Kevin Blighe33k

The reason why I want to use wgcna is to compare it with the host data. These reads are from the algae that lives in symbiosis with the coral. When I analyzed the coral data I got really nice results, since I was able to get the gene modules that response to treatment and ones that are specific for the genotype independent of the treatment. Also I really liked that I could relate the phenotypic traits with the gene modules. Then comparing both host and symbiont would have been nice.

I have tried the analyses to getting only 1 timepoint but have the same trouble with the soft thresholding (boxplot below looks better?, also attached the pca). Also tried the Z-scores as the figure below (ignore the title and axis it the z-score) but also no change.

I will try your simple network, already left you a question in the protocol.

Or at the end, I must just have to compare the differential gene expression without the network..

Thanks again

histogram_zcores boxplot time 2 pca-t2

ADD REPLYlink modified 6 days ago • written 6 days ago by catagui30

hmm, seems like this data is just always going to prove a bit difficult.

In the plot that you show, here: C: WGCNA soft thresholding problem Could you plot the powers that appear after soft threshold of 20? Based on the trend, your first soft threshold value that passes 0.9 should be ~25? You could feasibly choose that and proceed? It is much higher than usual; however, we know that your data is of low counts.

ADD REPLYlink written 6 days ago by Kevin Blighe33k

I tried a couple of numbers but even 100 would not reach the 0.9 as the first figure below. Then I tried edgeR with the TMM matrix and was able to get to 0.9, however I am not sure if this is correct since I didn't transform the data and the boxplot shows all the samples at 0.

counts <-read.table("RSEM.isoform.TMM.EXPR.matrix",header=TRUE,row.names=1)
conditions<-read.table("samples_ID2.txt", header = TRUE, row.names = "ID")
y<-DGEList(counts=counts,group=conditions$geno,remove.zeros=T)
keep <- rowSums(cpm(y)>10) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
dat=as.data.frame(y$counts)

After obtaining 'dat' just did as per the first message. Also noted that if I use a stronger filter (as below), then again it does not reach the 0.9 but stays in 0.5..

keep <- rowMeans(y$counts) >= 15
y <- y[keep,]

softthresholding of 100 boxplot edgeR edgeR sof threshold

ADD REPLYlink modified 5 days ago • written 5 days ago by catagui30

If I am honest, I would take the edgeR data and go with it. Look at it this way: if you state in your methods that you used EdgeR normalised counts as input to WGCNA, I am doubtful that many would disagree.

For the boxplot, it is typical to have outliers like that with RNA-seq data. You could set outline=FALSE when running boxplot in order to get a more clear image.

EdgeR log CPM normalised counts did not give the same for WGCNA? As you have low counts, you could add a prior count before logging like this:

cpm(y, log=TRUE, prior.count = 1)

If not 1, then add 2, etc.

ADD REPLYlink modified 4 days ago • written 4 days ago by Kevin Blighe33k

When I to CPM the counts in edgeR then again it doesn't reach 0.9, tried different prior.count as below.

So by using the 'calcNormFactors' I am normalizing the libraries, but I also need to normalize all the counts to e.g. log? Then if I just use the 'calcNormFactors' and input these values to WGCNA it would not be appropriate?

edgeR_ cpm1

ADD REPLYlink modified 4 days ago • written 4 days ago by catagui30

Strange, and unfortunate! By all means you can still use the unlogged, normalised EdgeR counts; however, just clearly state this in the methods. As WGCNA is based on correlation, logged or unlogged counts should be fine (and this is stated by the developer(s) of the program).

I say all of this within the confines of my general disapproval of WGCNA!

ADD REPLYlink written 4 days ago by Kevin Blighe33k
1

Hi Kevin thanks for all the advice. I learned a lot of different ways to input data into wgcna. I will make sure to explain this in the methods and hopefully I will get some useful data.

You don't like WGCNA?

ADD REPLYlink modified 3 days ago • written 3 days ago by catagui30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 710 users visited in the last hour