Question

WGCNA on gene abundance data

0

Entering edit mode

9.1 years ago

cara78 ▴ 10

I am trying to do co-occurrence analysis using WGCNA. My data is foreach species I have the abundance counts. Before I began WGCNA me data was log transformed and missing data is NA. I do have a lot of NAs in the dataset but they are consistently spread out across the samples.

I have been using the tutorial on youtube and also look at the tutorial on the WGCNA website. But because all the examples and code is for microarray and RNA seq; I am confused and unsure if it is going well.

My problem is when I try to get a power for the soft threshold. In the plot "Scale-free topology fit index as a function of the soft-thresholding power" my R2 scale only goes from -0.6 to 0.2. I am concerned about this because in the tutorials its 0 to 0.8. And my SFT is as follows,

Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.

1 1 0.65300 1.200 0.8920 94.6 109.0 205

2 2 0.51500 1.830 0.8790 70.8 86.9 151

3 3 0.61800 7.720 0.8890 58.3 72.9 121

4 4 0.27400 1.470 0.8410 50.8 63.7 115

5 5 0.10900 0.917 0.8940 46.0 57.0 115

6 6 0.02440 0.452 0.8140 42.7 52.2 114

7 7 0.00366 -0.181 0.7100 40.3 49.2 114

8 8 0.01140 -0.316 0.6750 38.5 46.6 114

9 9 0.12700 -3.630 0.0687 37.1 44.0 114

10 10 0.14400 -3.890 0.0761 35.9 42.4 113

11 11 0.14800 -3.910 0.0832 35.0 41.1 113

12 12 0.15200 -3.950 0.0909 34.3 39.9 113

13 13 0.17400 -4.310 0.0951 33.6 38.9 113

14 14 0.17600 -4.310 0.0986 33.1 38.1 113

15 15 0.23500 -1.400 0.6150 32.6 37.5 113

16 16 0.26900 -1.510 0.6070 32.2 36.9 113

17 17 0.31700 -1.650 0.6100 31.9 36.4 113

18 18 0.32400 -1.660 0.6110 31.5 35.9 113

19 19 0.33000 -1.700 0.5950 31.3 35.6 113

20 20 0.33500 -1.750

I am also getting 24 warnings

Warning messages:

1: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 473 .

2: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 473 .

3: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 473 .

4: In eval(expr, envir, enclos) : Some correlations are NA in block 1 : 473 .

5: In as.vector(log10(dk)) : NaNs produced

6: In as.vector(log10(dk)) : NaNs produced

7: In as.vector(log10(dk)) : NaNs produced .....

I know another person here posted these but that was for an RNAseq question.

In the help page https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html at point 6 it says to take power 14 in my case if "If the scale-free topology fit index fails to reach values above 0.8 for reasonable powers". But I want to make sure that is the case and I dont have a problem.

Should I not have log transformed the data?

My code is as follows but most of it comes from the tutorials:

library("flashClust")

library(WGCNA)

library(cluster) 

options(stringsAsFactors = FALSE)
allowWGCNAThreads()

z <- list.files(pattern = "*.csv")
data <- lapply(z, function(x) {temp <- readLines(x);
           read.table(text = temp[2:length(temp)], sep = ",", col.names = c("Species", temp[1]))})

d <- Reduce(function(x, y) merge(x, y, all=TRUE), data)
# ##mat_data is now a matrix with names back on 
rnames <- d[,1]
matdata <- data.matrix(d[,2:ncol(d)])
rownames(matdata) <- rnames

# ##convert matrix to dataframe like tutorial
data1 <- as.data.frame(matdata)

# ##we transpose the data just for this section in order to check
data2 <- t(data1)

# ## sample network based on squared Euclidean distance 

A=adjacency(data1,type="distance") 

# ## this calculates the whole network connectivity 
k=as.numeric(apply(A,2,sum))-1 

# ## standardized connectivity 
Z.k=scale(k) 

# ##Designate samples as outlying 
tthresholdZ.k=-2.5 

sampleTree = flashClust(as.dist(1-A), method = "average") 

# ## Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# ##The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
pdf(file = "sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)
dev.off()

# ##Remove outlying samples from expression data 
remove.samples= Z.k<tthresholdZ.k | is.na(Z.k) 
datExprSAM=data2[!remove.samples,] 

# ## Recompute the sample network among the remaining samples
# ## Tranpose datExpr so set up as data1 
AA=adjacency(t(datExprSAM),type="distance") 
# ##Let's recompute the Z.k values of outlyingness 
kk=as.numeric(apply(AA,2,sum))-1 
Z.k=scale(kk) 

# ## calculate the cluster tree using flahsClust or hclust
sampleTree2 = flashClust(as.dist(1-AA), method = "average") 

# ## Plot the sample tree: Open a graphic output window of size 12 by 9 inches

sizeGrWindow(12,9)
pdf(file = "sampleClustering_deleted.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree2, main = "Sample clustering to display detected outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)
dev.off()

# ## Choose a set of soft-thresholding powers
powers=c(1:20)
# ## Call the network topology analysis function
sft = pickSoftThreshold(datExprSAM, powerVector = powers, networkType = "signed")

# ## Plot the results:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# ## Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# ## this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")

# ## Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

wgcna gene abundance soft threshold • 5.6k views

ADD COMMENT • link 9.1 years ago by cara78 ▴ 10

0

Entering edit mode

Hi, what sort of assay are your gene abundances derived from? How many genes are you looking at and how many samples do you have?

Also, please put your code in a "code sample" block (available as a button with the 101 in the post editor) so that it's a bit easier to read here.

ADD REPLY • link 9.1 years ago by Adamc ▴ 680

0

Entering edit mode

It was sequenced ribosomes Dna and the counts were determined from researching papers where the copy number was determined. I have 473 genes and 48 samples. I originally had 50 sa,pleas but they were removed after the quality control step. Sorry about the code and thanks for the tip to fix it.

ADD REPLY • link 9.1 years ago by cara78 ▴ 10

1

Entering edit mode

Do you have counts for each sample then? Sorry, having a bit of trouble understanding your description. Are your values for gene counts derived from different sources (counts for gene A were derived from one publication, and for gene B from a different publication, etc)? I think a simpler clustering approach might be best as a first pass, there are many out there which are much faster to work with than WGCNA- see something like MFuzz http://bioconductor.org/packages/release/bioc/html/Mfuzz.html WGCNA is typically appropriate for very large datasets, which meet the "scale free topology" assumption- otherwise it's unlikely that the resulting modules will be stable (as in small changes in parameters end up changing the results quite a bit).

ADD REPLY • link 9.1 years ago by Adamc ▴ 680