Question: Consensus WGCNA: soft-threshold power selection
0
gravatar for armstrong
5 months ago by
armstrong0
armstrong0 wrote:

Hi, I'm working with meta-transcriptomic expression data from a symbiotic species (host and algal symbiont) from ten experimental cohorts (islands) each with ~9 biological replicates. I am trying to generate a set of consensus modules (shared across all ten islands) that I can than use to compare each individual island to in a manner analogous to Part II of the WGCNA tutorial "Consensus analysis of female and male liver expression data." My trouble, however, is that during cohort-wise selection of a soft power threshold, several of my island cohorts do not reach scale free topology values of ~0.9. Thus, I'm unable to select an appropriate soft power threshold for construction of the consensus gene network as per the tutorial instructions. I assume this problem has something to do with the normalization protocol I've applied to my RNA-seq data, but I'm not sure how best to proceed. Any suggestions would be greatly appreciated! I've attached my relevant code below:

# Step 1 - Load holobiont normalized expression quantitation data (TPM)

RNAseq_TPM <- read.table(argsL$readCount, sep="\t",  h=T, stringsAsFactors = F, row.names=1)

# Step 2 - Preprocessing to remove genes with fewer than 10 counts overall and with counts in less than 20% of colonies.

RNAseq_highCount <- RNAseq_TPM[rowSums(RNAseq_TPM) >= 10, ]
RNAseq_filt <- RNAseq_highCount[apply(RNAseq_highCount,1,function(x) sum(x==0))<ncol(RNAseq_highCount)*0.8,]

# Step 3 - Normalize expression data using the voom methodology (Law et al. 2014).

library(edgeR)

# specify groups to be compared (in this case islands)
ile <- substr(colnames(RNAseq_filt), 1, 3)

# Specify the model to be fitted. 
# We do this before using voom since voom uses variances of the model residuals.

d0 <- DGEList(RNAseq_filt, group=ile, remove.zeros=T)
d0 <- calcNormFactors(d0)
keep <- rowSums(cpm(d0)>10) >= 20
d <- d0[keep, , keep.lib.sizes=FALSE]
dim(d) # number of genes remaining

mm <- model.matrix(~0 + ile)
RNAseq_voom <- voom(d, mm, plot = T)$E

# Step 4 - Preparing data for downstream analysis

# Split data into separate dataframes by Island

I06_RNAseq_voom <- RNAseq_voom[,grep("I06", colnames(RNAseq_voom))]
I10_RNAseq_voom <- RNAseq_voom[,grep("I10", colnames(RNAseq_voom))]
I15_RNAseq_voom <- RNAseq_voom[,grep("I15", colnames(RNAseq_voom))]
I04_RNAseq_voom <- RNAseq_voom[,grep("I04", colnames(RNAseq_voom))]
I01_RNAseq_voom <- RNAseq_voom[,grep("I01", colnames(RNAseq_voom))]
I03_RNAseq_voom <- RNAseq_voom[,grep("I03", colnames(RNAseq_voom))]
I05_RNAseq_voom <- RNAseq_voom[,grep("I05", colnames(RNAseq_voom))]
I07_RNAseq_voom <- RNAseq_voom[,grep("I07", colnames(RNAseq_voom))]
I08_RNAseq_voom <- RNAseq_voom[,grep("I08", colnames(RNAseq_voom))]
I09_RNAseq_voom <- RNAseq_voom[,grep("I09", colnames(RNAseq_voom))]

# Specify the number of dataframes (i.e., sets of data) created above:

nSets = 10

# For easier labeling of plots, create a vector holding descriptive names of the two sets.

setLabels = c("I06_RNAseq_voom", "I10_RNAseq_voom","I15_RNAseq_voom","I04_RNAseq_voom","I01_RNAseq_voom","I03_RNAseq_voom","I05_RNAseq_voom","I07_RNAseq_voom","I08_RNAseq_voom","I09_RNAseq_voom")
shortLabels = c("I06", "I10","I15","I04","I01","I03","I04","I05","I07","I08","I09")

# Create a dataframe containing the multi-set data:

multiExpr = vector(mode = "list", length = nSets)

multiExpr[[1]] = list(data = as.data.frame(t(I06_RNAseq_voom)))
names(multiExpr[[1]]$data) = rownames(I06_RNAseq_voom)
rownames(multiExpr[[1]]$data) = colnames(I06_RNAseq_voom)
etc.......

After generating this multiExpr object, I then proceed to step Tutorial step 2.a.1 Choosing the soft-thresholding power as follows:

powers = c(c(1:10), seq(12,20, by=2));
    # Initialize a list to hold the results of scale-free analysis
      powerTables = vector(mode = "list", length = nSets);
    # Call the network topology analysis function for each set in turn
      for (set in 1:nSets){
      powerTables[[set]] = list(data = pickSoftThreshold(multiExpr[[set]]$data, powerVector=powers,verbose = 2)[[2]])}

However, a plot of the results suggests that my normalization methodology worked well for only three cohorts (I06, I10, and I15) whereas the others do not seem to reach appropriate scale free topology values.

Scale Free Topology

Thanks in advance! Eric

wgcna • 304 views
ADD COMMENTlink written 5 months ago by armstrong0

Doesn't the tutorial mention to use rlog or vst normalized values as input?

ADD REPLYlink written 5 months ago by piyushjo520

It does, but DESeq2 does not accept TPM data as an input becasue the TPM expression count values aren't integers. Thus, I decided to go the voom transformation route.

ADD REPLYlink written 5 months ago by armstrong0
1
gravatar for andres.firrincieli
5 months ago by
andres.firrincieli820 wrote:

I do not get the filtering steps 2 and 3. First, unless you have the exact lenght of the transcripts I would not use TPM on meta-transcriptomic data. Second, your transcripts are already scaled by a factor of 1 million (TPM) but then you use cpm() on TPM normalized read count to filter out low expressed genes. This could completely mess up the normalization.

Personally, I would start with the raw read count data, then use cpm() to filter low count genes and then use the vst() - rlog() transformation

ADD COMMENTlink written 5 months ago by andres.firrincieli820

I see your point about doubling down on the "per million" normalization step. However, I started with the TPM matrix because using raw counts greatly reduced the scale free topology values for all cohorts. Here's an example using raw counts and vsd:

<h6>Step 1 - Load holobiont expression data</h6>
RNAseq_raw <- read.table(argsL$readCount,  h=T, stringsAsFactors = F, row.names=1)
<h6>Step 2 - Prepare DESeq Object</h6>
library("DESeq2")

colData <- data.frame(Colony = substr(unique(colnames(RNAseq_raw)), 1, 9), 
                      Ile = substr(unique(colnames(RNAseq_raw)), 1, 3))

dds <- DESeqDataSetFromMatrix(countData = RNAseq_raw, colData = colData, 
                              design = ~ Ile)
<h6>Step 3 - Prefilter data to remove low count genes and perform DESeq</h6>
dds <- estimateSizeFactors(dds)
dds <- dds[rowSums(counts(dds, normalized=TRUE)) >= 10,]

dds <- DESeq(dds)
<h6>Step 4 - Normalize expression data</h6>
RNAseq_vsd <- assay(varianceStabilizingTransformation(dds, blind=FALSE))

Using this normalized count matrix as the input for the scale free topology analysis yields more consistent curves between the cohorts, but no data set reaches the recommend 0.9 cutoff. Scale Free Topology After VSD Transformation

ADD REPLYlink modified 5 months ago • written 5 months ago by armstrong0
2

In my experience the recommended cutoff depends upon the data and is not a standard cut off for all data. You would have to set it depending where you see the knee in the distribution, the value just below it.

ADD REPLYlink written 5 months ago by piyushjo520
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1591 users visited in the last hour