Question

Proper construction of data matrix for WGCNA (weighted gene coexpression network analysis)

7.8 years ago

themantalope ▴ 40

Hi All,

I would just like some clarification of terminology regarding a detail of gene coexpression network construction. Let's say I have two RNA-seq datasets, each dataset containing n replicates, and each dataset representing sequencing data from the same biological system in two different experimental conditions. How should I construct the data matrix for input to something like WGCNA if I want to analyze gene coexpression networks across experimental conditions/interventions?

What I imagine is that each row of the matrix represents data from one gene, and each column represents data collected from one of the replicates in an experimental condition. So for example, one particular row of the matrix would look like this:

          c1R1 ... c1Rn c2R1 ... c2Rn
  gene x [val, ... val, val, ... val]

Where the first column c1R1 corresponds to the data from the first experimental replicate in the first condition, and the last column c2Rn corresponds to the nth experimental replicate in the 2nd experimental condition. For coexpression analysis, each row is then correlated with every other row in a pairwise fashion, an adjacency matrix is constructed from the correlation analysis and then other analyses such as module detection can be conducted based on the resulting adjacency matrix.

I just want to verify that this is an appropriate method for organizing data if one wishes to construct coexpression networks for genes "across an intervention".
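The layout described above can be sketched in a few lines. This is only an illustration with NumPy, not the WGCNA implementation: the function names, the toy values, and the soft-threshold power `beta=6` are assumptions made for the example. Note also that the WGCNA R package expects samples as rows and genes as columns, so a genes-by-samples matrix like this one would be transposed before being passed to it.

```python
import numpy as np

def build_expression_matrix(cond1, cond2):
    """Join two genes x replicates arrays column-wise: [c1R1..c1Rn, c2R1..c2Rn]."""
    return np.hstack([cond1, cond2])

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency |cor(gene_i, gene_j)| ** beta from a genes x samples matrix.

    beta is an illustrative value (assumption); in practice it is chosen per dataset.
    Genes with zero variance would give NaN correlations and should be removed first.
    """
    cor = np.corrcoef(expr)  # each row (gene) is treated as one variable
    return np.abs(cor) ** beta

# Toy data (made up): 3 genes, 3 replicates per condition.
cond1 = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.5],
                  [5.0, 3.0, 1.0]])
cond2 = np.array([[4.0, 5.0, 6.0],
                  [8.0, 10.0, 12.5],
                  [0.5, 2.0, 1.0]])

expr = build_expression_matrix(cond1, cond2)  # genes x samples, shape (3, 6)
adj = soft_threshold_adjacency(expr)          # genes x genes, shape (3, 3)
samples_by_genes = expr.T                     # orientation expected by the WGCNA R package
```

Module detection would then operate on `adj` (or on a topological overlap matrix derived from it).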

RNA-Seq coexpression WGCNA • 2.4k views

score 5 · Accepted Answer · 2016-07-06

Hi mantale1,

That's exactly correct. Because you include replicates from both conditions, the network will reflect both the specific pathways that are co-regulated during your condition of interest and whatever genes are constitutively expressed in the organism.

If you were then to start adding samples from other, unrelated conditions, you would improve the accuracy of the global co-expression network because of the added information, but you would also dilute the signal resulting from the intervention you are interested in.

A couple of things you might consider:

1) Depending on the number of replicates you have for each condition, you may end up with a very noisy co-expression network. Most of the methods were developed for microarray data, where you are likely to have many more samples. With fewer than 10 replicates across both conditions, you are likely to have a large number of spurious correlations.

2) You might consider filtering out genes that are not differentially expressed across your intervention. This will help eliminate spurious correlations and also bring out the signal that is specifically due to the intervention.
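As a rough illustration of the filtering idea in point 2, the sketch below keeps only genes whose mean expression changes between the two conditions by at least a chosen log2 fold change. The function name, the threshold, the pseudocount, and the toy values are all assumptions for the example, and a plain fold-change cutoff is only a crude stand-in for a proper differential expression test.

```python
import numpy as np

def filter_by_fold_change(expr, n_cond1, min_abs_log2fc=1.0, pseudocount=1.0):
    """Keep genes whose condition means differ by at least min_abs_log2fc.

    expr: genes x samples array; the first n_cond1 columns are condition 1,
    the remaining columns are condition 2.
    Returns the filtered matrix and the boolean mask of kept genes.
    """
    mean1 = expr[:, :n_cond1].mean(axis=1)
    mean2 = expr[:, n_cond1:].mean(axis=1)
    # The pseudocount avoids log2(0) for genes with no expression in one condition.
    log2fc = np.log2(mean2 + pseudocount) - np.log2(mean1 + pseudocount)
    keep = np.abs(log2fc) >= min_abs_log2fc
    return expr[keep], keep

# Toy data (made up): 3 genes, 3 replicates per condition.
expr = np.array([
    [10.0, 12.0, 11.0, 50.0, 55.0, 48.0],  # up in condition 2
    [20.0, 21.0, 19.0, 20.0, 22.0, 18.0],  # unchanged
    [40.0, 42.0, 38.0,  5.0,  6.0,  4.0],  # down in condition 2
])

filtered, keep = filter_by_fold_change(expr, n_cond1=3)
```

Here the first and third genes pass the filter and the unchanged gene is dropped; the filtered matrix is what would then go into the correlation step.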

Keith