Question: How to compare 2 separate coexpression networks each with the same genes?
O.rka120 wrote:

I have 2 coexpression networks: (1) a control phenotype and (2) a treatment phenotype. They use the same subset of genes (n=1700), so they are directly comparable. I used WGCNA to create the 2 networks separately, with the same soft power threshold (beta=10) for both. This question is more conceptual than about specific commands or tools: I am looking for methods to compare the differences in network topology between the 2 phenotypes. My first idea was to calculate a differential coexpression network, as suggested in this paper https://www.nature.com/articles/srep13295, but the resulting matrix is a little awkward to work with. The only other approach I could think of was a pairwise comparison of modules between the two networks to get an overlap measure.

Does anyone know of any methods, tools, or approaches that work well for datasets like this?
My main research question is to identify the biggest changes in network topology between the 2 phenotypes.
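To make the differential-coexpression idea concrete, here is a minimal numpy sketch (my own, not the paper's implementation; all variable names and the toy data are made up): build a signed WGCNA-style adjacency (|cor|^beta) for each phenotype and take the element-wise absolute difference.

```python
import numpy as np

def diff_coexpression(expr_ctrl, expr_trt, beta=10):
    """Absolute difference between the two |cor|^beta adjacency matrices,
    i.e. a simple differential coexpression matrix.
    expr_* are samples x genes arrays sharing the same gene columns."""
    a_ctrl = np.abs(np.corrcoef(expr_ctrl, rowvar=False)) ** beta
    a_trt = np.abs(np.corrcoef(expr_trt, rowvar=False)) ** beta
    return np.abs(a_ctrl - a_trt)

# toy data: 20 control samples, 15 treatment samples, 5 genes
rng = np.random.default_rng(0)
D = diff_coexpression(rng.normal(size=(20, 5)), rng.normal(size=(15, 5)))
print(D.shape)  # (5, 5)
```

Gene pairs with the largest entries in `D` are the ones whose coexpression changes most between phenotypes.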

Dataset description:

It's a dataset with 49 samples from patients without the disease and 34 from patients with the disease. 28 of the patients have a twin in the dataset (n=28*2=56) and 27 do not have a twin accounted for in the dataset, for a total of 56 + 27 = 83 samples/patients. The gene subset I'm investigating contains 1700 genes, so each network consists of a 1700x1700 symmetric similarity matrix. There is one network for the healthy state and a separate one for the diseased state.

modified 19 months ago • written 20 months ago by O.rka120

I have seen people analyze the topology of each network independently (for example node degree, betweenness centrality, etc.), remove the hub genes common to both networks, and keep only the genes with the highest centrality values as treatment-specific hub genes (differential connectivity). In WGCNA, I have also seen people use the control network as a reference network to calculate a preservation Z-score, which finds the modules most related to the treatment.
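A rough numpy-only sketch of that differential-connectivity idea (toy adjacencies, all numbers made up): compute each gene's weighted degree (connectivity) in both networks and rank genes by the treatment-minus-control gap.

```python
import numpy as np

def differential_connectivity(adj_ctrl, adj_trt):
    """Weighted degree (row sum, excluding self) per gene in each
    network, and the treatment-minus-control difference."""
    k_ctrl = adj_ctrl.sum(axis=1) - np.diag(adj_ctrl)
    k_trt = adj_trt.sum(axis=1) - np.diag(adj_trt)
    return k_trt - k_ctrl

# toy symmetric adjacencies for 4 genes
a = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.2],
              [0.1, 0.1, 0.2, 1.0]])
b = a.copy()
b[0, 2] = b[2, 0] = 0.9   # genes 0 and 2 gain a strong edge under treatment
dk = differential_connectivity(a, b)
print(np.argmax(dk))  # 0 (gene 0 ties with gene 2 for the biggest gain)
```

Genes with large `|dk|` are candidate condition-specific hubs; real analyses would add centrality measures (e.g. betweenness) on top of this.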

That's a pretty interesting idea, using the control network as a baseline for looking at distributions. Thanks! What about module comparison?


As I mentioned, WGCNA, by using the control network as the reference, can tell us which modules are most dysregulated in the treatment: the higher the Z-score, the more a module is related to the treatment.

I'm having trouble finding references for this approach, but it seems very useful. I have been having difficulty relating modules to categorical metadata. Is this what you are referring to above? Or do you mean investigating a particular attribute, e.g. intramodular connectivity of genes, using all of the values from the control network as a background distribution, and doing a Mann-Whitney test (or similar) for the module of interest against that background? Sorry to ask so many questions; I just want to make sure I understand what you are suggesting.

modified 19 months ago • written 19 months ago by O.rka120

Exactly, I mean what you described: investigating a particular attribute, e.g. intramodular connectivity of genes, using all of the values from the control network as a background distribution, and doing a Mann-Whitney test (or similar) for the module of interest against that background. The outcome of this procedure would be, for instance: if the blue module in your network appears significantly correlated with the treatment, then by using the control network as a reference you can state that this correlation is meaningful and not just chance (higher Z-score). Note that the "relating modules to categorical metadata" step is performed beforehand, to find the modules correlated with your trait of interest. Personally, since I don't have access to a trait file (for example, measured cholesterol, etc.), I have to prepare a binary trait file (to show which modules or genes are related to the treatment). I have already accumulated WGCNA code for each step, and I can share it with you if you want, but you will need to change the inputs, etc. based on your purpose.
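A small sketch (in Python with scipy; the connectivity numbers are simulated, not real) of the test described above: compare one module's intramodular connectivity values against all values from the control network used as a background distribution.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# simulated stand-ins: connectivities of all 1700 genes in the control
# network (background) vs. one 60-gene module in the treatment network
background = rng.normal(loc=5.0, scale=1.0, size=1700)
module = rng.normal(loc=6.5, scale=1.0, size=60)

stat, p = mannwhitneyu(module, background, alternative="two-sided")
print(p < 0.05)  # True: the module's connectivity differs from background
```

In the real workflow, `background` would be the intramodular connectivity of every gene computed from the control adjacency matrix, and `module` the same attribute restricted to the module of interest.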

If you have code snippets that illustrate how to correlate the eigengenes with categorical data like treatment, that would be very useful. I'm doing most of the computation outside of R in Python, so a good understanding of the methods is crucial. For example, with continuous data I just calculate the 1st principal component of the module and then the Pearson correlation with the continuous metadata, but you can't do that with categorical data.

My code is in R. My trait is binary, so I made a trait file of 1s and 0s and use

moduleTraitCor = cor(MEs, datTraits, use= "p")

for the correlation, as shown in the WGCNA tutorials, although I am not sure if I am doing it correctly:

https://ibb.co/mftzq7
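In case a Python version of that step helps (my own sketch on toy data, not official WGCNA code): take a module's first principal component as the eigengene and correlate it against a 0/1-encoded trait; with a binary trait this Pearson value is the point-biserial correlation.

```python
import numpy as np

# toy module: 30 samples x 10 genes, expression shifted in diseased samples
rng = np.random.default_rng(2)
trait = np.array([0] * 15 + [1] * 15)              # 0 = control, 1 = disease
expr = rng.normal(size=(30, 10)) + trait[:, None] * 2.0

# eigengene = first principal component of the (centered) module expression
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
eigengene = u[:, 0] * s[0]

r = np.corrcoef(eigengene, trait)[0, 1]
print(abs(r) > 0.7)  # True: strong |correlation| with the binary trait
```

Note the sign of `r` is arbitrary here, since an SVD component is only defined up to sign.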

modified 19 months ago • written 19 months ago by A3.5k

I'm getting values for this, but I don't know if they make sense, since we wouldn't necessarily expect a linear correlation between a continuous variable and a categorical variable, right?

Actually, this is also my question. I read somewhere that people use the biweight midcorrelation (bicor) function instead of the Pearson cor function, or, if I'm not wrong, take the log of the binary trait file. However, I got confused about which is correct... please share with me if you find the answer. Thanks.
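One small reassurance on this point (my own check, not from the WGCNA docs): Pearson correlation against a 0/1 vector is exactly the point-biserial correlation, a standard statistic for binary-vs-continuous association, so using plain `cor()` on a 0/1 trait is defensible; bicor is just a robust alternative.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(3)
trait = np.array([0] * 20 + [1] * 20)
x = rng.normal(size=40) + trait     # some continuous variable (e.g. an eigengene)

r_pearson = pearsonr(x, trait)[0]
r_pb = pointbiserialr(trait, x)[0]
print(np.isclose(r_pearson, r_pb))  # True: they are the same statistic
```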

modified 19 months ago • written 19 months ago by A3.5k

Similar to module preservation (comparing the treatment network with the control network), comparing modules with each other (conservation) will give you the conserved genes between two modules of interest; this is implemented in WGCNA.
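For the pairwise module-overlap idea, here is a sketch (Python with scipy; the module sizes are made up) using Fisher's exact test on the gene overlap between one control module and one treatment module:

```python
from scipy.stats import fisher_exact

n_genes = 1700                        # genes common to both networks
mod_ctrl = set(range(0, 100))         # hypothetical control module (100 genes)
mod_trt = set(range(50, 170))         # hypothetical treatment module (120 genes)

# 2x2 contingency table for the overlap
both = len(mod_ctrl & mod_trt)
only_ctrl = len(mod_ctrl) - both
only_trt = len(mod_trt) - both
neither = n_genes - both - only_ctrl - only_trt

odds, p = fisher_exact([[both, only_ctrl], [only_trt, neither]])
print(p < 0.05)  # True: an overlap of 50 genes is far more than chance
```

Running this for every control/treatment module pair gives an overlap-significance matrix, which is essentially what WGCNA's module-correspondence plots show.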

Jean-Karim Heriche21k wrote:

You could simultaneously cluster the two graphs using a tensor factorization approach. For more details, see some notes from a workshop I taught.

Looking forward to going deep into this Monday morning when I get to the lab. Thanks for posting this and making such a detailed tutorial. Do you think it would work well with only 2 networks? Looking at the examples, I think the test case is (30, 30, 6)?

Actually, when I post a question I usually describe the experimental design, the number of samples, what the treatment is, etc. If you could describe all of that this Monday morning, that would be great. By the way, I would be glad to share whatever I know with you.

Step 4 in the link below is about comparing networks.

modified 19 months ago • written 19 months ago by A3.5k

I will give more details and input on Monday, but I've updated the question to include the dataset description. Thanks.

It would work even with just one graph. The question is whether there is enough structure in the data for the factorization to reveal it.

Do you know of any tutorials that explain the resulting dimensionality of the transformations, before I go deep into the Wikipedia page?


I am not sure what you mean by "explain the resulting dimensionality" but I am assuming your question is about how to interpret the resulting factor matrices. For a genes x genes x conditions tensor, a CP factorization with k components will give you two matrices (actually three but two are the same) of dimensions genes x k and conditions x k. The first one can be interpreted as a gene membership/activity in each of the k clusters and the second can be interpreted as the presence/activity of the k clusters in the different conditions.

When you say "2 are the same", do you mean they are supposed to be identical? I tried using sktensor, but I wasn't sure if its implementation was similar to rTensor (in Python and R, respectively). I am using the parafac method in tensorly with a subset of the iris dataset to test: setosa [0:50] for network A and versicolor [50:100] for network B. In the case below, I have 4 "genes", 2 conditions, and 2 components/clusters. How can I interpret these matrices? The first 2 matrices have the same shape, but they are not equal to each other. Also, the last matrix has all negative values whose magnitudes do not scale to one. Should I start a new thread for this conversation?

import pandas as pd
from sklearn.datasets import load_iris

# X_iris: 150 samples x 4 features of the iris dataset
X_iris = pd.DataFrame(load_iris().data)

df_sim_A = X_iris.iloc[0:50, :].corr()    # setosa correlation network
df_sim_B = X_iris.iloc[50:100, :].corr()  # versicolor correlation network

df_sim_A.shape, df_sim_B.shape, df_sim_A.equals(df_sim_B)
# ((4, 4), (4, 4), False)

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# stack the two 4x4 similarity matrices into a 4x4x2 tensor
T = tl.tensor(np.stack([df_sim_A, df_sim_B]).transpose(1, 2, 0))
P = parafac(T, rank=2)

print([x.shape for x in P])
# [(4, 2), (4, 2), (2, 2)]

for x in P:
    print(x)

# [[-1.64835237 -0.42324339]
#  [-1.54643444 -0.47574374]
#  [-1.96478944  0.24035087]
#  [-1.90195131  0.14699458]]
# <NDArray 4x2 @cpu(0)>

# [[ 0.46201557  0.66333921]
#  [ 0.43344127  0.74557299]
#  [ 0.55074777 -0.37620981]
#  [ 0.53313271 -0.22997329]]
# <NDArray 4x2 @cpu(0)>

# [[-0.53452396 -1.54660926]
#  [-0.81901771 -0.55149847]]
# <NDArray 2x2 @cpu(0)>
modified 19 months ago • written 19 months ago by O.rka120

The first two factor matrices should be identical up to scaling which is the case in your example. You can rescale any mode as long as the inverse scaling is applied to another mode, this doesn't change the solution to the factorization problem. Different libraries may handle the scaling differently. For example, you could also normalize the vectors in the factor matrices to length one and absorb the weights into the superdiagonal of the core tensor.
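A tiny numpy illustration of that scaling point (my own toy example, not tied to any particular tensor library): two factor matrices that differ only by a per-column rescaling represent the same solution, and normalizing the columns to unit length makes them match up to sign.

```python
import numpy as np

rng = np.random.default_rng(4)
f1 = rng.normal(size=(4, 2))        # one factor matrix
f2 = f1 * np.array([3.0, -0.5])     # same matrix, columns rescaled

def unit_columns(m):
    """Rescale each column to unit Euclidean length."""
    return m / np.linalg.norm(m, axis=0)

n1, n2 = unit_columns(f1), unit_columns(f2)
# identical up to a per-column sign flip
match = np.allclose(np.abs(n1), np.abs(n2))
print(match)  # True
```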

Thanks for being helpful with all of those questions. I still need to look more into the details but it's a tool that I didn't even know existed. Really inspiring.

I asked about our mutual uncertainty here:

C: WGCNA modules and categorical traits relationship

A Biostars user thinks there is no problem with Pearson correlation even for non-quantitative traits.

modified 19 months ago • written 19 months ago by A3.5k

Interesting answers. I will take a look at the regression. However, at that point they wouldn't be correlation values anymore, would they? Also, if you get some extra time, can you take a look at this question about modulePreservation: https://support.bioconductor.org/p/106661/

Kevin had used correlation itself instead of principal components in our case.

modified 19 months ago • written 19 months ago by A3.5k

Sorry, may I please have access to a sample input file for tensor factorization, as used in the workshop?

I tried to apply your steps to my data, but I get this error:

> for (i in 1:length(fileList)){
+
+
+ }
Error in A[, , i] <- as.matrix(read.delim(file.path(dataDir, fileList[i]),  :
incorrect number of subscripts

I think this is a programming error. How did you define A? If you're using triple indexing, you need A to be a 3-d array, and assigning a matrix to an array slice works only if the array has been pre-allocated.
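The numpy analogue of that fix, as a sketch (Python rather than R, with made-up sizes): pre-allocate a genes x genes x conditions array before assigning each matrix into a slice.

```python
import numpy as np

n_genes, n_conditions = 100, 2
# pre-allocate the 3-d array so slice assignment has somewhere to go
A = np.empty((n_genes, n_genes, n_conditions))

for i in range(n_conditions):
    # in the real workflow each slice would be an adjacency matrix read
    # from a csv file; a labelled identity matrix stands in here
    A[:, :, i] = np.eye(n_genes) * (i + 1)

print(A.shape)  # (100, 100, 2)
```

Without the pre-allocation, `A[:, :, i] = ...` fails for the same reason the R code does: there is no 3-d container to index into.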

Thank you for paying attention,

I extracted two matrices from my data:

source("genie3.R")
library(igraph)

expr.matrix <- read.expr.matrix("bigm.txt", form="rows.are.genes")
## first rows of expr.matrix:
##                 X1       X2       X3       X4
## AT1G01010 10.85486 11.72960 10.74060 10.69040
## AT1G01020 11.04532 10.90914 11.15791 11.00391
## AT1G01030 10.43474 10.62670 10.35366 10.54900
## AT1G01040 11.21519 11.00323 11.27382 11.28544
## AT1G01050 12.11774 11.53287 11.82480 11.78413
## AT1G01060 10.38341 10.43773 10.42992 11.22040

dim(expr.matrix)
## 4030 205

expr.matrix <- expr.matrix[1:100, 1:10]
weight.matrix <- get.weight.matrix(expr.matrix)

Gsi <- graph.data.frame(edge_listsi, directed = F)
Asi <- get.adjacency(Gsi, sparse = F, attr = "im", type = "both")

Then I wrote my adjacency matrix to a csv file. I did this procedure twice, to get two matrices, one for treatment and one for control.

Then I ran your code:

library(rTensor)

## Where the data is
## This is a simulated dynamic graph with 100 nodes and 10 time points
## Each matrix is in a csv file
fileList <- dir(path=dataDir, pattern = ".csv")

## Read all adjacency matrices into an array
A <- array(as.numeric(NA), dim=c(100,100,2)) # There are 2 matrices of size 100 x 100

for (i in 1:length(fileList)){
for (i in 1:length(fileList)){