Question: Weighted Gene Expression Network
shahroze7860 wrote:

I have a basic network question. I've been trying to research the typical methodology behind building a gene expression network. As I understand it so far, the steps are as follows: create a Pearson correlation matrix -> create an adjacency matrix (weighted or unweighted) -> create a topological overlap matrix (TOM; there are variations on this, such as the generalized TOM). I know the correlation values will be between -1 and 1. The unweighted adjacency matrix will be 0/1 based on a hard cutoff, whereas a weighted one will take values between 0 and 1, emphasizing the differences. My question is in two parts: Can you take a weighted adjacency matrix and give it further edge weights, or does edge weighting only apply to an unweighted adjacency matrix? Also, can the TOM hold values between 0 and 1, or is it a matrix of only 0s and 1s?

modified 2.5 years ago by Kevin Blighe65k • written 2.5 years ago by shahroze7860
Kevin Blighe wrote:

Network construction is more flexible than you may imagine - at virtually every step there are multiple possible ways in which the construction can proceed. Based on the terminology that you've used, I imagine that your main experience of networks to date has been with WGCNA?

Networks can be constructed (and weighted) based on any similarity or distance measure, be it correlation, Euclidean distance, or something else. Their construction may even be guided by known protein-to-protein and pathway interactions (as is performed with STRINGdb). The chosen measure then represents the weight of the edge between any 2 vertices (vertex = node or gene), and edges whose weight falls below a particular threshold can be removed as too weak. For example, we may construct a co-expression network based on pairwise correlations between all genes and then remove the edges with absolute Pearson r below 0.8, leaving only very strong connections. The point is that one can easily portray a network already based on correlations in the range -1 to +1. In fact, I have a simple tutorial for this, here: Network plot from expression data in R using igraph
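The thresholding described above can be sketched in base R. This is only a minimal illustration under stated assumptions: the toy matrix `expr`, the gene names, and the forced correlation between `gene1` and `gene2` are all made up for the example, and the 0.8 cutoff is the one quoted in the text rather than a recommended value.

```r
set.seed(1)

# Hypothetical toy expression matrix: 6 genes (rows) x 10 samples (columns)
expr <- matrix(rnorm(60), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))
# Make gene2 track gene1 closely so that at least one strong edge exists
expr["gene2", ] <- expr["gene1", ] + rnorm(10, sd = 0.1)

# Pairwise Pearson correlation between genes (cor() works on columns, hence t())
r <- cor(t(expr), method = "pearson")

# Keep edges with |r| >= 0.8; the retained weight is the correlation itself
w <- r
w[abs(w) < 0.8] <- 0
diag(w) <- 0  # no self-loops

# Edge list of the surviving connections (upper triangle only)
idx <- which(upper.tri(w) & w != 0, arr.ind = TRUE)
edges <- data.frame(from   = rownames(w)[idx[, "row"]],
                    to     = colnames(w)[idx[, "col"]],
                    weight = w[idx])
edges
```

The resulting `w` is a weighted adjacency matrix that still carries the sign of each correlation; a plotting package could take it, or the `edges` table, as input.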

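Regarding TOM: the term 'topological overlap matrix' was popularised by Steve Horvath for WGCNA (I believe), but the original logic behind it was described in a Science publication back in 2002: Hierarchical organization of modularity in metabolic networks. As far as I'm aware, the TOM is not the set of final modules itself; it is a gene-by-gene similarity matrix in which each entry reflects how many network neighbours 2 genes share, in addition to their direct connection. For a weighted adjacency matrix with values between 0 and 1, the TOM values can also lie anywhere between 0 and 1, so it is not restricted to 0s and 1s. In WGCNA, the modules are then derived by hierarchically clustering the TOM-based dissimilarity (1 - TOM), with modules essentially being just branches in the dendrogram that are defined based on a tree 'cut height'. Each module is then summarised by its first principal component (the 'module eigengene'), and these summaries can be correlated with each other.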
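For reference, the topological overlap measure as usually defined for weighted networks can be computed directly from an adjacency matrix: TOM_ij = (sum over u of a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij), where k_i is the connectivity (row sum) of gene i. The following is a hand-rolled base R sketch of that formula, not the WGCNA implementation; the toy data and the soft-thresholding power `beta = 6` are assumptions chosen only for illustration.

```r
set.seed(1)

# Hypothetical toy expression matrix: 6 genes (rows) x 10 samples (columns)
expr <- matrix(rnorm(60), nrow = 6,
               dimnames = list(paste0("gene", 1:6), NULL))

# Weighted (soft-thresholded) adjacency: a_ij = |cor_ij|^beta, values in [0, 1]
beta <- 6  # assumed power, for illustration only
a <- abs(cor(t(expr)))^beta
diag(a) <- 0

# Shared-neighbour term: l_ij = sum_u a_iu * a_uj (diagonal of a is 0)
l <- a %*% a

# Connectivity of each gene and the pairwise minimum min(k_i, k_j)
k <- rowSums(a)
kmin <- outer(k, k, pmin)

# Topological overlap; the diagonal is set to 1 by convention
tom <- (l + a) / (kmin + 1 - a)
diag(tom) <- 1

range(tom)           # all values fall between 0 and 1
diss <- 1 - tom      # dissimilarity that would be passed to hclust()
```

With a real data-set one would more likely call the function that WGCNA provides for this (`TOMsimilarity()`), but the sketch shows why the result is a continuous matrix rather than one of only 0s and 1s.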

A similar logic to TOM and its modules comes with community structure identification, which I mention briefly at the end of my tutorial (Step 4).

---

Specifically related to your questions, you can therefore weight a network in any way. The weights can represent correlation strength, Euclidean distance, or anything else such as reaction efficiency (enzymes), distance (kilometers / miles), et cetera. If you wish to set a threshold for edges to keep, as I mentioned above, then you can simply dichotomise the edge weights as being:

• 0 (below set threshold; no edge)
• 1 (above threshold; edge present)

The main idea that I want you to get from reading this answer, though, is that network construction is very flexible.

Kevin

For someone like me, the flexibility is sort of the problem: there's always an opportunity to make a change and observe the impact. Quick question (I'm not sure about your particular expertise): do you know the differences between constructing a gene expression network from Pearson correlation and from Euclidean distance?

Right, and that's in part why network analysis has not made the progress that it ought to have. It ends up confusing people, and results can be hyper-variable even after just a few minor modifications to the parameters. It is frowned upon by some experienced statisticians that I know. The figures can look absolutely beautiful but often they lack biological meaning, or the biological meaning is too difficult to extract.

Pearson correlation and Euclidean distance are obviously just 2 different statistical measures, both of which are better applied to normally-distributed data. If your data are not normally distributed or your dataset is small, you should be using Spearman correlation. If you still want to apply Euclidean distance in such a situation, you can most likely get away with it through review.

Correlation can be easy to explain, as it's just 'Does the expression of these genes go up and down together (positive), or in opposite directions (negative)?' - this is intuitive to most people and the biological meaning is easier to grasp.

Euclidean distance, the square root of the sum of the squared differences between 2 data-points across all samples, is obviously a bit more difficult to explain and the biological meaning can immediately become lost. Yet, Euclidean distance is arguably the most common distance metric used in clustering and, from my perspective, is warranted for use in network analysis.
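One practical difference between the 2 measures is that Pearson correlation ignores the overall level and scale of expression, whereas Euclidean distance does not. The sketch below uses 3 made-up gene profiles (`g1`, `g2`, `g3` are hypothetical) to show this, and also checks the known identity that, after z-scoring each gene, the squared Euclidean distance equals 2(n - 1)(1 - r).

```r
# Hypothetical expression profiles across 5 samples
g1 <- c(1, 2, 3, 4, 5)
g2 <- 10 * g1                        # same pattern, ten-fold higher level
g3 <- c(1.2, 1.9, 3.1, 3.8, 5.2)     # similar level, slightly noisy pattern

# Pearson sees g1 and g2 as perfectly co-expressed ...
r12 <- cor(g1, g2)

# ... while Euclidean distance on the raw values sees them as far apart
d12_raw <- sqrt(sum((g1 - g2)^2))

# After z-scoring each gene, the distance between g1 and g2 collapses to 0
z <- function(v) (v - mean(v)) / sd(v)
d12_z <- sqrt(sum((z(g1) - z(g2))^2))

# For z-scored profiles: d^2 = 2 * (n - 1) * (1 - r)
n   <- length(g1)
r13 <- cor(g1, g3)
d13 <- sqrt(sum((z(g1) - z(g3))^2))
all.equal(d13^2, 2 * (n - 1) * (1 - r13))
```

So on standardised data the 2 measures carry the same information; on raw data, a Euclidean network will also group genes by how highly they are expressed.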

---

If we have a small data-set of just 3 samples (cols) and 2 genes (rows):

```
expr <- matrix(c(2, 2, 2, 4, 1, 1), nrow = 2)  # filled column by column
expr
     [,1] [,2] [,3]
[1,]    2    2    1
[2,]    2    4    1
```

The Euclidean distance for gene1 and gene2 is:

```
sqrt( sum( (2-2)^2, (2-4)^2, (1-1)^2 ) )
[1] 2
```

Check:

```
dist(expr, method="euclidean")
  1
2 2
```

Spearman correlation is:

```
cor(t(expr), method="spearman")
          [,1]      [,2]
[1,] 1.0000000 0.8660254
[2,] 0.8660254 1.0000000
```
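To see where the 0.8660254 comes from, note that Spearman correlation is simply Pearson correlation computed on the ranks, with tied values receiving their average rank. A short sketch that rebuilds the same 2 x 3 matrix and does this by hand (the variable names `r1`, `r2`, and `rho` are just for illustration):

```r
# Same toy data as above: 2 genes (rows) x 3 samples (columns)
expr <- matrix(c(2, 2, 2, 4, 1, 1), nrow = 2)

# Rank each gene across samples; ties get the average rank by default
r1 <- rank(expr[1, ])  # gene1: 2, 2, 1 -> 2.5, 2.5, 1
r2 <- rank(expr[2, ])  # gene2: 2, 4, 1 -> 2, 3, 1

# Pearson correlation of the ranks equals the Spearman correlation
rho <- cor(r1, r2, method = "pearson")
rho  # 0.8660254, i.e. sqrt(3) / 2
```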

Not sure why I gave this simple example but anyway.

Kevin