Question: subgraph extraction based on the edges weights and graph connectivity
1
13 months ago by
arronar200
Austria
arronar200 wrote:

Hi.

Given a matrix that describes the edges' and their weights of a connected graph (see below) I want to extract a subgraph based on a threshold value x for the edges' weights. In literature, I read that one can search for the maximal x, such that the induced subgraph is connected. Since the initial graph is assumed connected, there must be a critical threshold x-critical that the extracted subgraph is connected for any x <= x-critical.

I was wondering how can this implemented in R. For example, my matrix (`weights.matrix`) looks like

``````| FROM | TO | WEIGHT |
| A    | B  | 0.0042 |
| A    | V  | 0.23   |
| G    | W  | 0.82   |
| ...  | ...| ...    |
``````

and I'm creating the whole graph, by using the `igraph` package like

``````g <- igraph::graph_from_data_frame(weights.matrix, directed = TRUE)
``````

Is there any way to check repeatedly using a different threshold value for the weights from min() to max() if the occurred graph is connected? I searched in google for such feature in `igraph` but couldn't find anything helpful. So any idea on how to implement this idea is welcome.

graph weight subgraph extraction R • 1.0k views
modified 13 months ago by ejm32440 • written 13 months ago by arronar200
1
13 months ago by
ejm32440
Boston, MA
ejm32440 wrote:

This will get you started, not sure that there is a more efficient way to approaching this:

``````library(igraph)

m <- expand.grid(LETTERS[1:10], LETTERS[1:10])
m\$Weight <- runif(nrow(m), 0.01, max = 5)
m <- m[m\$Var1!=m\$Var2, ] ##remove loop edges

g <- graph_from_data_frame(m)
plot(g, edge.curved=TRUE)

g1 <- induced_subgraph(g, c("A","B","C"))
plot(g1, edge.curved=TRUE)
#E(g1)\$Weight
for(xcrit in sort(E(g1)\$Weight)){
if(any(degree(subgraph.edges(g1, E(g1)[Weight > xcrit], delete.vertices = F))==0)){
print(paste0("Xcrit = ", xcrit))
break
}
}
#[1] "Xcrit = 4.5026016204874"
``````

Thank you for your answer but it confused me a little bit. Why did you choose as initial nodes the A,B and C? For some reason, you chose the induced network to be constructed by those nodes. As I understood from the literature,the induced graph should be the result of the whole process. Why choose one from the beginning? Here is a pdf referring that technique in pdf's page 15 , section 5 the fourth paraagraph. You can consider the edge relevance as the edge weight discussed here.

2

What's written in this section of the paper is simply a thresholding of the adjacency relevance score matrix, i.e. set all values to 0 if they are below a threshold. The 4th paragraph proposes an automatic way of setting the threshold. If you're interested in the limited k-walk approach, I don't think it's available in igraph but I've implemented it in my perl module Algorithm::Graph (using C code from the authors). I haven't implemented the automatic threshold selection though but it is usually not difficult to find a reasonable threshold for a given data type. For example for large protein interaction graphs, I use a default of 0.0001

Edit: clarification

I don't think that the 4th paragraph is referring to thresholding the adjacency matrix. As I understood they propose to apply the threshold to the edge relevance scores. I have already implemented the k-walk algorithm and the results I got are in the form of the dataframe described in the OP. So now I need some help on how to implement this auto-thresholding. It's kinda confusing me. If you have any idea... :-)

What's this 0.0001? A threshold for the adjacency matrix or for the relevance scores? Here is how the distribution of my edge relevance scores looks like. You suggest applying 0.0001 to this?

1

Sorry about the confusion. I meant thresholding of relevance scores (which you can view as new edge weights). I've mostly used this for protein interaction graphs so I don't know if the default I use is applicable to other graphs. I usually define a threshold based on exploration of the returned graph, not just on the distribution of values. One way of approaching the problem of finding a threshold automatically could be to find the edges that if removed would disconnect the graph (they are called bridges and I think there's an algorithm for finding them).

One way of approaching the problem of finding a threshold automatically could be to find the edges that if removed would disconnect the graph (they are called bridges and I think there's an algorithm for finding them).

Isn't what the fourth paragraph is talking about? Also based on what you chose the 0.0001 for protein-protein networks?

Yes to the first question. Looking for the smallest value associated with bridges will give you the threshold. My value selection is based on inspection of the graphs returned by the algorithm for different thresholds.

What kind of inspection? Are you looking for graphs with a specific number of expanded nodes or anything different?

- when querying with components of some known complexes or set of interactions, how many of the other known components do we recover in the subgraph ?
- the subgraphs should be of manageable size

You can have different criteria depending on what the graph represents and what is important for you.

I understood your question to be: "Given a set of nodes in a graph, I want to find the minimum score at which the these nodes are fully connected (i.e. an induced subgraph)." I will look at the paper and edit my answer accordingly.