Question: String - Number Of Intermediate Interactors Between Proteins
2
gravatar for pfbusson
6.2 years ago by
pfbusson20
France
pfbusson20 wrote:

Hello,

Using STRING-DB, I need to build a data frame containing for each pair of proteins the number of interactors between them...

For example, if protein A and protein B have a common line in the node_node_links table, they are in direct interaction, and the number of intermediates will be 0. If protein A and protein D are not in direct interaction, but protein A interacts with protein C which in turn interacts with protein D, the number of intermediates between A and D will be 1. And so on, for the 20770 proteins...

A B 0
A D 1
...

I wonder if any of you ever faced this problem and would be able to give me a time and memory efficient solution.

Thanks !

interaction protein network • 3.5k views
ADD COMMENTlink modified 6.2 years ago by Arnaud Ceol850 • written 6.2 years ago by pfbusson20
1

"data frame" suggests you want a solution in R?

ADD REPLYlink written 6.2 years ago by Neilfws48k

yes Neil, I know him: that's what pfbusson wants.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Pierre Lindenbaum127k
1
gravatar for B. Arman Aksoy
6.2 years ago by
B. Arman Aksoy1.2k
New York, NY
B. Arman Aksoy1.2k wrote:

If I were to do it now, I would pick either of the following two methods:

1) Calculate the distance for each node pair using a graph library

This approach requires you to convert the data into a graph model for your library of choice. I previously did a similar thing with Jung, which is Java-based, and was really happy with it. It is hard to estimate the required memory, but I was able to calculate the length of shortest path for all possible node pairs on a PPI network I obtained from GeneMania (human) in a few hours on a standard laptop -- so it was not that bad.

If you want to see an example, you can check out this repository for an old project of mine and you will find my Java implementation that utilizes Jung (the documentation is a little bit scarce, but the code is simple):

https://bitbucket.org/armish/pathwayguy

and here is the code that calculates the pairwise shortest-path distances:

AbstractPairwiseDistanceMethod.java

I am guessing you can also go with the R-library igraph to do this calculation; I never tried it, but from the things I heard from my colleagues, I can tell it is a good alternative to Jung (especially if you are an R-person).

2) Work with adjancecy matrix

This requires you to convert the interaction file into an adjacency matrix (all nodes vs all nodes) and then multiplying it by itself, i.e. taking powers. I believe this is also not that memory intensive, especially if you go with a matrix multiplication method that is optimized for sparse matrices (R and Matlab both have many of those). For this alternative, the idea goes like this:

  • You create the adjancency matrix and every non-zero value in a cell indicates a distance of 0 for the corresponding node pair (A^1)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 1 for the corresponding node pair (A^2)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 2 for the corresponding node pair (A^3)
  • You multiply the matrix by itself and now every non-zero value in a cell indicates a distance of 3 for the corresponding node pair (A^4) ...

It is not a great approach, but surprisingly people who are familiar with matrix-based operations in either R or Matlab find this approach relatively easier to implement.

ADD COMMENTlink modified 3 months ago by RamRS26k • written 6.2 years ago by B. Arman Aksoy1.2k

+1 for suggesting igraph

ADD REPLYlink written 6.2 years ago by Neilfws48k
1
gravatar for Neilfws
6.2 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I don't have the node_node_links table, but let's assume that you can obtain the partners as a tab-delimited text file like this:

A    B
A    C
B    C
C    D

Then a graph can be created using R igraph:

library(igraph)
d <- read.table("myfile.txt", sep = "\t", header = F, stringsAsFactors = T)
g <- graph.data.frame(d)

A list of all igraph methods is here. Something like what you want can be achieved using shortest.paths():

shortest.paths(g)
  A B C D
A 0 1 1 2
B 1 0 1 2
C 1 1 0 1
D 2 2 1 0

I don't know how long this would take using the STRING database; it could potentially be a very long time.

ADD COMMENTlink modified 3 months ago by RamRS26k • written 6.2 years ago by Neilfws48k
1

the input would be ftp://string-db.org/STRING/9.1/human.links.txt.gz

ADD REPLYlink written 6.2 years ago by Pierre Lindenbaum127k
2

OK; just ran the above code using that data on a work server (24x X5680 Xeon CPU, 100 GB RAM); took about 15-20 minutes.

ADD REPLYlink written 6.2 years ago by Neilfws48k

This is exactly what I was looking for. Thanks a lot !

ADD REPLYlink written 6.2 years ago by pfbusson20
0
gravatar for Pierre Lindenbaum
6.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:

prior warning: pfbusson works in my lab.

I've implemented a java idea: https://github.com/lindenb/jvarkit/wiki/Biostar92368

Put all the interactions in a berkeleyDB with key=protein, value=set(interactors)

(...)
        while(r.hasNext())
            {
            String line=r.next();
            String tokens[]=tab.split(line);
            for(int i=0;i< 2;++i)
                {
                String a=tokens[i==0?0:1];
                String b=tokens[i==0?1:0];
                StringBinding.stringToEntry(a, key);
                OperationStatus status=c.getSearchKey(key, data, LockMode.DEFAULT);
                Set<String> partners; 
                if(status==OperationStatus.SUCCESS)
                    {
                    partners=bindingSet.entryToObject(data);

                    }
                else
                    {
                    partners=new HashSet<String>(1);
                    }
                partners.add(b);
                bindingSet.objectToEntry(partners,data);
                if(status==OperationStatus.SUCCESS)
                    {
                    status=c.putCurrent(data);
                    }
                else
                    {
                    status=c.put(key, data);
                    }
                }
            }
(...)

and then recursively scan all the paths to go from protein1 to protein2:

private int recursive(
        Transaction txn,
        final String endProt,
        Stack<String> path,
        int bestDepth,
        final int maxDepth
        )
    {
    DatabaseEntry key=new DatabaseEntry();
    DatabaseEntry data=new DatabaseEntry();
    StringBinding.stringToEntry(path.lastElement(), key);
    if(this.database.get(txn,key,data,null)==OperationStatus.SUCCESS)
        {
        Set<String> partners=this.bindingSet.entryToObject(data);
        for(String prot2:partners)
            {
            int depth=-1;
            if(prot2.equals(endProt))
                {
                depth=path.size()-1;
                }
            else if(path.size()<maxDepth && !path.contains(prot2))
                {
                path.push(prot2);
                depth=recursive(txn,endProt,path,bestDepth,maxDepth);
                path.pop();
                }

            if(depth!=-1 && (bestDepth==-1 || bestDepth>depth))
                {
                bestDepth=depth;
                }
            }
        }
    return bestDepth;
    }

Example:

$ cat input.txt
A1  A2
A2  A3
A3  A4
A4  A5
A1  A6

$ mkdir -p tmp
$  java -jar dist/biostar92368.jar -D tmp input.txt

A1  A2  0
A1  A3  1
A1  A4  2
A1  A6  0
A2  A3  0
A2  A4  1
A2  A5  2
A2  A6  1
A3  A4  0
A3  A5  1
A3  A6  2
A4  A5  0
ADD COMMENTlink modified 3 months ago by RamRS26k • written 6.2 years ago by Pierre Lindenbaum127k
0
gravatar for Arnaud Ceol
6.2 years ago by
Arnaud Ceol850
Milan, Italy
Arnaud Ceol850 wrote:

The networkx library for python has a method for it: networkx.shortest_path_length(self.network) http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html

ADD COMMENTlink written 6.2 years ago by Arnaud Ceol850
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1621 users visited in the last hour