Question: How To Turn A Two Column Data File Containing Pairs Into A Series Of Clusters
2.1 years ago by
stefano.campanaro wrote:

Hi all, I intersected two data sets and now have a tab-delimited file with two columns, like:

AA      BB
BB      CC
CC      DD
BB      AA
EE      FF
FF      GG
GG      HH
II      JJ
JJ      II
II      KK
...     ...

and I would like to convert these "one-to-one interactions" into clusters, considering that AA interacts with BB, BB with CC and CC with DD (so AA, BB, CC and DD form a cluster). Similarly, EE, FF, GG and HH form another cluster, but none of these elements interacts with any element of the first cluster, and so on. I would like to obtain something like:

AA BB CC DD 
EE FF GG HH
II JJ KK

...

Could you please help me work out how to do that?

software error • 535 views

Question: if somewhere in the file you have BB GG, do you want to get a single cluster (AA, BB, CC, DD, EE, FF, GG and HH)?

"I would like to obtain something like AA BB CC DD EE FF GG HH II JJ KK"

I don't understand this line.

written 2.1 years ago by Bastien Hervé

Why is this tagged as a software error question?

written 2.1 years ago by Jean-Karim Heriche

2.1 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche wrote:

The data represents a graph in edge list format, i.e. each line is an edge of the graph specifying the two nodes that are connected. What you call clusters seems to be the connected components of this graph. So read the data into a graph structure then extract the connected components, e.g. in R with the igraph package, something like this (untested):

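library(igraph)
edge.list <- as.matrix(read.table("edge_list.txt",...)) # read the file as appropriate, turn data into a two-column matrix for use by igraph
G <- graph_from_edgelist(edge.list, directed = FALSE) # build an undirected graph from the edge list
clusters <- components(G) # clusters$membership gives the component id of each node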
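
For readers without R, the same idea can be sketched in plain Python. This is only an illustrative sketch, not what igraph does internally: it builds an adjacency map from the edge list and collects each connected component with a breadth-first search. The function name `connected_components` and the inline `edges` list (copied from the question) are assumptions made for the example.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph given as (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:              # every node is a key, so no keys are added here
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:                     # breadth-first search from this start node
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Edge list taken from the question
edges = [("AA", "BB"), ("BB", "CC"), ("CC", "DD"), ("BB", "AA"),
         ("EE", "FF"), ("FF", "GG"), ("GG", "HH"),
         ("II", "JJ"), ("JJ", "II"), ("II", "KK")]

for component in connected_components(edges):
    print(" ".join(component))
```

Because the graph is treated as undirected, the order of the two columns on a line (AA BB versus BB AA) does not matter.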

2.1 years ago by
Karolinska Institutet, Sweden
Bastien Hervé wrote:

In pseudocode, if you want to try writing it yourself:

Initialize a 2D array: 2Darray

Import your file into a dataframe

Sort the dataframe by column one, then by column two, to get something like:

AA BB

BB AA

BB CC

CC DD

For each line of your dataframe:

  • If it's the first line of the dataframe, create an array and append the first and second elements of the line
  • Else, does the first element exist in array?
    • If yes, append the second element to array
    • Else, append array to 2Darray, reinitialize array, then append the first and second elements to array

At the end, append the last array to 2Darray; 2Darray will then hold your clusters.

Untested
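
The pseudocode above relies on sorted input in which each new line starts with an element already in the current array. A possible alternative that drops this assumption is a union-find (disjoint-set) structure, which handles the pairs in any order. The sketch below is one way to write it; the name `cluster_pairs` and the example pairs are assumptions made for illustration.

```python
def cluster_pairs(pairs):
    """Group the elements of (first, second) pairs into clusters using union-find."""
    parent = {}                          # element -> parent element; roots point to themselves

    def find(node):
        parent.setdefault(node, node)    # an unseen element starts as its own root
        root = node
        while parent[root] != root:
            root = parent[root]
        while parent[node] != root:      # path compression: point visited nodes at the root
            parent[node], node = root, parent[node]
        return root

    for first, second in pairs:
        root_a, root_b = find(first), find(second)
        if root_a != root_b:
            parent[root_b] = root_a      # union: attach one tree under the other

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return [sorted(members) for members in clusters.values()]


# The pairs are deliberately not sorted
pairs = [("CC", "DD"), ("EE", "FF"), ("AA", "BB"), ("GG", "HH"),
         ("BB", "CC"), ("FF", "GG"), ("II", "KK"), ("JJ", "II")]

for cluster in sorted(cluster_pairs(pairs)):
    print(" ".join(cluster))
```

No sorting of the input is needed here: a pair that links two clusters built earlier simply joins their roots.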


2.1 years ago by
Mexico
JC wrote:

You can use Perl or Python for that:

#!/usr/bin/perl

use strict;
use warnings;

my %net_of  = ();  # node => id of the net it currently belongs to
my %members = ();  # net id => { node => 1, ... }
my $next_id = 0;

while (<>) {
  chomp;
  my ($x, $y) = split ' ';
  next unless defined $y;  # skip blank or incomplete lines

  # put any node not seen before into a new net of its own
  foreach my $node ($x, $y) {
    next if exists $net_of{$node};
    $net_of{$node} = $next_id;
    $members{$next_id}{$node} = 1;
    $next_id++;
  }

  # the pair links two nets: merge them unless they are already the same one
  my ($keep, $drop) = @net_of{$x, $y};
  next if $keep == $drop;
  foreach my $node (keys %{ $members{$drop} }) {
    $net_of{$node} = $keep;
    $members{$keep}{$node} = 1;
  }
  delete $members{$drop};
}

# write one net per line, sorted by net id
foreach my $net (sort { $a <=> $b } keys %members) {
  print join("\t", sort keys %{ $members{$net} }), "\n";
}

Example:

$ perl graph.pl < data.txt
AA      BB      CC      DD
EE      FF      GG      HH
II      JJ      KK
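
Since Python is mentioned as an alternative, a rough equivalent might look like the sketch below. It keeps one set per cluster and, when a line links two existing clusters (the BB GG case raised in the comments on the question), folds the smaller set into the larger one. The function name `clusters_from_lines` and the sample `lines` are assumptions made for the example.

```python
def clusters_from_lines(lines):
    """Group whitespace-separated pairs into clusters, merging when a pair links two."""
    cluster_of = {}                      # node -> the set object holding its cluster
    for line in lines:
        fields = line.split()
        if len(fields) < 2:              # skip blank or incomplete lines
            continue
        a, b = fields[0], fields[1]
        set_a = cluster_of.setdefault(a, {a})
        set_b = cluster_of.setdefault(b, {b})
        if set_a is set_b:               # both nodes are already in the same cluster
            continue
        if len(set_a) < len(set_b):      # fold the smaller set into the larger one
            set_a, set_b = set_b, set_a
        set_a |= set_b                   # in-place union keeps the same set object
        for node in set_b:
            cluster_of[node] = set_a
    unique = {id(s): s for s in cluster_of.values()}
    return sorted(sorted(s) for s in unique.values())


# Sample lines in shuffled order, including a blank line
lines = ["AA\tBB", "EE\tFF", "BB\tCC", "FF\tGG", "",
         "II\tJJ", "CC\tDD", "GG\tHH", "II\tKK"]

for cluster in clusters_from_lines(lines):
    print("\t".join(cluster))
```

Any iterable of lines can be passed in, so an open file handle or `sys.stdin` should work in place of the sample list.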

Dear JC, I tested your script on a huge data file produced by Mash and it works! Thanks a lot! Very nice script :)

written 2.1 years ago by stefano.campanaro

Glad to help, how large was the data?

written 2.1 years ago by JC