How To Turn A Two-Column Data File Containing Pairs Into A Series Of Clusters
5.9 years ago

Hi All, I intersected two data sets and I have a tabulated file with two columns like:

AA      BB
BB      CC
CC      DD
BB      AA
EE      FF
FF      GG
GG      HH
II      JJ
JJ      II
II      KK
...     ...

and I would like to convert these one-to-one interactions into clusters: AA interacts with BB, BB with CC and CC with DD, so AA, BB, CC and DD form a cluster. Similarly, EE, FF, GG and HH form another cluster, since none of these elements interacts with an element of the first cluster, and so on. I would like to obtain something like:

AA BB CC DD 
EE FF GG HH
II JJ KK

...

Could you please help me with how to do that?

software error • 1.3k views

Question: if somewhere in the file you have BB GG, do you want to get a single cluster (AA, BB, CC, DD, EE, FF, GG and HH)?

I would like to obtain something like AA BB CC DD EE FF GG HH II JJ KK

I don't understand this line


Why is this tagged as a software error question?

5.9 years ago

The data represents a graph in edge list format, i.e. each line is an edge of the graph specifying the two nodes that are connected. What you call clusters seems to be the connected components of this graph. So read the data into a graph structure then extract the connected components, e.g. in R with the igraph package, something like this (untested):

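library(igraph)
edge.list <- as.matrix(read.table("edge_list.txt",...)) # read the file as appropriate, turn data into a two-column matrix for use by igraph
G <- graph_from_edgelist(edge.list, directed = FALSE)   # build an undirected graph from the edge list
clusters <- components(G)                               # clusters$membership gives the component id of each node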
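
For readers without R, the same idea (build the graph, then collect its connected components) can be sketched in plain Python. This is only an illustration and not igraph's implementation: the function name `connected_components` and the in-memory `edges` list are assumptions, and the pairs are copied from the question.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph given as (u, v) pairs."""
    # Build an undirected adjacency list: each pair links both nodes to each other.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:  # nodes are visited in order of first appearance
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Pairs taken from the question; in practice they would be read from the file.
edges = [("AA", "BB"), ("BB", "CC"), ("CC", "DD"), ("BB", "AA"),
         ("EE", "FF"), ("FF", "GG"), ("GG", "HH"),
         ("II", "JJ"), ("JJ", "II"), ("II", "KK")]

for component in connected_components(edges):
    print(" ".join(component))
```

With the pairs from the question this prints the three clusters, one per line. A pair such as BB GG anywhere in the input would join the first two clusters into one, since both would then belong to the same connected component.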
5.9 years ago

In pseudocode, if you want to try writing it yourself:

Initialize a 2D array: 2Darray

Import your file into a dataframe

Sort the dataframe by column one, then by column two, to get something like:

AA BB
BB AA
BB CC
CC DD

For each line of your dataframe:

  • If it's the first line of the dataframe, create an array and append the first and second elements of the line
  • Else, does the first element exist in array?
    • If yes, append the second element to array (if it is not already there)
    • Else, append array to 2Darray, reinitialize array, then append the first and second elements to array

At the end, append the last array to 2Darray; 2Darray will then hold your clusters.

Untested
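
A row-by-row approach like this can also be written with a union-find (disjoint-set) structure, which does not depend on the rows being sorted and merges two existing groups when a later pair links them. Note that this swaps in a different technique from the array-based pseudocode; the function name `cluster_pairs` is made up for this sketch.

```python
def cluster_pairs(pairs):
    """Group elements into clusters using a simple union-find over the given pairs."""
    parent = {}  # maps each element to its parent; a root is its own parent

    def find(x):
        # Register unseen elements as their own root.
        parent.setdefault(x, x)
        # Walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[x] != root:
            next_node = parent[x]
            parent[x] = root
            x = next_node
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union: attach one tree under the other

    # Collect the members of each root into one cluster.
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return [sorted(members) for members in clusters.values()]


# The third pair links two groups that were created separately.
print(cluster_pairs([("AA", "BB"), ("CC", "DD"), ("BB", "CC")]))
```

To use it on a file, each non-empty line can be split on whitespace into a pair before being passed in.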

5.9 years ago
JC 13k

You can use Perl or Python for that:

#!/usr/bin/perl

use strict;
use warnings;

my %g;        # net id => { node => 1 }
my $net = 0;  # next unused net id

while (<>) {
  chomp;
  my ($x, $y) = split ' ', $_;
  next unless defined $y;  # skip blank or incomplete lines

  # find every existing net that already contains $x or $y
  my @hits = grep { exists $g{$_}{$x} || exists $g{$_}{$y} }
             sort { $a <=> $b } keys %g;

  if (@hits) {
    # add the pair to the first matching net ...
    my $keep = shift @hits;
    $g{$keep}{$x} = 1;
    $g{$keep}{$y} = 1;
    # ... and merge any other matching nets into it
    for my $n (@hits) {
      $g{$keep}{$_} = 1 for keys %{ $g{$n} };
      delete $g{$n};
    }
  }
  else {
    # neither node seen in other nets: start a new net
    $g{$net}{$x} = 1;
    $g{$net}{$y} = 1;
    $net++;
  }
}

# write one net per line, in order of creation
foreach my $n (sort { $a <=> $b } keys %g) {
  print join("\t", sort keys %{ $g{$n} }), "\n";
}

Example:

$ perl graph.pl < data.txt
AA      BB      CC      DD
EE      FF      GG      HH
II      JJ      KK

Dear JC, I tested your script on a huge output file from Mash and it works! Thanks a lot! Very nice script :)


Glad to help, how large was the data?
