Consolidating Two Columns
5
3
Entering edit mode
12.7 years ago
Kiriya ▴ 100

I have list two columns like this:

Col1    Col2
Name1    AB, AC, CF
Name1    AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK
Name2    BC

I want to consolidate the two columns so they would like the following:

Col1    Col2
Name1    AB, AC, CF, AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK, BC

Does anyone has script to do this?

programming python perl awk • 3.2k views
ADD COMMENT
1
Entering edit mode

I have KEGG results and I want to put all the sequences that are in one pathway together. I didn't want to make it too complicated. Here is an example: C5-Branched dibasic acid metabolism KK_Contig_49268 C5-Branched dibasic acid metabolism KK_Contig_12740, KK_Contig_52938, KK_Contig_51604, KK_Contig_9479, KK_Contig_49400, KK_Contig_28354 Glycolysis / Gluconeogenesis KK_Contig_50816, KK_Contig_8607, KK_Contig_15245, KK_Contig_22682 Glycolysis / Gluconeogenesis KK_Contig_27393

ADD REPLY
0
Entering edit mode

Hi, can you please explain what this has to do with bioinformatics? Pure programming questions are discouraged, please read the FAQ.

ADD REPLY
0
Entering edit mode

Are the 2 columns separated by a tab?

ADD REPLY
0
Entering edit mode

Yes, they are separated by a tab.

ADD REPLY
0
Entering edit mode

Do you care about the order of the entries being consolidated? Should duplicates be removed or not?

ADD REPLY
0
Entering edit mode

No, I don't care about the order and duplicates should be removed in the second column.

ADD REPLY
2
Entering edit mode
12.7 years ago

A script to do this would normally consist of two steps (irrespective of which programming language you use):

  1. Read the input file and collect all data in key-value data structure (this would be called a hash table in Perl, a dictionary in Python, or an unordered map in C++). The key would be what is in column 1 and the value would be collecting all results pertaining to the key in question.
  2. Iterate over all the (sorted) keys in the your key-value data structure and print out all the pairs to produce the output file.
ADD COMMENT
2
Entering edit mode
12.7 years ago

Check out the "groupBy" command in my filo package. It is designed to handle exactly this problem. GroupBy only requires that 1) your input data be sorted by the columns that you want to group/consolidate, and 2), your data are tab-delimited. Here is an example with your data:

$ cat data.txt 
#Col1   Col2
Name1   AB, AC, CF
Name1   AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK
Name2   BC

$ groupBy -i data.txt -grp 1 -opCols 2 -ops collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Or, more briefly:

$ groupBy -i data.txt -g 1 -c 2 -o collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Here are some other usage examples:

ADD COMMENT
0
Entering edit mode

Yes, this seems like a very useful tool although I didn't have to use it. One thing that is not clear is what do I need to do to install the package? I already downloaded it. Sorry if this sounds like a silly question - I am more of a biologist!

ADD REPLY
0
Entering edit mode

One should merely need to download the .tar.gz file (or clone it with git), unpack the tarball, cd into the newly created directory, and type "make".

ADD REPLY
2
Entering edit mode
12.7 years ago
Rm 8.3k

awk -F"t" '{if(NR==1){print} else hsh[$1]=hsh[$1]$2", "}END{for (i in hsh){print i" "hsh[i]}}' input_file | sed 's/, $//'

Output:

Col1    Col2
Name1 AB, AC, CF, AF, AV, CG, HG
Name2 BB, BF, CD, CK, JK, BC
ADD COMMENT
1
Entering edit mode
12.7 years ago

Assuming the 2 columns separated by a tab, sample input file will be:

echo -e "Col1\tCol2
Name1\tAB, AC, CF
Name1\tAF, AV, CG, HG
Name2\tBB, BF, CD, CK, JK
Name2\tBC" > sample1

[?]

One-liner-solution:

cat sample1 | ruby -e 'c = Hash.new([]); while l = STDIN.gets; next if STDIN.lineno == 1; key, values = l.chomp.split "\t"; values = values.split ", "; c[key] += values; end; puts "Col1\tCol2"; c.each do |key, values|; vjoined=values.join ", "; puts "#{key}\t#{vjoined}"; end'

[?]

Output:

Col1    Col2
Name1   AB, AC, CF, AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK, BC

[?]

The code in readable format (filename: consolidate.rb):

#!/usr/bin/env ruby

c = Hash.new []

while l = STDIN.gets
  next if STDIN.lineno == 1 # Cut the header off (first line)
  key, values = l.chomp.split "\t" 
  values = values.split ", "
  c[key] += values
end

puts "Col1\tCol2"
c.each do |key, values|
  puts "#{key}\t#{values.join ', '}"
end
ADD COMMENT
1
Entering edit mode
12.7 years ago
Anjan ▴ 830

! /usr/bin/perl -w

use strict;

open (F, "file") || die "cannot open file" ; #file has data in two tab-delimited columns.

my %hoa; # a hash of arrays.

while (<F>){
   chomp;
   my @line = split/\t/;
   push (@{$hoa{$line[0]}}, $line[1]);
}

foreach my $k (sort keys %hoa){
   my $data = join(",", @{$hoa{$k}});
   print("$k\t$data\n");
}
#amen.

have not checked script...but you get the idea.

ADD COMMENT
0
Entering edit mode

I just tweaked your "join" for correctness.

ADD REPLY
0
Entering edit mode

I guess it's a perl thing to end your script in "amen"?

ADD REPLY
0
Entering edit mode

thanks @brentp, goofed my editing of the code example.

ADD REPLY
0
Entering edit mode

The script is working well and thanks a lot!

ADD REPLY
0
Entering edit mode

@aaronQuinlan: thanks for the correction.

ADD REPLY

Login before adding your answer.

Traffic: 2555 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6