Question: Consolidating Two Columns
3
gravatar for Kiriya
5.7 years ago by
Kiriya100
Kiriya100 wrote:

I have list two columns like this:

Col1    Col2
Name1    AB, AC, CF
Name1    AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK
Name2    BC

I want to consolidate the two columns so they would like the following:

Col1    Col2
Name1    AB, AC, CF, AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK, BC

Does anyone has script to do this?

perl python programming awk • 1.1k views
ADD COMMENTlink modified 5.7 years ago by Rm7.5k • written 5.7 years ago by Kiriya100
1

I have KEGG results and I want to put all the sequences that are in one pathway together. I didn't want to make it too complicated. Here is an example: C5-Branched dibasic acid metabolism KK_Contig_49268 C5-Branched dibasic acid metabolism KK_Contig_12740, KK_Contig_52938, KK_Contig_51604, KK_Contig_9479, KK_Contig_49400, KK_Contig_28354 Glycolysis / Gluconeogenesis KK_Contig_50816, KK_Contig_8607, KK_Contig_15245, KK_Contig_22682 Glycolysis / Gluconeogenesis KK_Contig_27393

ADD REPLYlink written 5.7 years ago by Kiriya100

Hi, can you please explain what this has to do with bioinformatics? Pure programming questions are discouraged, please read the FAQ.

ADD REPLYlink written 5.7 years ago by Michael Dondrup41k

Are the 2 columns separated by a tab?

ADD REPLYlink written 5.7 years ago by Aleksandr Levchuk3.0k

Yes, they are separated by a tab.

ADD REPLYlink written 5.7 years ago by Kiriya100

Do you care about the order of the entries being consolidated? Should duplicates be removed or not?

ADD REPLYlink written 5.7 years ago by Peter5.4k

No, I don't care about the order and duplicates should be removed in the second column.

ADD REPLYlink written 5.7 years ago by Kiriya100
2
gravatar for Lars Juhl Jensen
5.7 years ago by
Copenhagen, Denmark
Lars Juhl Jensen10k wrote:

A script to do this would normally consist of two steps (irrespective of which programming language you use):

  1. Read the input file and collect all data in key-value data structure (this would be called a hash table in Perl, a dictionary in Python, or an unordered map in C++). The key would be what is in column 1 and the value would be collecting all results pertaining to the key in question.
  2. Iterate over all the (sorted) keys in the your key-value data structure and print out all the pairs to produce the output file.
ADD COMMENTlink written 5.7 years ago by Lars Juhl Jensen10k
2
gravatar for Aaronquinlan
5.7 years ago by
Aaronquinlan9.7k
United States
Aaronquinlan9.7k wrote:

Check out the "groupBy" command in my filo package. It is designed to handle exactly this problem. GroupBy only requires that 1) your input data be sorted by the columns that you want to group/consolidate, and 2), your data are tab-delimited. Here is an example with your data:

$ cat data.txt 
#Col1   Col2
Name1   AB, AC, CF
Name1   AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK
Name2   BC

$ groupBy -i data.txt -grp 1 -opCols 2 -ops collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Or, more briefly:

$ groupBy -i data.txt -g 1 -c 2 -o collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC

Here are some other usage examples:

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Aaronquinlan9.7k

Yes, this seems like a very useful tool although I didn't have to use it. One thing that is not clear is what do I need to do to install the package? I already downloaded it. Sorry if this sounds like a silly question - I am more of a biologist!

ADD REPLYlink written 5.7 years ago by Kiriya100

One should merely need to download the .tar.gz file (or clone it with git), unpack the tarball, cd into the newly created directory, and type "make".

ADD REPLYlink written 5.7 years ago by Aaronquinlan9.7k
2
gravatar for Rm
5.7 years ago by
Rm7.5k
Danville, PA
Rm7.5k wrote:

awk -F"t" '{if(NR==1){print} else hsh[$1]=hsh[$1]$2", "}END{for (i in hsh){print i" "hsh[i]}}' input_file | sed 's/, $//'

Output:

Col1    Col2
Name1 AB, AC, CF, AF, AV, CG, HG
Name2 BB, BF, CD, CK, JK, BC
ADD COMMENTlink written 5.7 years ago by Rm7.5k
1
gravatar for Aleksandr Levchuk
5.7 years ago by
United States
Aleksandr Levchuk3.0k wrote:

Assuming the 2 columns separated by a tab, sample input file will be:

echo -e "Col1\tCol2
Name1\tAB, AC, CF
Name1\tAF, AV, CG, HG
Name2\tBB, BF, CD, CK, JK
Name2\tBC" > sample1

[?]

One-liner-solution:

cat sample1 | ruby -e 'c = Hash.new([]); while l = STDIN.gets; next if STDIN.lineno == 1; key, values = l.chomp.split "\t"; values = values.split ", "; c[key] += values; end; puts "Col1\tCol2"; c.each do |key, values|; vjoined=values.join ", "; puts "#{key}\t#{vjoined}"; end'

[?]

Output:

Col1    Col2
Name1   AB, AC, CF, AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK, BC

[?]

The code in readable format (filename: consolidate.rb):

#!/usr/bin/env ruby

c = Hash.new []

while l = STDIN.gets
  next if STDIN.lineno == 1 # Cut the header off (first line)
  key, values = l.chomp.split "\t" 
  values = values.split ", "
  c[key] += values
end

puts "Col1\tCol2"
c.each do |key, values|
  puts "#{key}\t#{values.join ', '}"
end
ADD COMMENTlink written 5.7 years ago by Aleksandr Levchuk3.0k
1
gravatar for Anjan
5.7 years ago by
Anjan670
United States
Anjan670 wrote:

! /usr/bin/perl -w

use strict;

open (F, "file") || die "cannot open file" ; #file has data in two tab-delimited columns.

my %hoa; # a hash of arrays.

while (<F>){
   chomp;
   my @line = split/\t/;
   push (@{$hoa{$line[0]}}, $line[1]);
}

foreach my $k (sort keys %hoa){
   my $data = join(",", @{$hoa{$k}});
   print("$k\t$data\n");
}
#amen.

have not checked script...but you get the idea.

ADD COMMENTlink modified 5.7 years ago by Aaronquinlan9.7k • written 5.7 years ago by Anjan670

I just tweaked your "join" for correctness.

ADD REPLYlink written 5.7 years ago by Aaronquinlan9.7k

I guess it's a perl thing to end your script in "amen"?

ADD REPLYlink written 5.7 years ago by brentp22k

thanks @brentp, goofed my editing of the code example.

ADD REPLYlink written 5.7 years ago by Aaronquinlan9.7k

The script is working well and thanks a lot!

ADD REPLYlink written 5.7 years ago by Kiriya100

@aaronQuinlan: thanks for the correction.

ADD REPLYlink written 5.7 years ago by Anjan670
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1660 users visited in the last hour