Consolidating Two Columns
5
3
Entering edit mode
10.3 years ago
Kiriya ▴ 100

I have list two columns like this:

Col1    Col2
Name1    AB, AC, CF
Name1    AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK
Name2    BC


I want to consolidate the two columns so they would like the following:

Col1    Col2
Name1    AB, AC, CF, AF, AV, CG, HG
Name2    BB, BF, CD, CK, JK, BC


Does anyone has script to do this?

programming python perl awk • 2.0k views
1
Entering edit mode

I have KEGG results and I want to put all the sequences that are in one pathway together. I didn't want to make it too complicated. Here is an example: C5-Branched dibasic acid metabolism KK_Contig_49268 C5-Branched dibasic acid metabolism KK_Contig_12740, KK_Contig_52938, KK_Contig_51604, KK_Contig_9479, KK_Contig_49400, KK_Contig_28354 Glycolysis / Gluconeogenesis KK_Contig_50816, KK_Contig_8607, KK_Contig_15245, KK_Contig_22682 Glycolysis / Gluconeogenesis KK_Contig_27393

0
Entering edit mode

Hi, can you please explain what this has to do with bioinformatics? Pure programming questions are discouraged, please read the FAQ.

0
Entering edit mode

Are the 2 columns separated by a tab?

0
Entering edit mode

Yes, they are separated by a tab.

0
Entering edit mode

Do you care about the order of the entries being consolidated? Should duplicates be removed or not?

0
Entering edit mode

No, I don't care about the order and duplicates should be removed in the second column.

2
Entering edit mode
10.3 years ago

A script to do this would normally consist of two steps (irrespective of which programming language you use):

1. Read the input file and collect all data in key-value data structure (this would be called a hash table in Perl, a dictionary in Python, or an unordered map in C++). The key would be what is in column 1 and the value would be collecting all results pertaining to the key in question.
2. Iterate over all the (sorted) keys in the your key-value data structure and print out all the pairs to produce the output file.
2
Entering edit mode
10.3 years ago

Check out the "groupBy" command in my filo package. It is designed to handle exactly this problem. GroupBy only requires that 1) your input data be sorted by the columns that you want to group/consolidate, and 2), your data are tab-delimited. Here is an example with your data:

$cat data.txt #Col1 Col2 Name1 AB, AC, CF Name1 AF, AV, CG, HG Name2 BB, BF, CD, CK, JK Name2 BC$ groupBy -i data.txt -grp 1 -opCols 2 -ops collapse
Name1   AB, AC, CF,AF, AV, CG, HG
Name2   BB, BF, CD, CK, JK,BC


Or, more briefly:

$groupBy -i data.txt -g 1 -c 2 -o collapse Name1 AB, AC, CF,AF, AV, CG, HG Name2 BB, BF, CD, CK, JK,BC  Here are some other usage examples: ADD COMMENT 0 Entering edit mode Yes, this seems like a very useful tool although I didn't have to use it. One thing that is not clear is what do I need to do to install the package? I already downloaded it. Sorry if this sounds like a silly question - I am more of a biologist! ADD REPLY 0 Entering edit mode One should merely need to download the .tar.gz file (or clone it with git), unpack the tarball, cd into the newly created directory, and type "make". ADD REPLY 2 Entering edit mode 10.3 years ago Rm 8.1k awk -F"t" '{if(NR==1){print} else hsh[$1]=hsh[$1]$2", "}END{for (i in hsh){print i" "hsh[i]}}' input_file | sed 's/, $//' Output: Col1 Col2 Name1 AB, AC, CF, AF, AV, CG, HG Name2 BB, BF, CD, CK, JK, BC  ADD COMMENT 1 Entering edit mode 10.3 years ago Assuming the 2 columns separated by a tab, sample input file will be: echo -e "Col1\tCol2 Name1\tAB, AC, CF Name1\tAF, AV, CG, HG Name2\tBB, BF, CD, CK, JK Name2\tBC" > sample1  [?] One-liner-solution: cat sample1 | ruby -e 'c = Hash.new([]); while l = STDIN.gets; next if STDIN.lineno == 1; key, values = l.chomp.split "\t"; values = values.split ", "; c[key] += values; end; puts "Col1\tCol2"; c.each do |key, values|; vjoined=values.join ", "; puts "#{key}\t#{vjoined}"; end'  [?] Output: Col1 Col2 Name1 AB, AC, CF, AF, AV, CG, HG Name2 BB, BF, CD, CK, JK, BC  [?] The code in readable format (filename: consolidate.rb): #!/usr/bin/env ruby c = Hash.new [] while l = STDIN.gets next if STDIN.lineno == 1 # Cut the header off (first line) key, values = l.chomp.split "\t" values = values.split ", " c[key] += values end puts "Col1\tCol2" c.each do |key, values| puts "#{key}\t#{values.join ', '}" end  ADD COMMENT 1 Entering edit mode 10.3 years ago Anjan ▴ 830 # ! /usr/bin/perl -w use strict; open (F, "file") || die "cannot open file" ; #file has data in two tab-delimited columns. my %hoa; # a hash of arrays. while (<F>){ chomp; my @line = split/\t/; push (@{$hoa{$line[0]}},$line[1]);
}

foreach my $k (sort keys %hoa){ my$data = join(",", @{$hoa{$k}});
print("$k\t$data\n");
}
#amen.


have not checked script...but you get the idea.

0
Entering edit mode

I just tweaked your "join" for correctness.

0
Entering edit mode

I guess it's a perl thing to end your script in "amen"?

0
Entering edit mode

thanks @brentp, goofed my editing of the code example.

0
Entering edit mode

The script is working well and thanks a lot!

0
Entering edit mode

@aaronQuinlan: thanks for the correction.