Question

How to make combinations of the values in column 2


3.5 years ago

debarunacharya • 0

Hello, I need to make random, non-redundant combinations of strings (in column 2) under a particular criterion (in column 1). The input and output are as follows. Note: for simplicity, the example uses three string values per group.

INPUT:

Uniprot_ID_A Uniprot_ID_B
P00001 Q00001
P00001 Q00002
P00001 Q00003
P00002 R00001
P00002 R00002
P00002 R00003

OUTPUT:

Uniprot_ID_A Combinations_of_Uniprot_ID_B
P00001 <tab> Q00001 <tab> Q00002
P00001 <tab> Q00001 <tab> Q00003
P00001 <tab> Q00002 <tab> Q00003
P00002 <tab> R00001 <tab> R00002
P00002 <tab> R00001 <tab> R00003
P00002 <tab> R00002 <tab> R00003

The combinations should be tab-separated, and the first column should be printed in the output. As I am not a coding expert, simple solutions will be highly appreciated. Thanks in advance.

combinations random combinations perl awk python • 732 views

updated 3.4 years ago by i-blis • 0 • written 3.5 years ago by debarunacharya • 0


What have you already tried? Please share with us

3.5 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2020-11-24

You have probably figured it out by now. In case you still need a starting point the next time you face a similar problem, here you go.

If I got the problem statement right, you want all unordered pairs of column B values for each value of column A, with everything kept in the order in which it appears.

This involves 2 steps:

1. Gather all the B values for each A entry.
2. Given a list of B values, build the list of pairs by advancing two indices, the second always ahead of the first:
(b_1,b_2), (b_1,b_3), ... , (b_1,b_n) ; (b_2,b_3), ... , (b_2,b_n) ; ... ; (b_n-1,b_n)

Step 1 is easily achieved with an associative array (as AWK aptly names them), also known as a hash (in Perl), dictionary (in Python) or map (in most functional languages).
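For readers who prefer Python, here is a minimal sketch of the grouping step only. The function name group_by_first_column and the sample lines are made up for illustration, and it assumes whitespace-separated columns and Python 3.7+ (where plain dicts keep insertion order).

```python
def group_by_first_column(lines):
    """Map each column-A value to the list of its column-B values."""
    groups = {}  # a plain dict keeps keys in first-seen order (Python 3.7+)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or one-column lines
        key, value = fields[0], fields[1]
        groups.setdefault(key, []).append(value)
    return groups


# Illustrative input: the A entries do not need to be contiguous.
sample = ["P00001 Q00001", "P00002 R00001", "P00001 Q00002"]
print(group_by_first_column(sample))
```

Because the dict remembers insertion order, no separate list of keys is needed here, unlike in the Perl version.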

Step 2 amounts to three nested loops: one over the unique A entries, and two over the B values to get all pairs, with the inner index always starting one past the outer index so that (b_m,b_n) is never produced for m >= n.
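The same loop structure can be sketched in Python. The names unordered_pairs and format_rows are hypothetical, and the groups dict is hard-coded here so that the snippet stands on its own. The standard library's itertools.combinations(values, 2) yields the same pairs in the same order, so the two inner loops are spelled out only to mirror the description.

```python
from itertools import combinations


def unordered_pairs(values):
    """Return every pair (values[i], values[j]) with i < j, in index order."""
    pairs = []
    for i in range(len(values) - 1):
        for j in range(i + 1, len(values)):  # j starts one past i
            pairs.append((values[i], values[j]))
    return pairs


def format_rows(groups):
    """Build one tab-separated row per (A entry, pair of B values)."""
    rows = []
    for head, values in groups.items():  # outer loop over the A entries
        for first, second in unordered_pairs(values):
            rows.append("\t".join((head, first, second)))
    return rows


groups = {"P00001": ["Q00001", "Q00002", "Q00003"]}  # illustrative data
for row in format_rows(groups):
    print(row)

# The hand-written loops agree with the standard library helper.
values = groups["P00001"]
print(unordered_pairs(values) == list(combinations(values, 2)))
```

A list with fewer than two values simply produces no pairs, so single-entry groups drop out of the output without any special handling.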

If the number of records were huge, and provided that the keys (column A values) are grouped in sequence, steps 1 and 2 could be done in a single pass over the data. I won't assume here, though, that the A entries are grouped.
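For illustration only, here is roughly what that single-pass variant could look like in Python, using itertools.groupby. The function name stream_pairs is made up, and the sketch relies on the assumption that all rows for a given A entry are contiguous. If they are not, groupby starts a new group at every change of key and pairs spanning the gap are silently lost.

```python
from itertools import combinations, groupby


def stream_pairs(lines):
    """Yield tab-separated rows, holding one A group in memory at a time.

    Assumes rows sharing the same column-A value are adjacent in the input.
    """
    records = (line.split() for line in lines)
    records = (fields for fields in records if len(fields) >= 2)
    for head, group in groupby(records, key=lambda fields: fields[0]):
        values = [fields[1] for fields in group]
        for first, second in combinations(values, 2):
            yield "\t".join((head, first, second))


# Illustrative input with grouped keys; any iterable of lines works,
# including an open file handle.
sample = [
    "P00001 Q00001", "P00001 Q00002", "P00001 Q00003",
    "P00002 R00001", "P00002 R00002", "P00002 R00003",
]
for row in stream_pairs(sample):
    print(row)
```

Since the function is a generator, rows are produced as each group is completed rather than after the whole input has been read.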

The code below should be pretty easy to follow. Should you need comments on the syntax, just ask in the comments.

Perl take

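#!/usr/bin/env perl
use strict; use warnings;

my (%recs, @heads);
while (<>) {
    # split ' ' ignores leading whitespace and the trailing newline
    my ($head, $tail) = split ' ';
    next unless defined $tail;                      # skip blank or one-column lines
    push @heads, $head unless exists $recs{$head};  # remember first-seen order of A
    push @{$recs{$head}}, $tail;                    # collect B values per A entry
}

for my $head (@heads) {
    my $vals = $recs{$head};
    # $j always starts one past $i, so each unordered pair is printed once
    for my $i (0 .. $#$vals - 1) {
        for my $j ($i + 1 .. $#$vals) {
            print join("\t", $head, $vals->[$i], $vals->[$j]), "\n";
        }
    }
}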

You may run and modify it online with your sample data, if you want.

Note that because Perl hashes are unordered, we need to keep the A entries in an array (@heads) in the order they appear.

AWK take

AWK is more elegant in my opinion. Note that this script requires GNU awk (gawk 4.0 or later), because it uses arrays of arrays.

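#!/usr/bin/gawk -f
# Adjust the interpreter path if needed, or run as: gawk -f script.awk input.txt
# (script.awk and input.txt are placeholder names)

BEGIN { OFS="\t" }

# Skip blank or one-column lines; store the B values per A entry.
NF >= 2 {
    if (!($1 in count)) heads[++n] = $1   # remember first-seen order of A
    rec[$1][++count[$1]] = $2
}

END {
    # "for (head in rec)" visits keys in an unspecified order, so walk heads[]
    for (h = 1; h <= n; h++) {
        head = heads[h]
        for (i = 1; i < count[head]; i++)
            for (j = i + 1; j <= count[head]; j++)
                print head, rec[head][i], rec[head][j]
    }
}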

Again, try it online.