Question: merge two files
gravatar for biolab
4.8 years ago by
biolab1.1k wrote:

Dear all,

I want to merge two data files. The first contains three columns($a, $c, $d), while the second contains two columns ($b, $c).  I need to generate a file having four columns ($a, $b, $c, $d) .  I have problem with my perl script shown below.  Could anyone help?  Or there should be easier way.  Thank you very much!

mir-a        gene1        33
mir-a        gene2        34
mir-a        gene3        89
mir-b        gene1        09
mir-b        gene3        33
mir-c        gene1        86
mir-c        gene2        20

group1        gene1
group1        gene3
group2        gene1
group3        gene1
group3        gene2

merged result should be:
mir-a        group1        gene1        33
mir-a        group2        gene1        33
mir-a        group3        gene1        33
mir-a        group3        gene2        34
mir-a        group1        gene3        89
mir-b        group1        gene1        09
mir-b        group2        gene1        09
mir-b        group3        gene1        09
mir-b        group1        gene3        33
mir-c        group1        gene1        86
mir-c        group2        gene1        86
mir-c        group3        gene1        86
mir-c        group3        gene2        20

my problematic perl script

#!/usr/bin/perl -w
use strict;

my $file1 = shift;
my $file2 = shift;

my (%hash1, %hash2, %hash_merge);

#read dat1.txt
open F1, $file1;
while(my $line1 = <F1>){
    chomp $line1;
    my ($a, $c, $d) = split/\s+/,$line1;
    $hash1{$a}{$c}{$d} = 1;
close F1;

#read dat2.txt
open F2, $file2;
while(my $line2 = <F2>){
    chomp $line2;
    my ($b, $c) = split/\s+/,$line2;
    $hash2{$b}{$c} = 1;
close F2;

#merge hash1 and hash2
foreach my $a (keys %hash1){
    foreach my $c (keys %{$hash1{$a}}{
        foreach my $b (keys %hash2){ #find column $b in dat2.txt
            next unless exists $hash2{$b}{$c};
            foreach my $d (keys %{$hash1{$a}{$c}}){
                $hash_merge{$a}{$b}{$c}{$d} = 1;

#print out hash_merge
foreach my $a (keys %hash_merge){
    foreach my $b (keys %{$hash_merge{$a}}{
        foreach my $c (keys %{$hash_merge{$a}{$b}}{
            foreach my $d (keys %{$hash_merge{$a}{$b}{$c}}{
                print "$a\t$b\t$c\t$d\n";


perl • 1.5k views
ADD COMMENTlink modified 4.8 years ago by dariober10k • written 4.8 years ago by biolab1.1k
gravatar for dariober
4.8 years ago by
WCIP | Glasgow | UK
dariober10k wrote:

If you are not settled with perl and you favour a simple solution, I would go for R:

d1<- read.table('data1.txt')
d2<- read.table('data2.txt')

merged<- merge(d1, d2, by= c('V2'))[, c(2, 1, 3, 4)]
write.table(merged, 'merged.txt', col.names= FALSE, row.names= FALSE, sep= '\t', quote= FALSE)

Merged file is going to be merged.txt:

mir-a    gene1    33    group1
mir-a    gene1    33    group2
mir-a    gene1    33    group3
mir-b    gene1    9    group1
mir-b    gene1    9    group2
mir-b    gene1    9    group3
mir-c    gene1    86    group1
mir-c    gene1    86    group2
mir-c    gene1    86    group3
mir-a    gene2    34    group3
mir-c    gene2    20    group3
mir-a    gene3    89    group1
mir-b    gene3    33    group1

This assuming RAM is not a limiting factor as R will load everything in memory. If speed as an issue, I would use the data.table package.

Even less verbose is unix join command:

join -1 2 -2 2 <(sort -k2,2 data1.txt) <(sort -k2,2 data2.txt)

But it's not as flexible as the R solution.

ADD COMMENTlink written 4.8 years ago by dariober10k

Alex and dariober, thanks a lot for your inputs.  Really helpful!

ADD REPLYlink written 4.8 years ago by biolab1.1k
gravatar for Alex Reynolds
4.8 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Untested, but I think the following should work, by simplifying the keys used in your hash tables:


use strict;
use warnings;

my $fn1 = $ARGV[0];
my $fn2 = $ARGV[1];

# we assume that this character can always delimit mir-X and geneY names
my $delimiter = "%"; 

my $mirgenes;
open my $fh1, "<", $fn1 or die "could not open $fn1\n";
while (<$fh1>) {
    my ($mir, $gene, $count) = split("\t", $_);
    my $mirgene = $mir.$delimiter.$gene;
    $mirgenes->{$mirgene} = $count;
close $fh1;

my $groups;
open my $fh2, "<", $fn2 or die "could not open $fn2\n";
while (<$fh2>) {
    my ($group, $gene) = split("\t", $_);
    if (! defined $groups->{$gene}) {
        @{$groups->{$gene}} = ();
    push(@{$groups->{$gene}}, $group);
close $fh2;

foreach my $mirgene (keys %{$mirgenes}) {
    my ($mir, $gene) = split($delimiter, $mirgene);
    my $count = $mirgenes->{$mirgene};
    foreach my $group (@{$groups->{$gene}}) {
        print STDOUT "$mir\t$group\t$gene\t$count\n";

To use, perhaps:

$ ./ data1.txt data2.txt | sort -k1,3 > answer.txt
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Alex Reynolds28k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 732 users visited in the last hour