Question: Comparing multiple columns from 2 files using awk, perl or python
0
gravatar for aboozar.soorni
5 months ago by
aboozar.soorni30 wrote:

I have 2 files as below:

File 1:

SpoScf_15890    12  2376
SpoScf_00032    299634  316568
SpoScf_15890    656772  669220
SpoScf_00032    667632  674746
SpoScf_07684    10  4075
SpoScf_07684    64  4276
SpoScf_00032    820227  826573
Super_scaffold_60   74732   78743
Super_scaffold_60   1101694 1102317
Super_scaffold_60   74955   77543

File 2:

SpoScf_15890    1   2976
SpoScf_23593    1   2413
SpoScf_51672    1   1782
SpoScf_07684    91  4078
SpoScf_03142    8164    12518
SpoScf_04517    8723    11547
SpoScf_02476    10671   14488
SpoScf_01270    63995   66773
SpoScf_00853    73199   75746
Super_scaffold_60   74936   77943

I would like to compare these files using awk, Perl, or python, specifically if column 1 of file1 is equal to column 1 of file2 (condition 1) and column 2 of file 1 is less than column 2 of file2 (condition 2) or column3 of file1 is less than column 3 of file2 (condition 2) and/or both column 2 and 3 of file 1 are less than column 2 and 3 of file2. If conditions satisfy then extract the entire line of file1. The file sizes are not the same. Also, there is no one to one correspondence between the lines in the two files.

Result

SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4278
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543

Any help appreciated.

Thanks

awk python perl • 322 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by aboozar.soorni30

So a match between the files is:

            File1      File2
            -----      -----
            Column1 == Column1
            Column2 <  Column2
            Column3 <  Column3
(Column2 & Column3) <  (Column2 & Column3)

What if one column entry is larger and the other smaller? Is this possible?

Only less than, or less than and equal to?

ADD REPLYlink modified 5 months ago • written 5 months ago by Joe18k

Thanks to you! Both scripts work well.

I would like to add one more condition if it is possible for you to edit the script. Another condition: Column 2 of file 1 is less than column 2 of file2 "but not more than 1000 (1K)" and also Column 3 of file 1 is less than column 3 of file2 "but not more than 1000 (1K)?

ADD REPLYlink written 5 months ago by aboozar.soorni30
import sys

d = {}
for argv in [2, 1]:
    with open(sys.argv[argv]) as fh:
        for line in fh:
            sl = line.split()
            id = sl[0]
            st = int(sl[1])
            en = int(sl[2])
            if argv == 2:
                d[id] = [st, en]
            else:
                if id in d:
                    if (d[id][0] - st <= 100000) or (d[id][1] - en <= 100000):
                        print(line.strip())
ADD REPLYlink written 5 months ago by a.zielezinski9.3k

Please refrain from adding new questions as answers - they should be comments. I have moved them this time.

It is also generally bad practice/forum etiquette to continually ask for modifications and updates to the original question as it muddles the information for future users who might need to learn from this thread.

ADD REPLYlink written 5 months ago by Joe18k
2
gravatar for JC
5 months ago by
JC12k
Mexico
JC12k wrote:

You need a script for that, here my solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[1] or die "use substract.pl FILE1 FILE2 > OUTPUT\n";

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
my ($id, $ini, $end);

# load data from file2
my %data2 = ();
open (my $f2, "<", "$file2") or die "cannot read $file2\n";
while (<$f2>) {
    chomp;
    ($id, $ini, $end) = split (/\s+/, $_);
    $data2{$id}{'ini'} = $ini;
    $data2{$id}{'end'} = $end;
}
close $f2;

# read and process file1
open (my $f1, "<", "$file1") or die "cannot read $file1\n";
while (<$f1>) {
    chomp;
    ($id, $ini, $end) = split (/\s+/, $_);
    next unless (defined $data2{$id});
    print "$_\n" if ( $ini <= $data2{$id}{'ini'} or $end <= $data2{$id}{'end'});
}
close $f1;

Using it:

$ perl substract.pl file1.txt file2.txt
SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4276
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543
ADD COMMENTlink written 5 months ago by JC12k
1
gravatar for a.zielezinski
5 months ago by
a.zielezinski9.3k
a.zielezinski9.3k wrote:

Here's Python script:

import sys

d = {}
for argv in [2, 1]:
    with open(sys.argv[argv]) as fh:
        for line in fh:
            sl = line.split()
            id = sl[0]
            st = int(sl[1])
            en = int(sl[2])
            if argv == 2:
                d[id] = [st, en]
            else:
                if id in d and (st <= d[id][0] or en <= d[id][1]):
                    print(line.strip())

Run it:

python substract.py file1.txt file2.txt

Output:

SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4276
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543
ADD COMMENTlink written 5 months ago by a.zielezinski9.3k
0
gravatar for Jorge Amigo
5 months ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

It doesn't exactly answer your question, but if you're dealing with regions you may want to have a look to bedtools:

$ bedtools intersect -a file1.bed -b file2.bed -wa
SpoScf_15890    12      2376
SpoScf_07684    10      4075
SpoScf_07684    64      4276
Super_scaffold_60       74732   78743
Super_scaffold_60       74955   77543

As I said, this doesn't answer your question since this tool looks for overlaps rather than positions-less-than conditions, but it does allow you to think about a faster and more eficient way of handling region information. Having said this, and although it's not the most efficient way to answer your second question with the "but not more than 1000 (1K)" condition, you can still solve it with bedtools:

$ perl -lane 'foreach $i (1,2) { print join "\t", $F[0], $F[$i]-1, $F[$i], (join " ", @F) }' file1.bed > file1.bed.temp
$ perl -lane 'foreach $i (1,2) { if ($F[$i] > 1000) { $t = $F[$i] - 1001 } else { $t = 0 } print join "\t", $F[0], $t, $F[$i] }' file2.bed > file2.bed.temp
$ bedtools intersect -a file1.bed.temp -b file2.bed.temp -wa | cut -f4 | sort | uniq -c | perl -lane 'print join "\t", @F[1..3] if $F[0] == 2'
SpoScf_07684    10      4075
ADD COMMENTlink modified 5 months ago • written 5 months ago by Jorge Amigo12k
0
gravatar for aboozar.soorni
5 months ago by
aboozar.soorni30 wrote:

Thanks to all,

Unfortunately, both a.zielezinski and Amigo answers could not answer my question when I added a new condition ("but not more than 1000"). As Amigo said, bedtools doesn't answer my question because this tool looks for overlaps, not positions-less-than or positions-greater-than conditions. Let me bring a new input example files and the output file which I expect to get.

file 1:

SpoScf_15890    12  1376
SpoScf_15890    12  2376
Super_scaffold_60   73836   77943
Super_scaffold_60   73926   76843
Super_scaffold_60   78936   87943

file 2:

SpoScf_15890    1   3376
Super_scaffold_60   74936   77943

output:

SpoScf_15890    12  2376
Super_scaffold_60   73836   78943
Super_scaffold_60   73926   76843

in file 1, first-line (SpoScf_15890 12 1376) will not be extracted because column 3 (1376 ) is less than 3376 but more than 1000 (1376 - 3376 = -2000 (should not be more than -1000 if want to be extracted).

ADD COMMENTlink modified 5 months ago • written 5 months ago by aboozar.soorni30

Why is the proposed solution for your additional condition not working? What's happening that you don't want to happen?

For the sake of the clarity of this post, it'd be more useful to reply to a.zielinski's comment above if you run into errors running the modified script.

ADD REPLYlink written 5 months ago by Friederike6.5k

Are sure the output is correct? I think that only "SpoScf_15890 12 2376" fulfils the criteria (start or end in the file1 should be less than in file2, and the difference shouldn't be greater than 1000)

ADD REPLYlink written 5 months ago by a.zielezinski9.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1269 users visited in the last hour