Comparing multiple columns from 2 files using awk, perl or python
4
0
Entering edit mode
2.2 years ago

I have 2 files as below:

File 1:

SpoScf_15890    12  2376
SpoScf_00032    299634  316568
SpoScf_15890    656772  669220
SpoScf_00032    667632  674746
SpoScf_07684    10  4075
SpoScf_07684    64  4276
SpoScf_00032    820227  826573
Super_scaffold_60   74732   78743
Super_scaffold_60   1101694 1102317
Super_scaffold_60   74955   77543


File 2:

SpoScf_15890    1   2976
SpoScf_23593    1   2413
SpoScf_51672    1   1782
SpoScf_07684    91  4078
SpoScf_03142    8164    12518
SpoScf_04517    8723    11547
SpoScf_02476    10671   14488
SpoScf_01270    63995   66773
SpoScf_00853    73199   75746
Super_scaffold_60   74936   77943


I would like to compare these files using awk, Perl, or python, specifically if column 1 of file1 is equal to column 1 of file2 (condition 1) and column 2 of file 1 is less than column 2 of file2 (condition 2) or column3 of file1 is less than column 3 of file2 (condition 2) and/or both column 2 and 3 of file 1 are less than column 2 and 3 of file2. If conditions satisfy then extract the entire line of file1. The file sizes are not the same. Also, there is no one to one correspondence between the lines in the two files.

Result

SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4278
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543


Any help appreciated.

Thanks

awk perl python • 1.6k views
0
Entering edit mode

So a match between the files is:

            File1      File2
-----      -----
Column1 == Column1
Column2 <  Column2
Column3 <  Column3
(Column2 & Column3) <  (Column2 & Column3)


What if one column entry is larger and the other smaller? Is this possible?

Only less than, or less than and equal to?

0
Entering edit mode

Thanks to you! Both scripts work well.

I would like to add one more condition if it is possible for you to edit the script. Another condition: Column 2 of file 1 is less than column 2 of file2 "but not more than 1000 (1K)" and also Column 3 of file 1 is less than column 3 of file2 "but not more than 1000 (1K)?

0
Entering edit mode
import sys

d = {}
for argv in [2, 1]:
with open(sys.argv[argv]) as fh:
for line in fh:
sl = line.split()
id = sl[0]
st = int(sl[1])
en = int(sl[2])
if argv == 2:
d[id] = [st, en]
else:
if id in d:
if (d[id][0] - st <= 100000) or (d[id][1] - en <= 100000):
print(line.strip())

0
Entering edit mode

It is also generally bad practice/forum etiquette to continually ask for modifications and updates to the original question as it muddles the information for future users who might need to learn from this thread.

2
Entering edit mode
2.2 years ago
JC 13k

You need a script for that, here my solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[1] or die "use substract.pl FILE1 FILE2 > OUTPUT\n"; my$file1 = shift @ARGV;
my $file2 = shift @ARGV; my ($id, $ini,$end);

my %data2 = ();
open (my $f2, "<", "$file2") or die "cannot read $file2\n"; while (<$f2>) {
chomp;
($id,$ini, $end) = split (/\s+/,$_);
$data2{$id}{'ini'} = $ini;$data2{$id}{'end'} =$end;
}
close $f2; # read and process file1 open (my$f1, "<", "$file1") or die "cannot read$file1\n";
while (<$f1>) { chomp; ($id, $ini,$end) = split (/\s+/, $_); next unless (defined$data2{$id}); print "$_\n" if ( $ini <=$data2{$id}{'ini'} or$end <= $data2{$id}{'end'});
}
close $f1;  Using it: $ perl substract.pl file1.txt file2.txt
SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4276
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543

2
Entering edit mode
2.2 years ago

Here's Python script:

import sys

d = {}
for argv in [2, 1]:
with open(sys.argv[argv]) as fh:
for line in fh:
sl = line.split()
id = sl[0]
st = int(sl[1])
en = int(sl[2])
if argv == 2:
d[id] = [st, en]
else:
if id in d and (st <= d[id][0] or en <= d[id][1]):
print(line.strip())


Run it:

python substract.py file1.txt file2.txt


Output:

SpoScf_15890    12  2376
SpoScf_07684    10  4075
SpoScf_07684    64  4276
Super_scaffold_60   74732   78743
Super_scaffold_60   74955   77543

0
Entering edit mode
2.2 years ago

It doesn't exactly answer your question, but if you're dealing with regions you may want to have a look to bedtools:

$bedtools intersect -a file1.bed -b file2.bed -wa SpoScf_15890 12 2376 SpoScf_07684 10 4075 SpoScf_07684 64 4276 Super_scaffold_60 74732 78743 Super_scaffold_60 74955 77543  As I said, this doesn't answer your question since this tool looks for overlaps rather than positions-less-than conditions, but it does allow you to think about a faster and more eficient way of handling region information. Having said this, and although it's not the most efficient way to answer your second question with the "but not more than 1000 (1K)" condition, you can still solve it with bedtools: $ perl -lane 'foreach $i (1,2) { print join "\t",$F[0], $F[$i]-1, $F[$i], (join " ", @F) }' file1.bed > file1.bed.temp
$perl -lane 'foreach$i (1,2) { if ($F[$i] > 1000) { $t =$F[$i] - 1001 } else {$t = 0 } print join "\t", $F[0],$t, $F[$i] }' file2.bed > file2.bed.temp
$bedtools intersect -a file1.bed.temp -b file2.bed.temp -wa | cut -f4 | sort | uniq -c | perl -lane 'print join "\t", @F[1..3] if$F[0] == 2'
SpoScf_07684    10      4075

0
Entering edit mode
2.2 years ago

Thanks to all,

Unfortunately, both a.zielezinski and Amigo answers could not answer my question when I added a new condition ("but not more than 1000"). As Amigo said, bedtools doesn't answer my question because this tool looks for overlaps, not positions-less-than or positions-greater-than conditions. Let me bring a new input example files and the output file which I expect to get.

file 1:

SpoScf_15890    12  1376
SpoScf_15890    12  2376
Super_scaffold_60   73836   77943
Super_scaffold_60   73926   76843
Super_scaffold_60   78936   87943


file 2:

SpoScf_15890    1   3376
Super_scaffold_60   74936   77943


output:

SpoScf_15890    12  2376
Super_scaffold_60   73836   78943
Super_scaffold_60   73926   76843


in file 1, first-line (SpoScf_15890 12 1376) will not be extracted because column 3 (1376 ) is less than 3376 but more than 1000 (1376 - 3376 = -2000 (should not be more than -1000 if want to be extracted).

0
Entering edit mode

Why is the proposed solution for your additional condition not working? What's happening that you don't want to happen?

For the sake of the clarity of this post, it'd be more useful to reply to a.zielinski's comment above if you run into errors running the modified script.

0
Entering edit mode

Are sure the output is correct? I think that only "SpoScf_15890 12 2376" fulfils the criteria (start or end in the file1 should be less than in file2, and the difference shouldn't be greater than 1000)