Question

How to combine two BED files using a shared ID column

3.3 years ago

szp770 ▴ 10

Hi, I have two BED files. Each has four columns, the fourth column holds a unique ID for each row, and each file has thousands of rows. I want to combine rows from the two files when they share the same ID. The final output should look like this:

chr1    10028   10029    chr14   68314662        68314663     J00118:253:HJ2FTBBXX:3:2213:8491:13394

bed 1:

chr1    10028   10029   J00118:253:HJ2FTBBXX:3:2213:8491:13394
...

bed 2:

chr14   68314662        68314663        J00118:253:HJ2FTBBXX:3:2213:8491:13394
...

bed linux shell python • 1.4k views

ADD COMMENT • link updated 3.3 years ago by Alex Reynolds 35k • written 3.3 years ago by szp770 ▴ 10

finswimmer · Answer 1 · 2020-12-24

Save the script below as combine_bed_by_id.pl, then run: perl combine_bed_by_id.pl bed1.txt bed2.txt

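use strict;
use warnings;

my ($bed1, $bed2) = @ARGV;
die "Usage: perl $0 bed1.txt bed2.txt\n" unless defined $bed2;

my %bed1 = read_bed($bed1);
my %bed2 = read_bed($bed2);

# Print only IDs present in both files: bed1 columns, bed2 columns, then the ID.
for my $id (sort keys %bed1){
    print "$bed1{$id}\t$bed2{$id}\t$id\n" if exists $bed2{$id};
}

# Read a BED file into a hash: ID (last column) => remaining columns joined by tabs.
sub read_bed{
    my $bed = shift;
    open my $in, '<', $bed or die "Cannot open $bed: $!\n";
    my %f;
    while(<$in>){
        chomp;
        my @temp = split;
        next unless @temp;    # skip blank lines
        my $id = pop @temp;
        my $info = join "\t", @temp;
        $f{$id} = $info;
    }
    close $in;    # close before returning; the original close was unreachable
    return %f;
}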

score 0 · Answer 2 · 2020-12-24

3.3 years ago

Pierre Lindenbaum 161k

join -t $'\t' -1 4 -2 4 <(sort -t $'\t' -k4,4 file1.bed) <(sort -t $'\t' -k4,4 file2.bed)

ADD COMMENT • link 3.3 years ago by Pierre Lindenbaum 161k

That's really succinct!

ADD REPLY • link 3.3 years ago by szp770 ▴ 10

... and the simplest, which is the one I would go for. Since join prints the ID in the first column, here's a minimal modification that moves it to the end to exactly match the desired output:

join -t $'\t' -1 4 -2 4 <(sort -k4,4 file1.bed) <(sort -k4,4 file2.bed) | perl -pe 's/(\S+)\t(.+)/$2\t$1/'

ADD REPLY • link 3.3 years ago by Jorge Amigo 14k

"here's a minimum modification to exactly match the desired output:"

You can use join's output-format option, -o FORMAT, to achieve the same result ;-)

ADD REPLY • link 3.3 years ago by Pierre Lindenbaum 161k
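
As a minimal sketch of the -o suggestion above (not a command posted in the thread): the file names and the read_A/read_B rows are hypothetical toy data, and temporary files are used instead of process substitution so the snippet also runs in a plain POSIX shell. In a -o list each entry is FILE.FIELD, and 0 stands for the join field.

```shell
# Toy inputs (hypothetical file names and rows), tab-separated, ID in column 4.
tab=$(printf '\t')
workdir=$(mktemp -d)
printf 'chr1\t10028\t10029\tread_A\nchr2\t500\t501\tread_B\n' > "$workdir/file1.bed"
printf 'chr9\t77\t78\tread_B\nchr14\t68314662\t68314663\tread_A\n' > "$workdir/file2.bed"

# join needs both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/file1.bed" > "$workdir/file1.sorted"
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/file2.bed" > "$workdir/file2.sorted"

# Columns 1-3 of file 1, columns 1-3 of file 2, then the shared ID (field 0).
joined=$(LC_ALL=C join -t "$tab" -1 4 -2 4 -o 1.1,1.2,1.3,2.1,2.2,2.3,0 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$joined"
rm -r "$workdir"
```

With this layout the ID lands in the last column, so no post-processing with perl is needed.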

Good to know. Thank you Pierre.

ADD REPLY • link 3.3 years ago by Jorge Amigo 14k

Hey, what if I add a 5th column to each file and still want to join on the same 4th column value, while preserving the 5th column information in the final result? Thanks!

ADD REPLY • link 3.3 years ago by szp770 ▴ 10

Pierre's answer would still work. It would output column 4, then columns 1-3 and 5 of the first file, followed by columns 1-3 and 5 of the second file. As Pierre mentioned, you can change the column layout with the -o option. Here's an example that may help you understand how join's output format works.

ADD REPLY • link 3.3 years ago by Jorge Amigo 14k
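
For the five-column case, the following is a small sketch of both layouts, not the example the reply pointed to. The file names, the read_A ID, and the score1/score2 values are hypothetical placeholders for whatever the 5th column holds.

```shell
tab=$(printf '\t')
workdir=$(mktemp -d)
# Hypothetical five-column rows: chrom, start, end, ID, extra value.
printf 'chr1\t10028\t10029\tread_A\tscore1\n' > "$workdir/file1.bed"
printf 'chr14\t68314662\t68314663\tread_A\tscore2\n' > "$workdir/file2.bed"

LC_ALL=C sort -t "$tab" -k4,4 "$workdir/file1.bed" > "$workdir/file1.sorted"
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/file2.bed" > "$workdir/file2.sorted"

# Default layout: join field first, then the remaining fields of each file.
default_layout=$(LC_ALL=C join -t "$tab" -1 4 -2 4 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")

# Explicit layout: coordinates and 5th column of each file, ID last.
custom_layout=$(LC_ALL=C join -t "$tab" -1 4 -2 4 \
    -o 1.1,1.2,1.3,1.5,2.1,2.2,2.3,2.5,0 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n%s\n' "$default_layout" "$custom_layout"
rm -r "$workdir"
```

The first printed line starts with the ID, while the second ends with it. Reordering or dropping entries in the -o list is all it takes to change the layout.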

score 0 · Answer 3 · 2020-12-24

3.3 years ago

Jorge Amigo 14k

perl -lane '$d{$F[3]} .= "@F[0..2] "; END {
 foreach $i (keys %d) { print $d{$i}.$i }
}' file1.bed file2.bed | awk '$5~/./'

ADD COMMENT • link 3.3 years ago by Jorge Amigo 14k

score 0 · Answer 4 · 2020-12-24

3.3 years ago

Alex Reynolds 35k

Here's one that might be a little simpler to follow:

$ sort -k4,4 A.bed B.bed | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

Here's how it works:

sort -k4,4 A.bed B.bed: sort the concatenation of A and B by the fourth column

paste -d "\t" - -: take every two lines from the output of sort and join them with a tab character

cut -f1-3,5-8: take columns 1 through 3 and 5 through 8 of the output from paste

sort-bed -: sort the output lexicographically for downstream set operations

ADD COMMENT • link 3.3 years ago by Alex Reynolds 35k
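
The stages described above can be traced on toy data. This is a sketch under stated assumptions: the A.bed/B.bed rows and IDs are hypothetical, every ID occurs exactly once in each file, and the final sort-bed step (a BEDOPS tool) is left out so that only standard utilities are needed.

```shell
tab=$(printf '\t')
workdir=$(mktemp -d)
# Hypothetical inputs in which every ID occurs exactly once in each file.
printf 'chr1\t10\t11\tread_A\nchr2\t20\t21\tread_B\n' > "$workdir/A.bed"
printf 'chr4\t40\t41\tread_B\nchr3\t30\t31\tread_A\n' > "$workdir/B.bed"

# Stage 1: rows with equal IDs end up on adjacent lines.
stage1=$(LC_ALL=C sort -t "$tab" -k4,4 "$workdir/A.bed" "$workdir/B.bed")
# Stage 2: each pair of adjacent lines becomes one 8-column line.
stage2=$(printf '%s\n' "$stage1" | paste - -)
# Stage 3: drop column 4, the first copy of the ID.
stage3=$(printf '%s\n' "$stage2" | cut -f1-3,5-8)

printf '%s\n' "$stage3"
rm -r "$workdir"
```

Note that when two lines tie on the ID, sort falls back to comparing the whole lines, so which file's row comes first within a pair depends on the row contents rather than on the file it came from.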

That's really helpful, thanks so much!

ADD REPLY • link 3.3 years ago by szp770 ▴ 10

Any pure sorting solution will only work if both files contain exactly the same IDs, each appearing once per file, because it pairs every two consecutive lines regardless of their IDs. The IDs present in both files should therefore be selected before using this code. Here's an example chaining two cut | grep steps, one to detect the shared IDs and the other to print only the lines containing those IDs:

cut -f4 A.bed | grep -F -f - B.bed | cut -f4 | grep -h -F -f - A.bed B.bed \
| sort -k4,4 | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

... although this is basically what the join solution does.

ADD REPLY • link 3.3 years ago by Jorge Amigo 14k
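
To make the pairing caveat concrete, here is a small sketch comparing the two approaches on hypothetical files in which only one ID (read_B) is shared. The rows and file names are invented for illustration, and sort-bed is again omitted.

```shell
tab=$(printf '\t')
workdir=$(mktemp -d)
# Hypothetical inputs: only read_B occurs in both files.
printf 'chr1\t10\t11\tread_A\nchr2\t20\t21\tread_B\n' > "$workdir/A.bed"
printf 'chr3\t30\t31\tread_B\nchr4\t40\t41\tread_C\n' > "$workdir/B.bed"

# Sort-and-paste pairs consecutive lines, whatever their IDs are.
paired=$(LC_ALL=C sort -t "$tab" -k4,4 "$workdir/A.bed" "$workdir/B.bed" \
    | paste - - | cut -f1-3,5-8)

# join only emits rows whose ID is present in both sorted inputs.
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/A.bed" > "$workdir/A.sorted"
LC_ALL=C sort -t "$tab" -k4,4 "$workdir/B.bed" > "$workdir/B.sorted"
matched=$(LC_ALL=C join -t "$tab" -1 4 -2 4 -o 1.1,1.2,1.3,2.1,2.2,2.3,0 \
    "$workdir/A.sorted" "$workdir/B.sorted")

printf 'sort | paste:\n%s\njoin:\n%s\n' "$paired" "$matched"
rm -r "$workdir"
```

On this input the sort-and-paste output pairs read_A with read_B and read_B with read_C, whereas join reports the single correct read_B pair.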