Question: how to combine two bed files using the same ID information
10 weeks ago, szp770 wrote:

Hi, I have two BED files. Each has four columns, and the fourth column holds a unique ID for each row; each file has thousands of rows. I want to combine rows from the two files when they share the same ID. The final output should look like this:

chr1    10028   10029    chr14   68314662        68314663     J00118:253:HJ2FTBBXX:3:2213:8491:13394

bed 1:

chr1    10028   10029   J00118:253:HJ2FTBBXX:3:2213:8491:13394
...

bed 2:

chr14   68314662        68314663        J00118:253:HJ2FTBBXX:3:2213:8491:13394
...
python shell linux bed • 221 views
10 weeks ago, wangmingcheng1992 (China_LanZhou_lzu) wrote:

Save the script below as combine_bed_by_id.pl, then run: perl combine_bed_by_id.pl bed1.txt bed2.txt

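use strict;
use warnings;

die "Usage: perl $0 bed1.txt bed2.txt\n" unless @ARGV == 2;
my ($bed1, $bed2) = @ARGV;

# Map each ID (last column) to the remaining columns of its row.
my %bed1 = read_bed($bed1);
my %bed2 = read_bed($bed2);

# Print only IDs found in both files: file 1 columns, file 2 columns, ID.
for my $id (sort keys %bed1) {
    print "$bed1{$id}\t$bed2{$id}\t$id\n" if exists $bed2{$id};
}

sub read_bed {
    my $file = shift;
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    my %f;
    while (<$in>) {
        chomp;
        my @temp = split;       # split the row on whitespace
        next unless @temp;      # skip blank lines
        my $id = pop @temp;     # the ID is the last column
        $f{$id} = join "\t", @temp;
    }
    close $in;                  # close before returning; it was unreachable after return
    return %f;
}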
10 weeks ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
join -t $'\t' -1 4 -2 4 <(sort -t $'\t' -k4,4 file1.bed) <(sort -t $'\t' -k4,4 file2.bed)

That's really succinct!

written 10 weeks ago by szp770

... and the simplest one, which is what I would go for. Since join prints the ID in the first column, here's a minimal modification that moves it to the end to match the desired output exactly:

join -t $'\t' -1 4 -2 4 <(sort -k4,4 file1.bed) <(sort -k4,4 file2.bed) | perl -pe 's/(\S+)\t(.+)/$2\t$1/'
written 10 weeks ago by Jorge Amigo

"here's a minimum modification to exactly match the desired output:"

You can use join's output-format option, -o FORMAT, to achieve the same result ;-)

written 10 weeks ago by Pierre Lindenbaum
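
For reference, a minimal sketch of the -o approach. The sample rows, the IDs and the temporary file names are made up for illustration, and temporary files stand in for the <(...) process substitution so that it also runs under plain sh. Each -o entry is FILE.FIELD, and 0 stands for the join field, so listing 0 last puts the ID at the end of the line.

```shell
#!/bin/sh
# Hypothetical sample data; file names and rows are for illustration only.
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1\t10028\t10029\tread_b\nchr2\t500\t501\tread_a\n' > "$tmp/file1.bed"
printf 'chr14\t68314662\t68314663\tread_b\nchr7\t900\t901\tread_c\n' > "$tmp/file2.bed"

# join needs both inputs sorted on the join field (column 4).
sort -t "$tab" -k4,4 "$tmp/file1.bed" > "$tmp/file1.sorted"
sort -t "$tab" -k4,4 "$tmp/file2.bed" > "$tmp/file2.sorted"

# -o lists the output fields as FILE.FIELD; 0 is the join field (the ID).
joined=$(join -t "$tab" -1 4 -2 4 -o 1.1,1.2,1.3,2.1,2.2,2.3,0 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"

rm -r "$tmp"
```

Only read_b occurs in both sample files, so a single combined row is printed; read_a and read_c are dropped, which is join's default behaviour for unpaired lines.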

Good to know. Thank you Pierre.

written 10 weeks ago by Jorge Amigo

Hey, what if I add a 5th column to each file, still want to join on the 4th column, and want to preserve the 5th column information in the final result? Thanks!

written 8 weeks ago by szp770

Pierre's answer would still work. It'll output columns 4, 1-3 and 5 of the first file, plus 1-3 and 5 of the second file. As Pierre mentioned, you may modify the column layout using the -o option. Here's an example that may help you understand how join output format works.

written 8 weeks ago by Jorge Amigo
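
A small sketch of the five-column case, again with invented rows and temporary file names (the fifth column here is an arbitrary number, since the thread does not say what it holds). It runs join twice, once with the default layout and once with an explicit -o list that keeps both fifth columns and moves the ID to the end.

```shell
#!/bin/sh
# Hypothetical five-column rows: chrom, start, end, ID, extra value.
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1\t10028\t10029\tread_a\t37\n' > "$tmp/file1.bed"
printf 'chr14\t68314662\t68314663\tread_a\t60\n' > "$tmp/file2.bed"

# Both inputs must be sorted on the join field (column 4).
sort -t "$tab" -k4,4 "$tmp/file1.bed" > "$tmp/file1.sorted"
sort -t "$tab" -k4,4 "$tmp/file2.bed" > "$tmp/file2.sorted"

# Default layout: the join field first, then the other fields of each file.
default_out=$(join -t "$tab" -1 4 -2 4 "$tmp/file1.sorted" "$tmp/file2.sorted")

# Explicit layout: columns 1-3 and 5 of each file, then the ID (0) last.
custom_out=$(join -t "$tab" -1 4 -2 4 -o 1.1,1.2,1.3,1.5,2.1,2.2,2.3,2.5,0 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")

printf 'default: %s\ncustom:  %s\n' "$default_out" "$custom_out"
rm -r "$tmp"
```

The -o list is just an ordered pick of fields, so any other arrangement (for example, both fifth columns together at the end) only needs the entries reordered.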
10 weeks ago, Jorge Amigo (Santiago de Compostela, Spain) wrote:
perl -lane '$d{$F[3]} .= "@F[0..2] "; END {
 foreach $i (keys %d) { print $d{$i}.$i }
}' file1.bed file2.bed | awk '$5~/./'
10 weeks ago, Alex Reynolds (Seattle, WA USA) wrote:

Here's one that might be a little simpler to follow:

$ sort -k4,4 A.bed B.bed | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

Here's how it works:

sort -k4,4 A.bed B.bed: sort the concatenation of A and B by the fourth column

paste -d "\t" - -: take every two lines from the output of sort and join them with a tab character

cut -f1-3,5-8: take columns 1 through 3 and 5 through 8 of the output from paste

sort-bed -: sort the output lexicographically for downstream set operations
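
The steps above can be tried on a few made-up rows. This is a sketch under some assumptions: the rows and IDs are invented, LC_ALL=C is set so the sort order is predictable, and the final sort-bed step is left out because it needs BEDOPS installed. Note that rows sharing an ID are ordered by comparing the whole line, not by which file they came from, so it is worth checking that the A row comes first in your own data.

```shell
#!/bin/sh
# Hypothetical sample data: both files contain exactly the same two IDs.
tmp=$(mktemp -d)
printf 'chr1\t10028\t10029\tread_a\nchr2\t500\t501\tread_b\n' > "$tmp/A.bed"
printf 'chr7\t900\t901\tread_b\nchr14\t68314662\t68314663\tread_a\n' > "$tmp/B.bed"

# 1. sort:  bring the two rows of each ID next to each other.
# 2. paste: merge every two consecutive lines into one 8-column line.
# 3. cut:   drop column 4 (the first copy of the ID), keep the rest.
paired=$(LC_ALL=C sort -k4,4 "$tmp/A.bed" "$tmp/B.bed" \
    | paste -d "\t" - - \
    | cut -f1-3,5-8)

printf '%s\n' "$paired"
rm -r "$tmp"
```

With these rows, the A row of each pair happens to sort before the B row (chr1 before chr14, chr2 before chr7), so the A coordinates end up in columns 1-3 and the B coordinates in columns 4-6.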


That's really helpful, Thanks so much!

written 10 weeks ago by szp770

Any solution based purely on sorting will only work if both files contain exactly the same IDs, because it pairs every two lines regardless of their IDs. The IDs present in both files should therefore be selected before running this code. Here's an example that nests two cut | grep steps: the first detects the shared IDs, and the second prints only the lines containing those IDs:

cut -f4 A.bed | grep -F -f - B.bed | cut -f4 | grep -h -F -f - A.bed B.bed \
| sort -k4,4 | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

... although this is basically what the join solution does.

written 10 weeks ago by Jorge Amigo
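
To see why the pre-selection matters, here is a small sketch with made-up rows in which each file has one ID the other lacks. After paste, the two ID columns (4 and 8) should be identical on every line. The count_mispaired helper and its awk check are hypothetical additions for illustration, not part of the answers above; the helper counts the lines where the two IDs differ.

```shell
#!/bin/sh
# Hypothetical sample data: read_a is only in A, read_c is only in B.
tmp=$(mktemp -d)
printf 'chr1\t100\t101\tread_a\nchr1\t200\t201\tread_b\n' > "$tmp/A.bed"
printf 'chr2\t300\t301\tread_b\nchr2\t400\t401\tread_c\n' > "$tmp/B.bed"

# Illustrative helper: pair consecutive sorted lines and count the pairs
# whose two ID columns (4 and 8) do not match.
count_mispaired() {
    LC_ALL=C sort -k4,4 \
        | paste -d "\t" - - \
        | awk -F '\t' '$4 != $8 { n++ } END { print n + 0 }'
}

# Without pre-selection, the unshared IDs shift the pairing.
naive=$(cat "$tmp/A.bed" "$tmp/B.bed" | count_mispaired)

# With the cut | grep pre-selection, only shared IDs reach the pairing step.
filtered=$(cut -f4 "$tmp/A.bed" | grep -F -f - "$tmp/B.bed" | cut -f4 \
    | grep -h -F -f - "$tmp/A.bed" "$tmp/B.bed" | count_mispaired)

printf 'mispaired without selection: %s\n' "$naive"
printf 'mispaired with selection:    %s\n' "$filtered"
rm -r "$tmp"
```

A non-zero count is a quick signal that the inputs do not share the same set of IDs and that join, or a pre-selection step, is the safer route.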