Question

Is It Possible To Filter Only Bookend Reads From A Bed File?

0

Entering edit mode

10.2 years ago

Duarte Molha ▴ 240

I have a bed file with many fragments, some overlapping, some on their own and some adjacent to each other (book-ended) features.

I know can group overlapping and book-ended features using bedtools like

bedtools cluster -i fragments.bed

However I was wondering if anyone knew of a way of obtaining from the input file only the fragments that contain book-ended adjacent fragments.

Any ideas?

Best regards

bedtools filtering scripting • 3.6k views

ADD COMMENT • link updated 10.2 years ago by Alex Reynolds 35k • written 10.2 years ago by Duarte Molha ▴ 240

score 3 · Answer 1 · 2014-01-31

You can use the closest tool in bedtools, ignore overlapping features (-io) and require the distance (-d) to be 1. I would recommend using verion 2.18.2 or later.

$ cat i.bed
chr1    10    20    a
chr1    22    24    b
chr1    24    26    c
chr1    26    28    d
chr1    40    50    e
chr1    60    70    f
chr1    65    75    g
chr1    75    80    h

$ bedtools closest -a i.bed -b i.bed -io -d | awk '$9==1'
chr1    22    24    b    chr1    24    26    c    1
chr1    24    26    c    chr1    22    24    b    1
chr1    24    26    c    chr1    26    28    d    1
chr1    26    28    d    chr1    24    26    c    1
chr1    65    75    g    chr1    75    80    h    1
chr1    75    80    h    chr1    65    75    g    1

If you want a non-redundant set, just do the following:

$ bedtools closest -a i.bed -b i.bed -io -d | awk '$9==1' | awk '$2<$6'
chr1    22    24    b    chr1    24    26    c    1
chr1    24    26    c    chr1    26    28    d    1
chr1    65    75    g    chr1    75    80    h    1

score 1 · Answer 2 · 2014-01-31

This is pretty straightforward with BEDOPS:

sort-bed file.bed > file.sort.bed
closest-features --no-overlaps --dist --closest file.sort.bed file.sort.bed | awk -F'|' '$3*$3==1'

You basically ask what elements are closest 'to the left' and 'to the right' of every element in the file using an edge-2-edge measure (by running closest-features on the file against itself). If the distance is 1 or -1, then you have found an element that you're looking for. The output is still sorted per sort-bed so you can use immediately in further downstream BEDOPS operations.

Perhaps the only slightly tricky part is in using --no-overlaps. This ensures that elements declared closest 'to the left/right' are disjoint (which includes immediately adjacent elements) from the reference element. The awk math trick is needed because there is no built-in absolute value function in awk and closest-features returns the signed distance to the nearest element.

$ cat file.sort.bed
chr1    10  20  h
chr1    10  25  i
chr1    20  25  j
chr1    25  50  k
chr1    30  35  l
chr1    31  45  m
chr1    32  46  n

$ closest-features --no-overlaps --dist --closest file.sort.bed file.sort.bed | awk -F'|' '$3*$3==1'
chr1    10  20  h|chr1  20  25  j|1
chr1    10  25  i|chr1  25  50  k|1
chr1    20  25  j|chr1  10  20  h|-1
chr1    25  50  k|chr1  20  25  j|-1

since we used --closest, only the single nearest element is chosen, with ties 'going to the left'. It's also worth noting that closest-features has essentially no memory overhead and it's wicked fast. Use it with any size inputs without issue.