Merge intervals between but NOT within 2 BED files
Entering edit mode
3.8 years ago
Anand Rao ▴ 520

I have BED files for 2 types of coords: A and B. I want to collapse them into intervals where an entry of A is separated from an entry of B by not more than 20, and vice versa. What is not acceptable is a merge of coordinates with 2 or more entries from just A or 2 or more entries from just B.

For example, let's say:

cat A.bed 
Chr1    10  20  A1
Chr1    50  60  A2
Chr1    75  100 A3


cat B.bed 
Chr1    25  40  B1
Chr1    115 160 B2


cat A.bed B.bed > AB.bed
sortBed -i AB.bed > AB_sort.bed

The output from mergeBed with -d 20 gives what I do NOT want!

mergeBed -d 20 -i AB_sort.bed -c 4 -o collapse -delim ","
Chr1    10  160 A1,B1,A2,A3,B2

Results look different when I visually parse bedtools closest results seen below

bedtools closest -d -a A.bed -b B.bed 
Chr1    10  20  A1  Chr1    25  40  B1  6
Chr1    50  60  A2  Chr1    25  40  B1  11
Chr1    75  100 A3  Chr1    115 160 B2  16

What combinations of bedtools's sub-commands should I use to get the type of result that I seek, shown below:

Chr1    10  60  A1,B1,A2
Chr1    75  160 A3,B2

Thanks to Alex for pointing out an error, changed from A3,B1 to A3,B2 in the line above

In my manually parsed output example above, I do NOT collapse coords for features A2 and A3 together (the rule mentioned in the intro) whereas mergeBed is agnostic of this. Any ideas on how to use bedtools and / or bedOps?

note to self: Images (with and without annotation) to help visualize this example has been added below...

Raw Image for example Visualization of example

Annotated image for example AnnotatedImage

bedtools merge overlap bedops • 1.1k views
Entering edit mode

Can you explain the A3,B1 part of your expected result? That doesn't make sense in the context of your question.

You can do this with bedmap very easily, I think, but I need to understand if your expected output is correct, first.

Entering edit mode

Thank you for pointing that out, Alex. I fixed the error above. Look forward to your response. Thanks, in advance.

Entering edit mode

I think this is complicated enough to warrant a custom python/perl/R/ruby/haskell/FORTRAN program :-)

Entering edit mode
3.8 years ago

I think this logic should work, so long as elements in B.bed are not within each other by 20*2 (40) bases, and elements in A.bed and B.bed are sorted with sort-bed (not sortBed):

$ bedmap --echo-map --range 20 --delim '\t' --echo A.bed B.bed \
    | sort -k4,4 \
    | awk -vOFS="\t" -vFS="\t" '{ \
          a[$4]["chr"] = $1; \
          a[$4]["start"] = ($2 < $6) ? ($2 < a[$4]["start"] ? $2 : a[$4]["start"]) : $6; \
          a[$4]["stop"] = ($3 > $7) ? ($3 > a[$4]["stop"] ? $3 : a[$4]["stop"]) : $7; \
          a[$4]["ids"] = a[$4]["ids"] ":" $8; \
      } \
      END { \
          for (i in a) { \
              print a[i]["chr"], a[i]["start"], a[i]["stop"], a[i]["ids"]":"i; \
          } \
      }' \
    | sort-bed - \
    > answer.bed

Test run:

$ cat answer.bed
Chr1    10  60  :A1:A2:B1
Chr1    75  160 :A3:B2

If elements in B.bed are within 40 bases or less, then I think the result could be muddled by multiple overlap categories.

Also, this isn't strictly a merge, but simply taking the min-max bounds of all spanning elements in a "span group". There could be gaps within these bounds (as in this example) that a strict merge would be split upon.

Entering edit mode

Thank you. I've got to think about this, using real data. Because there is no restriction to separation between elements in real-life B.bed data....Thanks again...Like Ram suggested, I might have to cobble up some code for these customized analyses...


Login before adding your answer.

Traffic: 496 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6