Here is a solution using cascalog, which provides a high level
query interface on top of Hadoop fully integrated with the
Clojure programming language. By abstracting away the map/reduce
details, it eases implementation and understandability of the
code. You can grab the full code example from GitHub.
First define the query. This is two functions: testing when one
start end coordinate overlaps with a second and writing the cascalog
query that uses this to return only those chromosome start end pairs
in the first that overlap with the second:
(deffilterop overlaps [s1 e1 s2 e2]
  "Pass filter if s1,e1 start end pair overlaps s2,e2 pair."
  (or (and (>= s1 s2) (< s1 e2))
      (and (>= e1 s2) (< e1 e2))))
(defn run-bed-overlap [bed-one bed-two]
  "Cascalog query to pull data from two bed files and intersect."
  (?<- (stdout) [?chr ?s1 ?e1]
       (bed-one ?chr ?s1 ?e1)
       (bed-two ?chr ?s2 ?e2) ; Matches chromosome with bed-one
       (overlaps ?s1 ?e1 ?s2 ?e2)))
The other part is writing the plumbing to parse BED files and feed
them into the run-bed-overlap function:
(defmapop parse-bed-line [line]
  (take 3 (split line #"\t")))
(defn to-int [x] (Integer/parseInt x))
(defn bed-from-hfs [dir]
  "With a BED file from HDFS, produce chromsome, start, end."
  (let [source (hfs-textline dir)]
    (<- [?chr ?s ?e]
        (source ?line)
        (parse-bed-line ?line :> ?chr ?s-str ?e-str)
        (to-int ?s-str :> ?s)
        (to-int ?e-str :> ?e))))
(defn bed-hfs-overlap [dir1 dir2]
  (run-bed-overlap (bed-from-hfs dir1) (bed-from-hfs dir2)))
Push your bed files to HDFS and run on Hadoop with:
% cd /path/to/bed-hadoop
% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/bed-hadoop/bed-1
% hadoop fs -mkdir /tmp/bed-hadoop/bed-2
% hadoop fs -put test/one.bed /tmp/bed-hadoop/bed-1
% hadoop fs -put test/two.bed /tmp/bed-hadoop/bed-2
% hadoop jar bed-hadoop-0.0.1-SNAPSHOT-standalone.jar
             bed_hadoop.core /tmp/bed-hadoop/bed-1 /tmp/bed-hadoop/bed-2
RESULTS
----------
chr1 20 30
chr2 40 50
----------
                    
                
                 
i don't know how anyone writes this stuff (Clojure)