Question

Find overlapping coordinates between a bed file and multiple files in order and associate the coordinates with the original coordinate in tsv

0

Entering edit mode

6.8 years ago

skim ▴ 60

Hello, I have posted a question here but I found out that bedtools intersect was not sufficient for me. I need to make a tab separated table in format like below

I have one original bed file and 125 cell line bed files that I have to check overlaps and list the overlapping coordinates as a new file in the following format:

original bed file: an ordinary bed6 file

chromosome start end nametag score strand(+/-)

ex) chr1 1000001 1000095 RBP7 70.69 +

cellline bed file: also an ordinary bed6 or bed8 file

what i have to make:

I have to make a new tsv file that appends values to the right of each bed coordinate.

chromosome start end nametag score strand(+/-) cellline1 cellline2 ....etc..... cellline125 Total_celline_overlap_count

ex) chr1 1000001 1000095 RBP7 70.69 + chr1:1000001-1000004 None ...etc.... chr1:1000003-1000015 60

basically, I have to write down in every line 1) whether this original coordinate has an overlap at the given cell line and 2) if there is, write down the overlaps.

Therefore there are 6+125+1 columns in this file.

I found out that bedtools intersect does not provide this function and even if I could code in python to do this, the time consumption is very inefficient if i use the files generated in bedtools intersect. I used glob package to search the 125 files in my directory and I used csv package to make my custom tsv file but they were insufficient to do this job with bedtools intersect. Can there be any help for me? Thank you very much.

overlap sequence bed next-gen • 4.2k views

ADD COMMENT • link updated 6.8 years ago by Alex Reynolds 35k • written 6.8 years ago by skim ▴ 60

1

Entering edit mode

All you need is original bedtools intersect and linux join command.

Algorithm goes like that:

for i in cellLineBed {
    # you need wao to get all reports in -a 
    # possible to overcome this with linux join
    bedtoolsIntersect -a originalBedFile -b ${i}_cellLineBed -wao > result_${i}
}
join result_* by originalBedFile (chromosome start end nametag)

I haven't used join in a long time and don't remember it's syntax, but it should do what you want

ADD REPLY • link 6.8 years ago by PoGibas 5.1k

score 1 · Answer 1 · 2017-07-26

BEDOPS does this efficiently.

First, make a file called union.bed that contains a custom ID built from the chromosome, start and stop of each element. Then map that file to original.bed, retrieving IDs of overlapping elements (--echo-map-id-uniq) and the total count of overlaps (--count):

$ bedops --everything cellLine*.bed | awk '{ print $1"\t"$2"\t"$3"\t0\t"$1":"$2"-"$3; }' > union.bed
$ bedmap --echo --echo-map-id-uniq --count --delim '\t' --multidelim '\t' original.bed union.bed > answer.bed

An even more efficient approach would use a process substitution, which avoids creating the intermediate file union.bed:

$ bedmap --echo --echo-map-id-uniq --count --delim '\t' --multidelim '\t' original.bed <(bedops --everything cellLine*.bed | awk '{ print $1"\t"$2"\t"$3"\t0\t"$1":"$2"-"$3; }') > answer.bed

Use sort-bed to sort your input BED files original.bed and cellLine*.bed before using bedops and bedmap, if the inputs are not sorted or if their sort order is unknown, e.g.:

$ sort-bed original.unsorted.bed > original.bed