Question: What is the best way to overlap hundreds of BED files with one master BED file?
Ian (University of Manchester, UK) wrote 3.6 years ago:

I have more than 400 BED files of transcription factor binding site (TFBS) coordinates that I want to compare with one master BED file of binding regions.  The aim is to identify which of the ~400 TFBS sets overlap each region in the master file.

Normally I would go straight to bedtools, but given the number of TFBS files I wonder whether there is a "better" method.  bedtools intersect would appear to do the trick, but is it practical for 400 files?

The type of output I am looking to get is:

master_pos1    TFBS8, TFBS16, TFBS200
master_pos2    TFBS1, TFBS333
etc.

Thank you.

intersect bed overlap • 1.6k views
Alex Reynolds (Seattle, WA USA) wrote 3.6 years ago:

If your BED files are sorted, you could use bedops to union all the TFBS to standard output, and pipe that result to bedmap to do the mapping of TFs to master regions.

(This technique assumes that the TFBS files are minimally BED4. That is, the fourth column in each TFBS file contains the ID of the TF. If that is not the case, describe your format in more detail and I'll suggest a quick one-liner with awk to fix up files into the correct form.)
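One possible form of such a fix-up, as a minimal sketch: it assumes three-column (BED3) input and that each file name, minus its extension, is an acceptable TF label. The directory names demo_bed3 and demo_bed4 and the file CREB1.bed are hypothetical and exist only for the demonstration.

```shell
# Demo input: a hypothetical three-column file named after its TF.
mkdir -p demo_bed3 demo_bed4
printf 'chr1\t100\t200\nchr1\t300\t400\n' > demo_bed3/CREB1.bed

# Append the file name (without ".bed") as column 4 to get minimal BED4.
for fn in demo_bed3/*.bed; do
    id=$(basename "$fn" .bed)
    awk -v id="$id" 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, id }' "$fn" > "demo_bed4/$id.bed"
done

cat demo_bed4/CREB1.bed
```

Writing the results to a separate directory keeps the original files untouched, so the step can be repeated safely.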

Here's the one-liner that unions and maps:

$ bedops --everything tfbs*.bed | bedmap --echo --echo-map-id-uniq --delim '\t' master.bed - > answer.bed

Piping through standard output avoids writing an intermediate file to disk, which would otherwise cost a lot of time, so this should be very fast.

Assuming that the ID fields in each of tfbs001.bed through tfbs400.bed contain the desired TF names or other identifiers of choice, the file answer.bed contains the results you expect, except that it uses a semicolon as the ID delimiter instead of a comma. If a comma is required, you could add --multidelim ',' to the bedmap statement.

If your BED files are not sorted, you could first prepare them with BEDOPS sort-bed, which is faster at sorting BED files than GNU sort.

$ for tfbs_fn in tfbs*.bed; do sort-bed "$tfbs_fn" > "sorted.$tfbs_fn"; done
$ sort-bed master.bed > sorted.master.bed

Then use the sorted files (sorted.tfbs*.bed and sorted.master.bed) in the downstream BEDOPS commands. You only need to sort once.
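To check whether a file still needs sorting, a rough test with standard sort can help. This is only a sketch: it assumes that byte-wise chromosome order followed by numeric start and end approximates the order sort-bed produces, and the helper names is_sorted and report, as well as the demo_sortcheck directory, are hypothetical.

```shell
# Demo input: a hypothetical master file with an out-of-order chr1 record.
mkdir -p demo_sortcheck
printf 'chr1\t500\t600\nchr1\t100\t200\nchr10\t50\t80\n' > demo_sortcheck/master.bed

# is_sorted is a hypothetical helper. "sort -c" only checks the order and
# exits non-zero at the first out-of-order line; "-s" skips the whole-line
# tie-break so records with equal coordinates are not reported.
is_sorted() {
    LC_ALL=C sort -c -s -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

report() {
    if is_sorted "$1"; then echo "sorted"; else echo "unsorted"; fi
}

report demo_sortcheck/master.bed > demo_sortcheck/before.txt

# Sort with the same keys (sort-bed would be the BEDOPS way) and check again.
LC_ALL=C sort -k1,1 -k2,2n -k3,3n demo_sortcheck/master.bed > demo_sortcheck/sorted.master.bed
report demo_sortcheck/sorted.master.bed > demo_sortcheck/after.txt

cat demo_sortcheck/before.txt demo_sortcheck/after.txt
```

Running the check on the master file as well as the TFBS files catches the case where only one side of the comparison was sorted.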


Great method, I shall have to look more into what bedops offers.  Just wondering why '+t' appears at the start of the last column?  E.g. '+tCREB1;CST6;'.  I have used 'sed' to remove it for the moment.

modified 3.6 years ago • written 3.6 years ago by Ian

Do your BED files come from Excel or Windows? Such files usually need to be cleaned up. You might take a look at the suggestions in my answer in this thread: bedmap output on one line
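A common cleanup for files that have passed through Windows is to strip carriage returns. The sketch below assumes CRLF line endings are the problem, which may or may not explain the stray characters seen here, and it does not reproduce the suggestions from the linked thread. The demo_crlf directory and its file names are hypothetical.

```shell
# Demo input: a hypothetical BED4 file saved with Windows (CRLF) line endings.
mkdir -p demo_crlf
printf 'chr1\t100\t200\tCREB1\r\nchr1\t300\t400\tCST6\r\n' > demo_crlf/tfbs_win.bed

# Count lines that end in a carriage return (non-zero suggests CRLF endings).
cr=$(printf '\r')
grep -c "${cr}\$" demo_crlf/tfbs_win.bed > demo_crlf/crlf_count.txt

# Delete every carriage return, then confirm that none are left.
# grep exits non-zero when it counts no matches, hence the "|| true".
tr -d '\r' < demo_crlf/tfbs_win.bed > demo_crlf/tfbs_unix.bed
grep -c "${cr}\$" demo_crlf/tfbs_unix.bed > demo_crlf/clean_count.txt || true

cat demo_crlf/crlf_count.txt demo_crlf/clean_count.txt
```

Cleaning the files before sorting avoids carrying the stray characters into the ID column of the final output.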

written 3.6 years ago by Alex Reynolds

I ran the one-liner, apparently with success; however, there are no clusters on chromosomes 10-22 (human), even though there are TFBS on those chromosomes in the bedops input.  Is there any obvious reason why I am seeing this? (Edit: please ignore this, I had failed to sort the master file as well.)

modified 3.6 years ago • written 3.6 years ago by Ian

Yeah, just run sort-bed on BED files and you'll be fine.

written 3.6 years ago by Alex Reynolds
dariober (WCIP | Glasgow | UK) wrote 3.6 years ago:

Here's a synopsis of how I would do it:

  • Intersect each TFBS file with the master file, and add a column with the TFBS ID (here the file name itself) to each output file, like:
    sort -k1,1 -k2,2n mytfbs.bed | intersectBed -sorted -u -b - -a master.bed | awk -v tfbs=mytfbs.bed '{print $0 "\t" tfbs}' > master.mytfbs.bed
    Skip the sort step if the TFBS files are already sorted (master.bed should be sorted as well). Use the appropriate options for intersectBed, of course.
  • Merge-sort all the output files and pipe to mergeBed to get the desired output (-c 4 assumes master.bed has three columns, so that the appended ID is column 4):
    sort -k1,1 -k2,2n -m master.*.bed | mergeBed -i - -c 4 -o collapse > out.tfbs.bed

(Not tested)
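The final collapse step can also be sketched with plain sort and awk. Note that this swaps mergeBed for simple grouping on identical master regions, so it does not merge overlapping but non-identical intervals. It assumes the per-TF files hold BED3 master regions with the TF label appended as column 4; the demo_collapse directory and the demo records are hypothetical.

```shell
# Demo input: hypothetical per-TF outputs of the intersect step, i.e. master
# regions (BED3) with the TF label appended as column 4.
mkdir -p demo_collapse
printf 'chr1\t100\t200\tTFBS8\nchr2\t50\t90\tTFBS8\n' > demo_collapse/master.TFBS8.bed
printf 'chr1\t100\t200\tTFBS16\n' > demo_collapse/master.TFBS16.bed

# Sort all records by region, then join the labels of identical regions.
# Unlike mergeBed, this does not merge overlapping but non-identical intervals.
cat demo_collapse/master.*.bed |
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 |
    awk 'BEGIN { FS = OFS = "\t" }
         {
             key = $1 OFS $2 OFS $3
             if (key != prev) {
                 if (NR > 1) print prev, ids
                 prev = key
                 ids = $4
             } else {
                 ids = ids ", " $4
             }
         }
         END { if (NR > 0) print prev, ids }' > demo_collapse/out.tfbs.bed

cat demo_collapse/out.tfbs.bed
```

Joining the labels with a comma and a space mirrors the output layout requested in the question.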
