Question

How can I ask bedops element-of to add additional informations in its output ?

0

Entering edit mode

9.8 years ago

Nicolas Rosewick 11k

Hi,

I've two sorted bed files: One with all gene coordinates, one with defined genomic windows. In the second file, several genomic windows can overlap such as :

gene File
chr1    1400    1500    geneA
chr1    1900    2000    geneB

window file
chr1    1000    2000    sample1
chr1    1200    2200    sample2

When I perform bedops -e -1 gene.bed window.bed it returns me

chr1    1400    1500    geneA
chr1    1900    2000    geneB

But is it possible to ask bedops to return the results for each window ? like this

chr1    1400    1500    geneA    sample1
chr1    1900    2000    geneB    sample1
chr1    1400    1500    geneA    sample2
chr1    1900    2000    geneB    sample2

My window bed file is very big (millions of entry), so a loop across the window file and execute bedops on each line is not the best option I think

Thanks

bedops • 2.2k views

ADD COMMENT • link updated 9.8 years ago by Alex Reynolds 36k • written 9.8 years ago by Nicolas Rosewick 11k

score 1 · Accepted Answer · 2015-10-01

Yes, this is what bedmap is for, mapping or relating one set of BED elements to another:

$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed > answer.bed

The result:

$ more answer.bed
chr1    1400    1500    geneA    sample1;sample2
chr1    1900    2000    geneB    sample1;sample2

To explain the bedmap statement, the --echo option prints each gene from genes.bed. The --echo-map-id option prints the ID value of any mapped (overlapping) elements from windows.bed. The --delim option separates the gene and ID result with a tab character.

If you need the list of window IDs on separate lines, you can pipe the output to awk or other scripting tools to do this. Here is an untested script to get it into your desired format:

$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed \
   | awk '{ \
            n_ids = split($5, ids, ";"); \
            for (idx = 1; idx <= n_ids; idx++) { \
              print $1"\t"$2"\t"$3"\t"$4"\t"ids[idx]; \
            } \
          }' - \
   | sort-bed - \
   > split_answer.bed

The result:

$ more answer.bed
chr1    1400    1500    geneA    sample1
chr1    1400    1500    geneA    sample2
chr1    1900    2000    geneB    sample1
chr1    1900    2000    geneB    sample2

Note the use of sort-bed at the end of the pipeline, to ensure the output is a sorted BED file that you can use with other BEDOPS tools. If you don't care about the sort order of the output, you can remove the sort-bed call.

Piping avoids making intermediate files, which is time-consuming, and so it is a good way to deal with very big BED files. You might first test things out with a small subset of your input files to make sure the pipeline works, then run your full input files through the working pipeline.