Question: How can I ask bedops element-of to add additional informations in its output ?
0
gravatar for Nicolas Rosewick
3.6 years ago by
Belgium, Brussels
Nicolas Rosewick7.5k wrote:

Hi,

I've two sorted bed files: One with all gene coordinates, one with defined genomic windows. In the second file, several genomic windows can overlap such as :

gene File
chr1    1400    1500    geneA
chr1    1900    2000    geneB

window file
chr1    1000    2000    sample1
chr1    1200    2200    sample2

When I perform bedops -e -1 gene.bed window.bed it returns me

chr1    1400    1500    geneA
chr1    1900    2000    geneB

But is it possible to ask bedops to return the results for each window ? like this

chr1    1400    1500    geneA    sample1
chr1    1900    2000    geneB    sample1
chr1    1400    1500    geneA    sample2
chr1    1900    2000    geneB    sample2

My window bed file is very big (millions of entry), so a loop across the window file and execute bedops on each line is not the best option I think

Thanks

bedops • 1.1k views
ADD COMMENTlink modified 3.6 years ago by Alex Reynolds28k • written 3.6 years ago by Nicolas Rosewick7.5k
1
gravatar for Alex Reynolds
3.6 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Yes, this is what bedmap is for, mapping or relating one set of BED elements to another:

$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed > answer.bed

The result:

$ more answer.bed
chr1    1400    1500    geneA    sample1;sample2
chr1    1900    2000    geneB    sample1;sample2

To explain the bedmap statement, the --echo option prints each gene from genes.bed. The --echo-map-id option prints the ID value of any mapped (overlapping) elements from windows.bed. The --delim option separates the gene and ID result with a tab character.

If you need the list of window IDs on separate lines, you can pipe the output to awk or other scripting tools to do this. Here is an untested script to get it into your desired format:

$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed \
   | awk '{ \
            n_ids = split($5, ids, ";"); \
            for (idx = 1; idx <= n_ids; idx++) { \
              print $1"\t"$2"\t"$3"\t"$4"\t"ids[idx]; \
            } \
          }' - \
   | sort-bed - \
   > split_answer.bed

The result:

$ more answer.bed
chr1    1400    1500    geneA    sample1
chr1    1400    1500    geneA    sample2
chr1    1900    2000    geneB    sample1
chr1    1900    2000    geneB    sample2

Note the use of sort-bed at the end of the pipeline, to ensure the output is a sorted BED file that you can use with other BEDOPS tools. If you don't care about the sort order of the output, you can remove the sort-bed call.

Piping avoids making intermediate files, which is time-consuming, and so it is a good way to deal with very big BED files. You might first test things out with a small subset of your input files to make sure the pipeline works, then run your full input files through the working pipeline.

ADD COMMENTlink modified 3.0 years ago • written 3.6 years ago by Alex Reynolds28k

thanks that's perfect !

ADD REPLYlink written 3.6 years ago by Nicolas Rosewick7.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 961 users visited in the last hour