Question

Imposing occupancy values from one BED file onto different intervals

8.7 years ago

chrisclarkson100 ▴ 160

I have a txt file:

chr start end superfluous_data
chr1 3000000 3039999 0.00585524735801591 
chr1 3040000 3079999 0.00462068257738901 
chr1 3080000 3119999 0.00410291608104423 
chr1 3120000 3159999 0.00445902789765337

I manipulated the file as I am only interested in the intervals:

awk '{print $1,'\t',$2,'\t',$3}' data/file.txt > intervals_of_interest.bed

I wanted to get the occupancy values (specified by a different bed file) of a particular protein at these intervals.

Heterochromatin.bed (genome-wide):

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0
chr1    4164204 4167731 Region_5        0       0
chr1    4599546 4604437 Region_7        0       0
chr1    5355834 5360997 Region_10       0       0

My attempt to align and assign the occupancy values to the region of interest is as follows:

bedmap --echo --echo-map-id-uniq intervals_of_interest.bed ../Heterochromatin.bed

Oddly, the output looks like this:

chr     1       0|
tart    1       0|
nd      1       0|
hr1     3000000 3039999|
chr1    3040000 3079999|
chr1    3080000 3119999|
chr1    3120000 3159999|
chr1    3160000 3199999|
chr1    3200000 3239999|

More worrying is that I can't seem to assign calculated occupancy values to these intervals.

Can anyone tell me if it is possible to re-calculate occupancy values from one bed file and map to different intervals?

Thanks


score 1 · Answer 1 · 2017-02-27

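If you have a header in the first line of intervals_of_interest.bed, use tail to strip it out; otherwise you will get a bogus result when using that file downstream. Note also that the quoting in your awk command breaks: the shell strips the backslash from the unquoted \t, so awk sees an unset variable t and prints space-separated fields rather than tabs. Setting OFS avoids this:

$ awk -v OFS='\t' '{print $1, $2, $3}' data/file.txt | tail -n +2 | sort-bed - > intervals_of_interest.bed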
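
As a minimal sketch of the clean-up step, the header can also be dropped inside awk itself while forcing tab-delimited output. The helper name to_bed3 and the throwaway sample file are hypothetical, and the sketch assumes the header is always the first line and the input is whitespace-separated. Its output would still need to be piped through sort-bed as above.

```shell
# Hypothetical helper: emit a tab-delimited, header-free three-column BED stream.
# Assumes line 1 of the input is a header and fields are whitespace-separated.
to_bed3() {
    # OFS='\t' makes awk join the printed fields with real tabs;
    # NR > 1 skips the header line.
    awk -v OFS='\t' 'NR > 1 { print $1, $2, $3 }' "$1"
}

# Example run on a throwaway file whose rows mirror the question's data.
tmp=$(mktemp)
printf 'chr start end superfluous_data\nchr1 3000000 3039999 0.0058\nchr1 3040000 3079999 0.0046\n' > "$tmp"
to_bed3 "$tmp"
rm -f "$tmp"
```

Setting OFS rather than embedding '\t' inside the single-quoted program sidesteps the nested-quote problem entirely.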

Also make sure that Heterochromatin.bed is sorted, if its sort order is unknown:

$ sort-bed Heterochromatin.unsorted.bed > Heterochromatin.bed

The subsequent bedmap command will return the unique ID values from the map file Heterochromatin.bed — values in the map file's fourth column — where there are overlaps with reference intervals:

$ bedmap --echo --echo-map-id-uniq intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

If there are no overlaps between the reference interval of interest and the map file, the ID field after the | delimiter is empty, as your example run appears to show correctly.
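
To make the mapping behaviour concrete, here is a rough awk imitation of what the command above reports. It is not bedmap's algorithm: it is a quadratic scan, it assumes tab-delimited input and half-open BED coordinates with any overlap of one base or more counting as a hit, and it lists IDs in order of first appearance rather than sorted. The function name map_ids is hypothetical.

```shell
# Hypothetical helper: for each reference interval, list the unique IDs
# (column 4) of overlapping map elements. Usage: map_ids reference.bed map.bed
map_ids() {
    awk -F'\t' '
        # First file on the command line (the map file): remember every element.
        NR == FNR { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; id[n] = $4; next }
        # Second file (the reference intervals): scan all map elements.
        {
            ids = ""; split("", seen)   # reset per-interval state
            for (i = 1; i <= n; i++) {
                if (chr[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0 && !(id[i] in seen)) {
                    seen[id[i]] = 1
                    ids = (ids == "" ? id[i] : ids ";" id[i])
                }
            }
            # Unmapped intervals are still printed, with nothing after the "|".
            print $1 "\t" $2 "\t" $3 "|" ids
        }
    ' "$2" "$1"
}

# Example using rows taken from the question's two files.
ref=$(mktemp); map=$(mktemp)
printf 'chr1\t3000000\t3039999\nchr1\t3040000\t3079999\nchr1\t3120000\t3159999\n' > "$ref"
printf 'chr1\t3049360\t3053345\tRegion_1\t0\t0\nchr1\t3136664\t3138809\tRegion_2\t0\t0\n' > "$map"
map_ids "$ref" "$map"
rm -f "$ref" "$map"
```

With these rows, the first interval has no overlap and prints with an empty field after the |, while the other two pick up Region_1 and Region_2 respectively.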

If you want answer.bed to omit intervals of interest that have no overlaps with the map file (Heterochromatin.bed), add --skip-unmapped:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

If you instead wanted (occupancy) signal or score data from the map file — values from the map file's fifth column — use --echo-map-score instead of --echo-map-id-uniq:

$ bedmap --echo --echo-map-score intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

Again, add the --skip-unmapped option if you do not want the result to contain intervals-of-interest with no overlaps.

If you instead wanted to calculate a summary statistic from score data of overlapping map elements, like a mean or standard deviation, replace --echo-map-score with --mean, --stdev, etc., e.g.:

$ bedmap --echo --mean --skip-unmapped intervals_of_interest.bed ../Heterochromatin.bed > answer.bed
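
For intuition about the summary-statistic case, the sketch below averages the fifth-column scores of overlapping map elements and drops reference intervals with no overlaps. It only approximates the idea and is not bedmap itself: the function name map_mean, the six-decimal output format, and the sample scores are all assumptions made for illustration (the Heterochromatin.bed rows shown in the question all carry a score of 0).

```shell
# Hypothetical helper: mean score (column 5) of map elements overlapping each
# reference interval; unmapped intervals are skipped.
# Usage: map_mean reference.bed map.bed
map_mean() {
    awk -F'\t' '
        # First file on the command line (the map file): store coordinates and scores.
        NR == FNR { n++; chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; score[n] = $5 + 0; next }
        # Second file (the reference intervals): accumulate scores of overlapping elements.
        {
            sum = 0; hits = 0
            for (i = 1; i <= n; i++) {
                if (chr[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0) { sum += score[i]; hits++ }
            }
            # Print only intervals with at least one overlap.
            if (hits > 0) printf "%s\t%s\t%s|%.6f\n", $1, $2, $3, sum / hits
        }
    ' "$2" "$1"
}

# Example with made-up scores: two elements (scores 2 and 4) fall in the
# first interval, none in the second.
ref=$(mktemp); map=$(mktemp)
printf 'chr1\t0\t100\nchr1\t200\t300\n' > "$ref"
printf 'chr1\t10\t20\ta\t2\nchr1\t50\t60\tb\t4\n' > "$map"
map_mean "$ref" "$map"
rm -f "$ref" "$map"
```

Swapping the accumulation step for a different reduction (sum, maximum, and so on) follows the same pattern.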

See bedmap --help or the online docs for a full listing of --echo-map-* and score-based statistical operations.