Question: imposing occupancy values from one bed file to different intervals
gravatar for chrisclarkson100
3.6 years ago by
European Union
chrisclarkson10080 wrote:

I have a txt file:

chr start end superfluous_data
chr1 3000000 3039999 0.00585524735801591 
chr1 3040000 3079999 0.00462068257738901 
chr1 3080000 3119999 0.00410291608104423 
chr1 3120000 3159999 0.00445902789765337

I manipulated the file as I am only interested in the intervals:

awk '{print $1,'\t',$2,'\t',$3}' data/file.txt > intervals_of_interest.bed

I wanted to get the occupancy values (specified by a different bed file) of a particular protein at these intervals.

Heterochromatin.bed (genome-wide):

chr1    3049360 3053345 Region_1        0       0
chr1    3136664 3138809 Region_2        0       0
chr1    3786627 3791240 Region_4        0       0
chr1    4164204 4167731 Region_5        0       0
chr1    4599546 4604437 Region_7        0       0
chr1    5355834 5360997 Region_10       0       0

My attempt to align and assign the occupancy values to the region of interest is as follows:

bedmap --echo --echo-map-id-uniq intervals_of_interest.bed ../Heterochromatin.bed

oddly the output looks like the below

chr     1       0|
tart    1       0|
nd      1       0|
hr1     3000000 3039999|
chr1    3040000 3079999|
chr1    3080000 3119999|
chr1    3120000 3159999|
chr1    3160000 3199999|
chr1    3200000 3239999|

(but more worrying is the fact that I can't seem to assign calculated occupancy values to these intervals):

Can anyone tell me if it is possible to re-calculate occupancy values from one bed file and map to different intervals?


assembly • 754 views
ADD COMMENTlink modified 2.1 years ago by Biostar ♦♦ 20 • written 3.6 years ago by chrisclarkson10080
gravatar for Alex Reynolds
3.6 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

If you have a header in the first line of intervals_of_interest.bed, use tail to strip it out, otherwise you will get a bogus result when using that file downstream:

$ awk '{print $1,'\t',$2,'\t',$3}' data/file.txt | tail -n+2 | sort-bed - > intervals_of_interest.bed

Also make sure that Heterochromatin.bed is sorted, if its sort order is unknown:

$ sort-bed Heterochromatin.unsorted.bed > Heterochromatin.bed

The subsequent bedmap command will return the unique ID values from the map file Heterochromatin.bed — values in the map file's fourth column — where there are overlaps with reference intervals:

$ bedmap --echo --echo-map-id-uniq intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

If there are no overlaps between the reference interval of interest and the map file, you get an empty result, as your example run seems to correctly show.

If you want the file answer.bed to leave out intervals of interest from the result, where there are no overlaps with the map file (Heterochromatin.bed), add --skip-unmapped:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

If you instead wanted (occupancy) signal or score data from the map file — values from the map file's fifth column — use --echo-map-score instead of --echo-map-id-uniq:

$ bedmap --echo --echo-map-score intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

Again, add the --skip-unmapped option if you do not want the result to contain intervals-of-interest with no overlaps.

If you instead wanted to calculate a summary statistic from score data of overlapping map elements, like a mean or standard deviation, replace --echo-map-score with --mean, --stdev, etc., e.g.:

$ bedmap --echo --mean --skip-unmapped intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

See bedmap --help or the online docs for a full listing of --echo-map-* and score-based statistical operations.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Alex Reynolds30k

Thank you for your insightful answer- Can I ask: after having used:

bedmap --echo --echo-map-score --skip-unmapped intervals_of_interest.bed ../Heterochromatin.bed > answer.bed

I get the following:

chr1    5520000 5559999|0.000000
chr1    5560000 5599999|0.000000
chr1    5800000 5839999|0.000000;0.000000
chr1    5960000 5999999|0.000000;0.000000
chr1    7600000 7639999|0.000000;0.000000;0.000000;0.000000
chr1    7640000 7679999|0.000000

Why do I get more than one value for some?

ADD REPLYlink written 3.6 years ago by chrisclarkson10080

You have one or more overlapping elements from Heterochromatin.bed, which overlap each interval of interest.

Where there are two or more overlaps, the score values of all overlapping elements are separated by a semi-colon character.

If you want to debug things and see exactly which elements are overlapping each interval-of-interest, use --echo-map in place of --echo-map-score.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Alex Reynolds30k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1755 users visited in the last hour