manipulating BED occupancy files' intervals
2
0
Entering edit mode
8.0 years ago

I am new to BED files. Is there a way to manipulate a BED file such that the regular intervals that document e.g. occupancy levels will change.

chr1    9825    9850    id-1    0.520000
chr1    9850    9875    id-2    1.760000
chr1    9875    9900    id-3    2.000000
chr1    9900    9925    id-4    2.000000
chr1    9925    9950    id-5    2.000000

converted to intervals of 75 rather than 25

chr1    9825    9800    id-1    averaged_value
chr1    9875    9950    id-2    averaged_value
chr1    10025    10100    id-3    averaged_value
chr1    10175    10250    id-4    averaged_value
etc..

The above conversion is just an example. It would be preferable to have the final product file in intervals of 10,000

alignment Assembly • 2.3k views
ADD COMMENT
1
Entering edit mode

your second intervals file is malformed. first line

chr1    9825    9800

looks like it should be

chr1    9725    9800

you can detect invalid regions' definitions by simply checking their 2nd and 3rd columns:

awk '$3<=$2' intervals.bed
ADD REPLY
3
Entering edit mode
8.0 years ago

You can use BEDOPS bedops and bedmap to do this efficiently in one pass.

You could use bedops --merge to merge the intervals, pipe them to bedops --chop again to split them up as you desire, and pipe them to bedmap --mean to get the mean signal over the newly split intervals.

For instance, given a BED file called signal.bed that contains your original intervals and their signal, to get the mean signal over intervals in 75-base chunks, you could do the following:

$ bedops --merge signal.bed | bedops --chop 75 - | bedmap --echo --mean --delim '\t' - signal.bed > answer.bed

Or do the following to get an answer over 10000-base sized chunks:

$ bedops --merge signal.bed | bedops --chop 10000 - | bedmap --echo --mean --delim '\t' - signal.bed > answer.bed

If you need row IDs, you can add the --echo-ref-row-id option to bedmap, which prints id- followed by the line number of the re-split interval:

$ bedops --merge signal.bed | bedops --chop 10000 - | bedmap --echo --echo-ref-row-id --mean --delim '\t' - signal.bed > answer.bed

Or use awk to get the same result:

$ bedops --merge signal.bed | bedops --chop 10000 - | bedmap --echo --mean --delim '\t' - signal.bed | awk '{ print $1"\t"$2"\t"$3"\tid-"NR"\t"$4; }' - > answer.bed
ADD COMMENT
0
Entering edit mode

This works! thank you but there's no link to upvote your answer

ADD REPLY
0
Entering edit mode

you can upvote comments by clicking on the little thumbs-up hand at the top left corner of each comment. I've just converted Alex's top-post comment to an answer in case you would like not only to upvote it, but also to accept it as the best answer.

ADD REPLY
0
Entering edit mode

Hi I recently tried dividing a file with 2964826300 lines into intervals of a million:

head sorted.bed
chr1    3000004 3000055 -
chr1    3000018 3000068 +
chr1    3000019 3000069 +
chr1    3000020 3000070 -
chr1    3000024 3000073 -

bedops --merge sorted.bed | bedops --chop 100000 - | bedmap --echo --mean --delim '\t' - sorted.bed > chopped_100000.bed

head chopped_100000.bed 
chr1    3000004 30020068706578497747938881449877737314362882630863587747513275139113652758508894266166904062754451450323262076938276160844316701266975455819838009019673671515494761275802674881615046553632768.000000
chr1    3002054 3004071 NAN
chr1    3004192 3014732 NAN
chr1    3014735 3034122 NAN
chr1    3034128 3040319 NAN

The rest of the file is fine but the first row is worrying... is something wrong with what I did?

ADD REPLY
0
Entering edit mode

You appear to use the --mean option on input that will not have score data in the fifth column. It looks like there is no numerical score or signal information in the fifth column of sorted.bed. I would check that input or remove --mean.

ADD REPLY
1
Entering edit mode
8.0 years ago

if you place your desired intervals in a 3columns-only bedfile, you can use bedtools map to get your desired output:

$ cat intervals.bed
chr1    9725    9800
chr1    9875    9950
chr1    10025   10100
chr1    10175   10250

$ cat input.bed
chr1    9825    9850    id-1    0.520000
chr1    9850    9875    id-2    1.760000
chr1    9875    9900    id-3    2.000000
chr1    9900    9925    id-4    2.000000
chr1    9925    9950    id-5    2.000000

$ bedtools map -c 5 -o mean -a intervals.bed -b input.bed
chr1    9725    9800    .
chr1    9875    9950    2
chr1    10025   10100   .
chr1    10175   10250   .

$ bedtools map -c 5 -o mean -a intervals.bed -b input.bed | perl -pe '$c++; s/(\S+)$/id-$c\t$1/'
chr1    9725    9800    id-1    .
chr1    9875    9950    id-2    2
chr1    10025   10100   id-3    .
chr1    10175   10250   id-4    .

the -o option tells map to perform a mean, the -c option tells map to perform it in the 5th column of the -b file, and the final perl writes the id-number column as requested.

ADD COMMENT
0
Entering edit mode

Hi there, Apologies for the late reply. While your method works for a small example file such as your example 'intervals.bed' I am actually required to specify perform this procedure for the entire genome bed file and hence I can't write out my desired intervals manually- do you know of a more practical way of doing this?

ADD REPLY
0
Entering edit mode

why not? if you have your genome bed:

$ cat human_g1k_v37.bed
1       0       249250621
2       0       243199373
3       0       198022430
4       0       191154276
5       0       180915260
6       0       171115067
7       0       159138663
8       0       146364022
9       0       141213431
10      0       135534747
11      0       135006516
12      0       133851895
13      0       115169878
14      0       107349540
15      0       102531392
16      0       90354753
17      0       81195210
18      0       78077248
19      0       59128983
20      0       63025520
21      0       48129895
22      0       51304566
X       0       155270560
Y       0       59373566

you can easily create your intervals:

$ sed 's/^/chr/' human_g1k_v37.bed \
| perl -lane '$w = 10000; $i = 0;
while ($i + $w < $F[2]) { print join "\t", $F[0], $F[1] + $i, $i + $w; $i += $w }
print join "\t", $F[0], $F[1] + $i, $F[2];
' > intervals.bed
ADD REPLY
1
Entering edit mode

Another option:

$ sed 's/^/chr/' human_g1k_v37.bed | sort-bed - | bedops --chop 10000 - > intervals.bed
ADD REPLY
0
Entering edit mode

... which definitely looks much cleaner

ADD REPLY
0
Entering edit mode

thank you it works perfect ly

ADD REPLY

Login before adding your answer.

Traffic: 996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6