Merging a bed file with regions delimited by reads
Entering edit mode
4.3 years ago
Jon B • 0


I have a massive file with a lot of regions in bed format. Many of them are overlapping either partially or entirely. I would like to merge these regions, together with their names, in a way that I can see exactly what region (ID'd by the name) contributed to the result.

Using the standard mergeBed doesn't do what I need it to. I can run mergeBed - i bigfile.bed -c1,12 -o count,collapse to get the clustered regions, but that is not what I want. Due to the data I have, there are partial overlaps in many many places and this leads to gargantuan merged intervals that do not specify where each name contributes.

Using this input -

chr1 150 158 A
chr1 155 165 B
chr1 160 170 C
chr1 177 190 D

With the mergeBed I listed above would result in the output -

chr1 150 170 A,B,C 3
chr1 177 190 D, 1

This is unfortunately not what I need.

Ideally, merging the above input would give me the following output (with the 4th column being the name of the region and the 5th being the number of input regions used for calculating the output region) -

chr1 150 154 A 1
chr1 155 158 A,B 2
chr1 159 159 B, 1
chr1 160 165 B,C 2
chr1 166 170 C 1
chr1 177 190 D 1

Note that the regions here are divided by the contributing regions, and you can easily see the original input regions in this result. It is also possible to deduce exactly what region is contributing, such that if you SNPs fall into one of these, you know exactly what is being directly impacted.

Is there any way to do this with existing software?

Thank you in advance

SNP genome mergeBed BEDTools • 965 views
Entering edit mode
4.3 years ago

Sure, use BEDOPS. You can partition intervals with bedops --partition and pipe them to bedmap to do mapping and counting:

$ bedops --partition intervals.bed | bedmap --echo --echo-map-id-uniq --count --delim '\t' --multidelim ',' - intervals.bed > answer.bed

For example:

$ echo -e 'chr1\t100\t200\tA\nchr1\t120\t220\tB\nchr1\t180\t260\tC' > test.bed

Here's what test.bed looks like:

$ less test.bed
chr1    100     200     A
chr1    120     220     B
chr1    180     260     C


$ bedops --partition test.bed | bedmap --echo --echo-map-id-uniq --count --delim '\t' --multidelim ',' - test.bed 
chr1    100     120     A       1
chr1    120     180     A,B     2
chr1    180     200     A,B,C   3
chr1    200     220     B,C     2
chr1    220     260     C       1
Entering edit mode

That worked beautifully, thank you.

Btw, is there an option to make bedmap count from 1 rather than from 0? When I have

chr1 5 10 chr1 6 20

I get

chr1 5 6 chr1 6 10 chr1 10 20

And I would prefer to see

chr1 5 5 chr1 6 10 chr1 11 20

Anyhoo, I've marked your answer as the correct one, thanks again.


Login before adding your answer.

Traffic: 2295 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6