Question

Group Reads In Bed File Based On The Distance

2

Entering edit mode

11.2 years ago

k.nirmalraman ★ 1.1k

Is there any tool that can group the reads in a bed file format that are few bases close to each other.

For instance,

Chr1   1023080  1023114 XYZ 806 +
Chr1   1023081  1023115 XYZ 50  +
Chr1   1023083  1023117 ABC 3   +
Chr1   1023085  1023119 cbd 90 +

I would like to group if the reads are atleast 4 bases close to each other... and report the results as follows

Chr1  1023080   1023117  xyz 859 +
Chr1   1023085  1023119 cbd 90 +

Is it possible using bedtools of awk in a short script? The name column doesnt really matter.. Any directions would be most welcome.. Thanks!!

bed • 3.5k views

ADD COMMENT • link updated 11.2 years ago by sjneph ▴ 690 • written 11.2 years ago by k.nirmalraman ★ 1.1k

0

Entering edit mode

I don't follow your example. 1) why is the second record in your output distinct from the first? 2) why are the coordinates of the first output record not Chr1:1023080-1023117

ADD REPLY • link 11.2 years ago by Aaronquinlan 12k

0

Entering edit mode

Yes, the first record should be 1023117... that was a typo mistake! The last record in fle is 4 bases away, I would like to group only the reads that fall within four bases range...

Hope I answered your question!

ADD REPLY • link 11.2 years ago by k.nirmalraman ★ 1.1k

score 3 · Answer 1 · 2013-02-05

3

Entering edit mode

11.2 years ago

KCC ★ 4.1k

Use bedtools:

mergeBed -i input.bed -d 4 -nms > output.bed

The expression "-d 4" means merge areas that are less than 4 bp apart. Also "-nms" gives you the names of what was merged separated by a semicolon. This isn't everything you want, but a simple perl or python script should give you the rest.

ADD COMMENT • link 11.2 years ago by KCC ★ 4.1k

0

Entering edit mode

Unfortunately, this won't do quite was the OP wants. The merge function looks for intervals that are within -d of the current merged block, not the start position of the first interval that started the block, as the OP wants. I think this is the same for both bedtools and bedops...

ADD REPLY • link 11.2 years ago by Aaronquinlan 12k

score 3 · Answer 2 · 2013-02-06

3

Entering edit mode

11.2 years ago

sjneph ▴ 690

You can do this easily with the tools in BEDOPS.


bedops --range 2 --merge your-bed \
  | bedops --range -2 -u - \
  | bedmap --echo --echo-map-id --sum --prec 0 --delim "\t" - your-bed

This will merge coordinates regions within 4 bp of each other (--range 2 symmetrically pads all BED coordinates by 2 bp). Then, we undo the extra bp on the ends of merged regions with a second bedops call. Finally, the bedmap statement returns a list of ids that overlap each merged region (notice above that your result only shows XYZ whereas the actual merged region overlaps XYZ and ABC regions; the --echo-map-id flag will return both to you as a semicolon separated list). And the --sum option will return the sum of scores over all merged entries. I also included --prec 0 since your example shows integers in the 5th column. Just drop that option if you want floating-point sums. Finally, the --delim "\t" just says to use tabs in between the columns of results just mentioned (by default, the '|' character is used).

ADD COMMENT • link 11.2 years ago by sjneph ▴ 690

0

Entering edit mode

Thank you.. Saved a lot of time!! :)

ADD REPLY • link 11.2 years ago by k.nirmalraman ★ 1.1k

2

Entering edit mode

You're welcome. I should have mentioned that your-bed should be sorted with sort-bed before you run through this.


  sort-bed your-bed > your-sorted-bed

Then, just substitute your-sorted-bed for 'your-bed' above. BEDOPS has strict sorting requirements so that things run faster and with less memory. The outputs from these tools are sorted though, so you never need to sort results before passing further downstream to other analyses.

ADD REPLY • link 11.2 years ago by sjneph ▴ 690