Question: Group Reads In Bed File Based On The Distance
2
gravatar for k.nirmalraman
6.8 years ago by
k.nirmalraman1000
Germany
k.nirmalraman1000 wrote:

Is there any tool that can group the reads in a bed file format that are few bases close to each other.

For instance,

Chr1   1023080  1023114 XYZ 806 +
Chr1   1023081  1023115 XYZ 50  +
Chr1   1023083  1023117 ABC 3   +
Chr1   1023085  1023119 cbd 90 +

I would like to group if the reads are atleast 4 bases close to each other... and report the results as follows

Chr1  1023080   1023117  xyz 859 +
Chr1   1023085  1023119 cbd 90 +

Is it possible using bedtools of awk in a short script? The name column doesnt really matter.. Any directions would be most welcome.. Thanks!!

bed • 2.0k views
ADD COMMENTlink modified 6.8 years ago by sjneph600 • written 6.8 years ago by k.nirmalraman1000

I don't follow your example. 1) why is the second record in your output distinct from the first? 2) why are the coordinates of the first output record not Chr1:1023080-1023117

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Aaronquinlan11k

Yes, the first record should be 1023117... that was a typo mistake! The last record in fle is 4 bases away, I would like to group only the reads that fall within four bases range...

Hope I answered your question!

ADD REPLYlink written 6.8 years ago by k.nirmalraman1000
3
gravatar for KCC
6.8 years ago by
KCC4.0k
Cambridge, MA
KCC4.0k wrote:

Use bedtools:

mergeBed -i input.bed -d 4 -nms > output.bed

The expression "-d 4" means merge areas that are less than 4 bp apart. Also "-nms" gives you the names of what was merged separated by a semicolon. This isn't everything you want, but a simple perl or python script should give you the rest.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by KCC4.0k

Unfortunately, this won't do quite was the OP wants. The merge function looks for intervals that are within -d of the current merged block, not the start position of the first interval that started the block, as the OP wants. I think this is the same for both bedtools and bedops...

ADD REPLYlink written 6.8 years ago by Aaronquinlan11k
3
gravatar for sjneph
6.8 years ago by
sjneph600
sjneph600 wrote:

You can do this easily with the tools in BEDOPS.


bedops --range 2 --merge your-bed \
  | bedops --range -2 -u - \
  | bedmap --echo --echo-map-id --sum --prec 0 --delim "\t" - your-bed

This will merge coordinates regions within 4 bp of each other (--range 2 symmetrically pads all BED coordinates by 2 bp). Then, we undo the extra bp on the ends of merged regions with a second bedops call. Finally, the bedmap statement returns a list of ids that overlap each merged region (notice above that your result only shows XYZ whereas the actual merged region overlaps XYZ and ABC regions; the --echo-map-id flag will return both to you as a semicolon separated list). And the --sum option will return the sum of scores over all merged entries. I also included --prec 0 since your example shows integers in the 5th column. Just drop that option if you want floating-point sums. Finally, the --delim "\t" just says to use tabs in between the columns of results just mentioned (by default, the '|' character is used).

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by sjneph600

Thank you.. Saved a lot of time!! :)

ADD REPLYlink written 6.8 years ago by k.nirmalraman1000
2

You're welcome. I should have mentioned that your-bed should be sorted with sort-bed before you run through this.


  sort-bed your-bed > your-sorted-bed

Then, just substitute your-sorted-bed for 'your-bed' above. BEDOPS has strict sorting requirements so that things run faster and with less memory. The outputs from these tools are sorted though, so you never need to sort results before passing further downstream to other analyses.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by sjneph600
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1872 users visited in the last hour