Hi, I'm using Bedtools for R (bedr) in Ubuntu and I am trying to merge genomic regions that are within 100 bp of one another. For example, I have a data frame (df) with 3274 obs of 3 variables and the first rows look like:
chr | start | end |
---|---|---|
chr1 | 1214882 | 1214884 |
chr1 | 1214942 | 1214944 |
chr1 | 1215030 | 1215032 |
chr1 | 1215036 | 1215038 |
and when I merge using:
> df1 <- bedr.merge.region(df.sorted, distance = 100, number = TRUE, check.zero.based = TRUE, check.chr = TRUE, check.valid = TRUE, check.sort = TRUE)
I get the data frame (df1):
chr | start | end | V4 |
---|---|---|---|
chr1 | 1214882 | 1215038 | 4 |
My goal is to merge all coordinates that fall within 100 bp (distance = 100). From data frame df, it should have merged the first 2 rows together and then the last 2 rows together, since there are fewer than 100 bp between the start of row 1 and the end of row 2 (in df). It should not have merged all 4 rows (as it shows in df1), since that gives a span of 156 bp (1215038 - 1214882 = 156).
Any help as to why the parameter "distance = 100" is not merging only regions within 100 bp and it merges regions at 156 bp? The goal is to be able to design probes for wet lab to capture regions of interest but our probes are limited to 100 bp so I want to see how many probes of 100 bp I would need to build to capture all regions and what would their coordinates be.
Thank you, Joana
To clarify, my expected output from the code is (df2):
chr | start | end | V4 |
---|---|---|---|
chr1 | 1214882 | 1214944 | 2 |
chr1 | 1215030 | 1215038 | 2 |
If you need to merge one element at a time, you might use a Python or awk script to read in one element at a time, storing the first element. Progressively merge while subsequent elements' end positions are within X bases of the first element's start position. When that distance test fails, print the range of the merged elements and reset the "first" element. Repeat this test as you iterate through the rest of the elements. This is a fairly basic scripting exercise.

Hi Alex,
Thank you for your reply. I figured it could be something simple to write, even though my experience in Python or R is very basic. I was hoping that if this function already existed, or if there was a parameter for the 'merge' function in Bedtools that I could use for this purpose, I could just plug in my data and get the probes for the wet lab portion, and then revisit the issue as a scripting exercise for my Python course. Thank you!
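
For anyone landing on this thread later: as far as I understand it, the distance option in bedtools merge is the maximum gap allowed between neighbouring features, not a cap on the total merged length. In the example the gaps between consecutive rows are 58, 86 and 4 bp, all under 100, so the merges chain into one 156 bp region. Below is a minimal Python sketch of the one-pass approach described in the answer above. The function name `merge_within_span` and the parameter `max_span` are made up for illustration, and the sketch assumes the input is already sorted by chromosome and start.

```python
def merge_within_span(regions, max_span=100):
    """Greedily group sorted (chrom, start, end) regions.

    A region joins the current group only if its end is within
    max_span bases of the group's first start, so no merged range
    built from several regions exceeds max_span.
    Assumes regions are sorted by chromosome, then start.
    """
    merged = []
    current = None  # [chrom, start, end, count] of the group being built
    for chrom, start, end in regions:
        fits = (
            current is not None
            and chrom == current[0]
            and end - current[1] <= max_span
        )
        if fits:
            # Extend the group; max() guards against nested regions.
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            # Distance test failed (or first region / new chromosome):
            # emit the finished group and start a new one.
            if current is not None:
                merged.append(tuple(current))
            current = [chrom, start, end, 1]
    if current is not None:
        merged.append(tuple(current))
    return merged


# The four rows from the question.
regions = [
    ("chr1", 1214882, 1214884),
    ("chr1", 1214942, 1214944),
    ("chr1", 1215030, 1215032),
    ("chr1", 1215036, 1215038),
]

for row in merge_within_span(regions):
    print(*row, sep="\t")
```

On the four rows from the question this gives the two merged rows shown in df2. Note that a single input region longer than `max_span` is kept as its own group rather than being split, so it would still need more than one probe.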