Question: How To Intersect A Range With Single Positions
3
gravatar for roll
5.5 years ago by
roll290
United Kingdom
roll290 wrote:

Hi,

Is there any tool that will find whether a position fall into a certain region other than awk or bedtools intersect? That is, lets's assume i have a range with position

chr1:10000-20000

and i have a list of positions

chr1 100 1 2 3
chr1 12000 1 2 3
chr1 15000 1 2 3
chr1 25000 1 2 3

So that it will catch chr1 12000 1 2 3 and chr1 15000 1 2 3

I am trying this with awk but it is too slow considering i have lots of positions and ranges. I thought about bedtools intersect but then again, i have some extra information that i dont know how to keep that extra information if i convert the single positions into bed format.

Is there any tools that does this faster than awk?

bed intersect • 3.1k views
ADD COMMENTlink modified 5.5 years ago by Neilfws48k • written 5.5 years ago by roll290

If you want to keep that information in a BED file, you can just separate the last 3 columns by commas and make them the 4th column. You'll have to reparse the output, of course.

ADD REPLYlink written 5.5 years ago by Devon Ryan91k

OR write a short script that will merge the new intersect file with the original file with the extra information.

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k
4
gravatar for dariober
5.5 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

Convert your positions to bed as you said and use intersectBed to give all the columns in your a and b files with options -wa and -wb (or I miss something?):

cat region.bed 
chr1    10000    20000

cat pos.txt 
chr1    100    1    2    3
chr1    12000    1    2    3
chr1    15000    1    2    3
chr1    25000    1    2    3

 awk 'BEGIN{OFS="\t"} {print $1, $2-1, $2, $3, $4, $5}' pos.txt \
    | intersectBed -a - -b region.bed -wa -wb

chr1    11999    12000    1    2    3    chr1    10000    20000
chr1    14999    15000    1    2    3    chr1    10000    20000

Hope this helps! Dario

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by dariober10k

Great. Thanks for the info. It has been a while since I have updated myself :-)

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k

I've been trying to use this example for my situation and it prints to the screen well but when I try to output it to a file, it doesn't work. This is what I've tried: awk 'BEGIN{OFS="\t"} {print $1, $2-1,$2, $3, $4, $5, $6, $7, $8, $9, $10}' Nonsensetranslated.txt \ | intersectBed -a - -b final_starts_stops.bed -wa -wb > done.txt

My error informs me:

awk: can't open file  

 input record number 15577, file  

 source line number 1

ADD REPLYlink written 4.2 years ago by christylynn0020
1
gravatar for Alex Reynolds
5.5 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Using BEDOPS bedmap might be helpful, to instead get an answer in a more concise range-to-mapped-positions format:

$ sort-bed ranges.unsorted.bed > ranges.bed
$ awk '{print $1"\t"$2"\t"$2+1"\t"$3"\t"$4"\t"$5}' positions.txt \
    | sort-bed - \ 
    | bedmap --echo --echo-map ranges.bed - \
    > answer.bed

Results will show each range, along with any mapped positions that overlap that range:

chr1   10000    20000 | chr1   12000 12001  1  2  3;  chr1   15000 15001  1  2  3

You can strip the extra column out with downstream tools (another awk statement, for instance).

ADD COMMENTlink written 5.5 years ago by Alex Reynolds28k
1
gravatar for Neilfws
5.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Another solution: the findOverlaps() method from the R package IRanges. It's very fast, once you get everything into the right format and figure out the R syntax.

Some example code, using your example data:

library(IRanges)
# first, create data frames
df1 <- data.frame(chr = "chr1", start = 10000, end = 20000)
df2 <- data.frame(chr = rep("chr1", 4), posn = c(100, 12000, 15000, 250000), x = rep(1, 4), y = rep(2, 4), z = rep(3, 4))
# next, create IRanges objects
df1.ir <- IRanges(start = df1$start, end = df1$end, names = df1$chr)
df2.ir <- IRanges(start = df2$posn, end = df2$posn, names = df2$chr)
# convert the "subject" to an interval tree (much faster)
df1.tree <- IntervalTreedf1.ir)
# now find the overlaps
o <- findOverlapsdf2.ir, df1.tree, type = "within")
df2[o@queryHits,]
#       chr  posn x y z
#   2  chr1 12000 1 2 3
#   3  chr1 15000 1 2 3
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Neilfws48k

Just adding on to this answer. Please see here. Basically its better to use GRanges when chromosomal context is required.

ADD REPLYlink written 8 months ago by satyanarayan.iiitm0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 989 users visited in the last hour