Common genomic intervals in R
2
0
Entering edit mode
7.6 years ago
viniciushs88 ▴ 50

I would like to infer shared genomic interval between different samples.

My input:

sample    chr start end
NE001      1   100  200
NE001      2   100  200
NE002      1   50   150
NE002      2   50   150
NE003      2   250  300

My expected output:

chr start end  freq
1    100  150   2
2    100  150   2

Where the "freq" is the how many samples have contribuited to infer the shared region. In the above example freq = 2 (NE001 and NE002).

Cheers!

R genome overlap • 3.2k views
0
Entering edit mode

What have you tried so far?

2
Entering edit mode

Actually, I know how to measure the shared numbers between two regions.

input:

start1  end1    start2  end2    bp_overlapped
20    30       25      35          6
25    35       20      30          6
100    190     126     226          65
126    226     100     190          65

Function:

f <- function(x) length( intersect(seq(x[1],x[2],1), seq(x[3],x[4],1)) )

a <- apply(X,1,f)

And I known how to overlap regions and a "cbind" with overlapped ones too. But when all regions are in a same dataframe, and have a collumn tagging different animals... sincerely I don´t know how can I start the logic...

0
Entering edit mode

Cool, we try to ensure that posters have given it a go before providing help. Hopefully my answer (below) will work well for you.

2
Entering edit mode
7.6 years ago

This is a bit easier to do if you use the GenomicRanges package from bioconductor:

library(GenomicRanges)
gr <- GRanges(seqname=c(1,2,1,2,2), range=IRanges(start=c(100,100,50,50,250),
end=c(200,200,150,150,300))) #N.B., You wouldn't do this manually
bins <- disjoin(gr)
mcols(bins)$count <- countOverlaps(bins, gr)  The disjoin() command splits the ranges into no-overlapping segments. Edit: I left off a ) in an earlier version. Mea culpa! ADD COMMENT 1 Entering edit mode 7.6 years ago If you can use system() within R, or if you don't need to use R, here's a way to do it with BEDOPS (i.e., you'd wrap these calls in system() or use the command-line): $ sort-bed inputs.unsorted.bed > inputs.bed
\$ bedops --intersect inputs.bed | bedmap --echo --count - inputs.bed > answer.bed