3.2 years ago
User 7754 ▴ 250

Hi,

I have a dataset with two sets of ranges (.df1 and .df2) and I am trying to find a common set of ranges across both simultaneously. Meaning, I would like to reduce the ranges only if they overlap both sequences in a row.

 df = data.frame(seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1,3,3), end.df1 = c(4,8,14),
seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2,8,4), end.df2 = c(5,11,17),
score1=c(0,1,2), score2=c(2,3,4),score3=c(3,4,5), score4=c(5,6,7))


Desired output:

out = data.frame(seqnames.df1 = c("chr1", "chr1"), start.df1 = c(1,3), end.df1 = c(14,8),
seqnames.df2 = c("chr1", "chr1"), start.df2 = c(2,8), end.df2 = c(17,11),
score1=c(2,1), score2=c(4,3),score3=c(5,4), score4=c(7,6))


Rows 1 and 3 are reduced to the union of their ranges, because both the .df1 ranges and the .df2 ranges overlap. Row 2 is not reduced because, although its .df1 range overlaps other ranges, its .df2 range does not. For merged rows, the maximum score is kept.

Is there a clever way of doing this with GenomicRanges or other packages? I am struggling to find a good approach. For now, I am creating the reduced ranges for each set and then looking for overlaps. Is this the right direction?

This is what I have so far:

library(GenomicRanges)

df1.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df1, start=df$start.df1, end=df$end.df1))
df2.gr = makeGRangesFromDataFrame(data.frame(chrom=df$seqnames.df2, start=df$start.df2, end=df$end.df2))

# merge overlapping ranges within each set
df1.main.gr = reduce(df1.gr)
df2.main.gr = reduce(df2.gr)

# map each original range to the reduced range that contains it
hits1 = findOverlaps(df1.gr, df1.main.gr)
hits2 = findOverlaps(df2.gr, df2.main.gr)


Thank you for any suggestions!

Can you set up your inputs as BED files (BED5)? I'd like to try to help, but I loaded your data frames into R and still don't really understand what you're trying to do. Some concrete input and output would help.

Yes, I was also looking at this in R last night, but I realised that it was not clear what the OP wanted. It also looks like it would be easier outside of R.

Hi, thank you for looking into this.

I think R is the most appropriate because of the "GenomicRanges" package. Setting up the problem would be similarly hard with bedtools, unless there is a specific function for this situation. What I am trying to do is merge the overlaps across rows only if the ranges in both sets of columns overlap.

So, taking rows 1 and 2 as an example, merge the two rows only if

[df$seqnames.df1, df$start.df1, df$end.df1] from row 1 and row 2 overlap, AND [df$seqnames.df2, df$start.df2, df$end.df2] from row 1 and row 2 overlap.

If only one of these overlaps, for example

[df$seqnames.df1, df$start.df1, df$end.df1] overlap, BUT [df$seqnames.df2, df$start.df2, df$end.df2] do not overlap,

then we don't reduce the ranges in these rows and leave the rows independent, as they are.

Meaning, I would like to reduce the ranges only if they overlap both sequences in a row.
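
The pairwise test described above can be sketched in base R as follows. This is only an illustration under assumptions: intervals are treated as closed and 1-based (the GRanges convention), and `ranges_overlap` and `both_overlap` are hypothetical helper names, not GenomicRanges functions.

```r
# Toy data from the question (range columns only)
df <- data.frame(
  seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1, 3, 3), end.df1 = c(4, 8, 14),
  seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2, 8, 4), end.df2 = c(5, 11, 17))

# Hypothetical helper: do rows i and j overlap in the range set named by `set`
# ("df1" or "df2")? Assumes closed, 1-based intervals.
ranges_overlap <- function(d, i, j, set) {
  chr   <- d[[paste0("seqnames.", set)]]
  start <- d[[paste0("start.", set)]]
  end   <- d[[paste0("end.", set)]]
  chr[i] == chr[j] && start[i] <= end[j] && start[j] <= end[i]
}

# Hypothetical helper: rows i and j are merge candidates only if BOTH sets overlap
both_overlap <- function(d, i, j) {
  ranges_overlap(d, i, j, "df1") && ranges_overlap(d, i, j, "df2")
}

rows_1_2 <- both_overlap(df, 1, 2)  # .df1 overlaps, .df2 ([2,5] vs [8,11]) does not
rows_1_3 <- both_overlap(df, 1, 3)  # both sets overlap
```

Here `rows_1_2` is FALSE and `rows_1_3` is TRUE, so rows 1 and 2 are not merged with each other, while rows 1 and 3 are merge candidates.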

3.2 years ago

If you just want ranges (no metadata), here is one approach using Unix streams and BEDOPS tools.

Put your files into sorted BED format:

$ sort-bed A.unsorted.bed > A.bed
$ sort-bed B.unsorted.bed > B.bed


Separate out elements which overlap by at least one base in both sets, and merge their genomic space:

$ bedops --merge <(bedops --element-of 1 A.bed B.bed) <(bedops --element-of 1 B.bed A.bed) > merge.bed

Take the union of the set of elements which do not overlap, and cut out everything but genomic space:

$ bedops --everything <(bedops --not-element-of 1 A.bed B.bed) <(bedops --not-element-of 1 B.bed A.bed) | cut -f1-3 > disjoint.bed


Take the union of the merged and disjoint space:

$ bedops --everything merge.bed disjoint.bed > answer.bed


The file answer.bed should have your ranges.

Thank you Alex.

A.bed                                                      B.bed

chr1    1       4                                          chr1    2       5
chr1    3       8                                          chr1    8       11
chr1    3       14                                         chr1    4       17


But then this gives me only the overall merged range ("chr1:1:17")? I was thinking the solution to my problem could be to use findOverlaps row by row, to check whether both overlaps are true. I am still not sure how best to approach this...

Every element in those two sets overlaps.
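
That can be checked directly on the three intervals in each file. The sketch below is base R, not BEDOPS; it assumes BED's 0-based, half-open coordinates and a single chromosome, and `overlaps_any` is a hypothetical helper name.

```r
# Intervals from A.bed and B.bed above (all on chr1)
A <- data.frame(start = c(1, 3, 3), end = c(4, 8, 14))
B <- data.frame(start = c(2, 8, 4), end = c(5, 11, 17))

# Hypothetical helper: for each interval in x, does it overlap any interval in y?
# Half-open [start, end) intervals overlap when start_x < end_y and start_y < end_x.
overlaps_any <- function(x, y) {
  sapply(seq_len(nrow(x)), function(i) any(x$start[i] < y$end & y$start < x$end[i]))
}

a_hits <- overlaps_any(A, B)  # one logical per A element
b_hits <- overlaps_any(B, A)  # one logical per B element
all(a_hits, b_hits)
```

Since every element overlaps something in the other file by at least one base, nothing is left over for disjoint.bed, and merging all six intervals gives the single range chr1 1 17.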

3.2 years ago

library(GenomicRanges)

df1_GR <- makeGRangesFromDataFrame(
  df,
  seqnames.field="seqnames.df1",
  start.field="start.df1",
  end.field="end.df1",
  keep.extra.columns=TRUE)

df2_GR <- makeGRangesFromDataFrame(
  df,
  seqnames.field="seqnames.df2",
  start.field="start.df2",
  end.field="end.df2",
  keep.extra.columns=TRUE)

# find row indices where the df1_GR and df2_GR ranges in the same row overlap by >1 base position
indicesOverlapping <- c()
for (i in seq_len(nrow(df)))
{
  if (length(queryHits(findOverlaps(df1_GR[i,], df2_GR[i,], type="any", minoverlap=2)))) {
    indicesOverlapping <- c(indicesOverlapping, i)
  }
}

# reduce / collapse segments where rows have matched
final <- data.frame(
  reduce(df1_GR[indicesOverlapping,]),
  reduce(df2_GR[indicesOverlapping,]))

colnames(final) <- c(
  "seqnames.df1", "start.df1", "end.df1", "width.df1", "strand.df1",
  "seqnames.df2", "start.df2", "end.df2", "width.df2", "strand.df2")

final <- final[,-which(colnames(final) %in% c("width.df1", "strand.df1", "width.df2", "strand.df2"))]

# final result is reduced segments + those rows that did not originally match
final <- rbind(
  final,
  df[-indicesOverlapping, c("seqnames.df1", "start.df1", "end.df1", "seqnames.df2", "start.df2", "end.df2")]
)

final

seqnames.df1 start.df1 end.df1 seqnames.df2 start.df2 end.df2
1         chr1         1      14         chr1         2      17
2         chr1         3       8         chr1         8      11


This obviously doesn't include the scores, but the general structure is there [I believe] for doing what you need.
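
The row-by-row loop can also be written without a loop. The following base R sketch assumes closed, 1-based coordinates (as GRanges uses) and reproduces the `minoverlap=2` condition arithmetically. It is an illustration only, not part of the GenomicRanges API, and `same_chr` and `overlap_width` are names introduced here.

```r
# Toy data from the question
df <- data.frame(
  seqnames.df1 = c("chr1", "chr1", "chr1"), start.df1 = c(1, 3, 3), end.df1 = c(4, 8, 14),
  seqnames.df2 = c("chr1", "chr1", "chr1"), start.df2 = c(2, 8, 4), end.df2 = c(5, 11, 17),
  score1 = c(0, 1, 2), score2 = c(2, 3, 4), score3 = c(3, 4, 5), score4 = c(5, 6, 7))

# Ranges on different sequences can never overlap
same_chr <- df$seqnames.df1 == df$seqnames.df2

# Number of positions shared by the .df1 and .df2 range in each row
# (zero or negative when the two ranges are disjoint)
overlap_width <- pmin(df$end.df1, df$end.df2) - pmax(df$start.df1, df$start.df2) + 1

# Rows whose two ranges share at least 2 positions
indicesOverlapping <- which(same_chr & overlap_width >= 2)
```

For the example data this selects rows 1 and 3. Row 2 shares only position 8 between [3,8] and [8,11], so it fails the two-base threshold.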

Yes!! This is exactly what I have been trying to do but kept getting stuck at the reduce part! Brilliant. thanks so much!!

Adding scores is simple with this setup!

indicesOverlapping = c()
scores = c()
for (i in 1:nrow(df))
{
  if (length(queryHits(findOverlaps(df1_GR[i,], df2_GR[i,], type="any", minoverlap=2)))) {
    indicesOverlapping <- c(indicesOverlapping, i)
    # choose scores from the first overlapping row
    scores = df[indicesOverlapping[1], c("score1", "score2", "score3", "score4")]
  }
}


# Final result is reduced segments + those rows that did not originally match
final <- rbind(
  cbind(final, scores),
  df[-indicesOverlapping, c("seqnames.df1", "start.df1", "end.df1", "seqnames.df2", "start.df2", "end.df2", "score1", "score2", "score3", "score4")]
)
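
The question asks for the maximum score to be kept, whereas the loop above takes the scores of the first overlapping row. If the maximum is what is wanted, one possible base R sketch is below. It assumes `indicesOverlapping` is c(1, 3), as in the example, and that the selected rows collapse into a single merged row.

```r
# Score columns from the question's toy data
df <- data.frame(
  score1 = c(0, 1, 2), score2 = c(2, 3, 4), score3 = c(3, 4, 5), score4 = c(5, 6, 7))

# Assumed: the rows found to overlap in the example
indicesOverlapping <- c(1, 3)
score_cols <- c("score1", "score2", "score3", "score4")

# Column-wise maximum over the merged rows, kept as a one-row data frame
scores <- as.data.frame(lapply(df[indicesOverlapping, score_cols], max))
```

The resulting one-row `scores` can then be attached to the reduced ranges with `cbind(final, scores)`.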

Great - you're welcome