Question: How do I subset data under the following conditions? The data must match on one column (say, chromosome), and another column must fall within a range given by two columns of another data set.
Background: I have two treatment groups, each in its own data frame, with columns such as chromosome number, start position, and stop position. The two data frames have different numbers of rows, so to cbind them without introducing NA values I need to extract a subset from each treatment (two subsets in total), keeping only the rows that are on the same chromosome and whose start-to-stop range overlaps a range in the other data set. In other words, I want two data sets of the same length in which each row is directly comparable, by position range, to the corresponding row of the other data set.
An alternative solution would be some way to handle the uneven lengths of the data without introducing NAs or truncating viable data.
What I've tried so far:
For one data set:
UU_working = subset(UU, UU$start[UU$chromosome == z] %in% DD$start[DD$chromosome == z]:DD$end[DD$chromosome==z])
But this gives the error "numerical expression has 120612 elements: only the first used" and returns an empty table with only the header row.
I also tried using the tidyverse:
DD_filtered = DD %>% filter(DD$chromosome==UU$chromosome, DD$start >= UU$start, DD$end <= UU$end)
But with this I got a disturbingly low number of matches (6 in a genome's worth of methylation frequencies), along with the following warnings:
Warning messages:
1: In DD$chromosome == UU$chromosome :
  longer object length is not a multiple of shorter object length
2: In DD$start >= UU$start :
  longer object length is not a multiple of shorter object length
3: In DD$end <= UU$end :
  longer object length is not a multiple of shorter object length
Swapping DD and UU, to trim the other data set as well, gives this output:
Error in filter_impl(.data, quo) :
  Result must have length 2069400, not 2264003
In addition: Warning messages:
1: In UU$chromosome == DD$chromosome :
  longer object length is not a multiple of shorter object length
2: In UU$start >= DD$start :
  longer object length is not a multiple of shorter object length
3: In UU$end <= DD$end :
  longer object length is not a multiple of shorter object length
I think this should work to generate some basic sample data:
chrs = c(1:10)
starts = c(1:10)
ends = c(2:11)
uu_sample = data.frame(chrs, starts, ends)
chrs = c(2:12)
starts = c(2:12)
ends = c(4:14)
dd_sample = data.frame(chrs, starts, ends)
The ideal sample data to play with is two data sets of uneven length in which some, but not all, of the observations overlap with observations in the other data frame. The basic logic of the desired command is: if the chromosome of an observation in uu_sample matches the chromosome of an observation in dd_sample, and the uu_sample start lies between the dd_sample start and end, then the two observations overlap, so keep them and filter out the rest. This should produce two data sets of equal length in which each observation overlaps an observation in the other table. (I already have the code to analyze the data once the rows are paired up and the lengths are equal.)
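The logic described above could be sketched in base R roughly as follows. This is only a minimal illustration on the sample data, not a tested solution for the real UU and DD tables. The names pairs, overlapping, uu_matched, and dd_matched are made up for the example. The overlap test is the general interval-overlap condition (each range starts before the other one ends), which is slightly broader than the "uu start between dd start and end" rule stated above.

```r
# Sample data, same values as in the question
uu_sample <- data.frame(chrs = 1:10, starts = 1:10, ends = 2:11)
dd_sample <- data.frame(chrs = 2:12, starts = 2:12, ends = 4:14)

# Pair every uu row with every dd row on the same chromosome.
# Columns present in both tables get the suffixes "_uu" and "_dd".
pairs <- merge(uu_sample, dd_sample, by = "chrs", suffixes = c("_uu", "_dd"))

# Keep only the pairs whose [start, end] ranges overlap
# (assumption: ranges are closed, so touching endpoints count as overlap)
overlapping <- pairs[pairs$starts_uu <= pairs$ends_dd &
                     pairs$ends_uu >= pairs$starts_dd, ]

# Split back into two row-aligned data frames of equal length
uu_matched <- overlapping[, c("chrs", "starts_uu", "ends_uu")]
dd_matched <- overlapping[, c("chrs", "starts_dd", "ends_dd")]
```

Because merge() builds every same-chromosome pair before filtering, this may be too slow or memory-hungry on millions of rows. In that case a dedicated interval join such as data.table::foverlaps() or GenomicRanges::findOverlaps() is probably a better fit. Note also that a row overlapping several rows in the other table will appear once per match.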