Struggling to filter data with R, tidyverse
1
1
Entering edit mode
5.4 years ago
davidfarr01 ▴ 10

Question: How do I subset data with the following conditions? The the data must match one column (say, a chromosome), and one other column must fit in a range provided by two columns of another data set?

Background: I have two treatment groups, with columns like chromosome#, start, and stop position. Since the two treatment groups, which are two different data frames, are different size, to be able to cbind them without introducing NA values, I need to try and extract a treatment subset (a total of two) where I only keep the values that are from the same chromosome and overlap somewhere between their start and stop positions in the other chromosome. In other words, I want to get two data sets of the same length where each row is directly comparable within a range position of the other dataset.

An alternative solution could be a way to counter the uneven lengths of the data, without introducing NA's or truncating viable data.

What I've tried so far:

For one data set:

UU_working = subset(UU, UU$start[UU$chromosome == z] %in%
DD$start[DD$chromosome == z]:DD$end[DD$chromosome==z])

But this give the error " numerical expression has 120612 elements: only the first used" and a blank table with only header row.

I also tried using the tidyverse:

DD_filtered = DD %>%
  filter(DD$chromosome==UU$chromosome,
         DD$start >= UU$start,
         DD$end <= UU$end)

But using this I got disturbingly low matches (6 in a genome's worth of methylation frequencies) and gives the following warnings:

Warning messages:
1: In DD$chromosome == UU$chromosome :
  longer object length is not a multiple of shorter object length
2: In DD$start >= UU$start :
  longer object length is not a multiple of shorter object length
3: In DD$end <= UU$end :
  longer object length is not a multiple of shorter object length

as well as this output when switching DD for UU to try and even out the other data set too:

Error in filter_impl(.data, quo) : 
  Result must have length 2069400, not 2264003
In addition: Warning messages:
1: In UU$chromosome == DD$chromosome :
  longer object length is not a multiple of shorter object length
2: In UU$start >= DD$start :
  longer object length is not a multiple of shorter object length
3: In UU$end <= DD$end :
  longer object length is not a multiple of shorter object length

I think this should work to generate some basic sample data:

chrs = c(1:10)
starts = c(1:10)
ends = c(2:11)
uu_sample=data.frame(chrs,starts,ends)

chrs = c(2:12)
starts = c(2:12)
ends = c(4:14)
dd_sample=data.frame(chrs,starts,ends)

The ideal output to play with is two uneven data sets where some, but not all of the observations in the data frame will overlap with observations from the other data frame. The basic logic of the desired command would be: if the chromosome from an observation in uu_sample matches the chromosome observation in dd_sample and the uu_sample start is between the dd_sample start and end then the two observations must overlap - keep them, but filter out the others. This should generate two data sets of even length where each observation overlaps with an observation in the other table (I have the code I need to analyze it as soon as I can get these paired up properly and even).

R methylation genome • 5.3k views
ADD COMMENT
2
Entering edit mode

What you are describing is not "filter" but "merge by overlap".

Read about data.table::foverlap and iranges::findOverlaps, and see this post at SO:

ADD REPLY
0
Entering edit mode

This one is a little new... I followed the link in the answer using foverlap to try and replicated with the following code:

install.packages("data.table")
require(data.table)
setkey(UU)
DD[, start := end]
foverlaps(DD, UU, by.x = names(DD), type = "within", mult = "all", nomatch = 0L)

But I get this error running the final line

Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ",  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(x[[xintervals[2L]]], x[[xintervals[1L]]]) :
  ‘-’ not meaningful for factors
ADD REPLY
2
Entering edit mode

like this?

> sqldf("select * from uu_sample u inner join dd_sample d on u.chrs=d.chrs and u.starts >= d.starts and u.ends <= d.ends")

  chrs starts ends chrs starts ends
1    2      2    3    2      2    4
2    3      3    4    3      3    5
3    4      4    5    4      4    6
4    5      5    6    5      5    7
5    6      6    7    6      6    8
6    7      7    8    7      7    9
7    8      8    9    8      8   10
8    9      9   10    9      9   11
9   10     10   11   10     10   12
ADD REPLY
0
Entering edit mode

Thank you, this did work. It is worth noting that this vs. the foverlap code provided by zx8754 has a striking difference in computing time. Although SQL in usually a really efficient language, it took over 30 minutes (I went to bed) to complete the SELECT statement as for example the actual 'DD' data frame contained millions of observations. The other one completed in less than a minute.

ADD REPLY
0
Entering edit mode

sure..for certain things data.table is wicked fast.

ADD REPLY
0
Entering edit mode

It would be easier to visualize if you gave us some sample data. Check out this post on how to produce a reproducible example.

ADD REPLY
0
Entering edit mode

I've edited it to add code for a basic pair of data sets. Let me know if this works - the goal was to make sure there was overlap and that they weren't even.

ADD REPLY
2
Entering edit mode
5.4 years ago
zx8754 11k

Here is using data.table:foverlap:

# example input
uu_sample <- data.frame(chrs = c(1:10),
                        starts = c(1:10),
                        ends = c(2:11))

dd_sample <- data.frame(chrs = c(2:12),
                        starts = c(2:12),
                        ends = c(4:14))


library(data.table)

# set the keys
setDT(uu_sample, key = c("chrs", "starts", "ends"))
setDT(dd_sample, key = c("chrs", "starts", "ends"))

foverlaps(uu_sample, dd_sample, type = "within", mult = "all", nomatch = 0L)
#    chrs starts ends i.starts i.ends
# 1:    2      2    4        2      3
# 2:    3      3    5        3      4
# 3:    4      4    6        4      5
# 4:    5      5    7        5      6
# 5:    6      6    8        6      7
# 6:    7      7    9        7      8
# 7:    8      8   10        8      9
# 8:    9      9   11        9     10
# 9:   10     10   12       10     11
ADD COMMENT
0
Entering edit mode

Perfect! This and the SQL query as provided in another response worked, but this one worked in less than a minute. Great!

ADD REPLY
0
Entering edit mode

The speed was the main reason Matt Dowle (author) started developing his own faster version of dataframes.

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6