Question: Struggling to filter data with R, tidyverse
1
gravatar for davidfarr01
3 months ago by
davidfarr0110
davidfarr0110 wrote:

Question: How do I subset data with the following conditions? The the data must match one column (say, a chromosome), and one other column must fit in a range provided by two columns of another data set?

Background: I have two treatment groups, with columns like chromosome#, start, and stop position. Since the two treatment groups, which are two different data frames, are different size, to be able to cbind them without introducing NA values, I need to try and extract a treatment subset (a total of two) where I only keep the values that are from the same chromosome and overlap somewhere between their start and stop positions in the other chromosome. In other words, I want to get two data sets of the same length where each row is directly comparable within a range position of the other dataset.

An alternative solution could be a way to counter the uneven lengths of the data, without introducing NA's or truncating viable data.

What I've tried so far:

For one data set:

UU_working = subset(UU, UU$start[UU$chromosome == z] %in%
DD$start[DD$chromosome == z]:DD$end[DD$chromosome==z])

But this give the error " numerical expression has 120612 elements: only the first used" and a blank table with only header row.

I also tried using the tidyverse:

DD_filtered = DD %>%
  filter(DD$chromosome==UU$chromosome,
         DD$start >= UU$start,
         DD$end <= UU$end)

But using this I got disturbingly low matches (6 in a genome's worth of methylation frequencies) and gives the following warnings:

Warning messages:
1: In DD$chromosome == UU$chromosome :
  longer object length is not a multiple of shorter object length
2: In DD$start >= UU$start :
  longer object length is not a multiple of shorter object length
3: In DD$end <= UU$end :
  longer object length is not a multiple of shorter object length

as well as this output when switching DD for UU to try and even out the other data set too:

Error in filter_impl(.data, quo) : 
  Result must have length 2069400, not 2264003
In addition: Warning messages:
1: In UU$chromosome == DD$chromosome :
  longer object length is not a multiple of shorter object length
2: In UU$start >= DD$start :
  longer object length is not a multiple of shorter object length
3: In UU$end <= DD$end :
  longer object length is not a multiple of shorter object length

I think this should work to generate some basic sample data:

chrs = c(1:10)
starts = c(1:10)
ends = c(2:11)
uu_sample=data.frame(chrs,starts,ends)

chrs = c(2:12)
starts = c(2:12)
ends = c(4:14)
dd_sample=data.frame(chrs,starts,ends)

The ideal output to play with is two uneven data sets where some, but not all of the observations in the data frame will overlap with observations from the other data frame. The basic logic of the desired command would be: if the chromosome from an observation in uu_sample matches the chromosome observation in dd_sample and the uu_sample start is between the dd_sample start and end then the two observations must overlap - keep them, but filter out the others. This should generate two data sets of even length where each observation overlaps with an observation in the other table (I have the code I need to analyze it as soon as I can get these paired up properly and even).

methylation R genome • 301 views
ADD COMMENTlink modified 3 months ago by zx87546.5k • written 3 months ago by davidfarr0110
2

What you are describing is not "filter" but "merge by overlap".

Read about data.table::foverlap and iranges::findOverlaps, and see this post at SO:

ADD REPLYlink written 3 months ago by zx87546.5k

This one is a little new... I followed the link in the answer using foverlap to try and replicated with the following code:

install.packages("data.table")
require(data.table)
setkey(UU)
DD[, start := end]
foverlaps(DD, UU, by.x = names(DD), type = "within", mult = "all", nomatch = 0L)

But I get this error running the final line

Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ",  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(x[[xintervals[2L]]], x[[xintervals[1L]]]) :
  ‘-’ not meaningful for factors
ADD REPLYlink modified 3 months ago by Devon Ryan88k • written 3 months ago by davidfarr0110
2

like this?

> sqldf("select * from uu_sample u inner join dd_sample d on u.chrs=d.chrs and u.starts >= d.starts and u.ends <= d.ends")

  chrs starts ends chrs starts ends
1    2      2    3    2      2    4
2    3      3    4    3      3    5
3    4      4    5    4      4    6
4    5      5    6    5      5    7
5    6      6    7    6      6    8
6    7      7    8    7      7    9
7    8      8    9    8      8   10
8    9      9   10    9      9   11
9   10     10   11   10     10   12
ADD REPLYlink modified 3 months ago • written 3 months ago by cpad011211k

Thank you, this did work. It is worth noting that this vs. the foverlap code provided by zx8754 has a striking difference in computing time. Although SQL in usually a really efficient language, it took over 30 minutes (I went to bed) to complete the SELECT statement as for example the actual 'DD' data frame contained millions of observations. The other one completed in less than a minute.

ADD REPLYlink written 3 months ago by davidfarr0110

sure..for certain things data.table is wicked fast.

ADD REPLYlink written 3 months ago by cpad011211k

It would be easier to visualize if you gave us some sample data. Check out this post on how to produce a reproducible example.

ADD REPLYlink written 3 months ago by RamRS20k

I've edited it to add code for a basic pair of data sets. Let me know if this works - the goal was to make sure there was overlap and that they weren't even.

ADD REPLYlink written 3 months ago by davidfarr0110
2
gravatar for zx8754
3 months ago by
zx87546.5k
London
zx87546.5k wrote:

Here is using data.table:foverlap:

# example input
uu_sample <- data.frame(chrs = c(1:10),
                        starts = c(1:10),
                        ends = c(2:11))

dd_sample <- data.frame(chrs = c(2:12),
                        starts = c(2:12),
                        ends = c(4:14))


library(data.table)

# set the keys
setDT(uu_sample, key = c("chrs", "starts", "ends"))
setDT(dd_sample, key = c("chrs", "starts", "ends"))

foverlaps(uu_sample, dd_sample, type = "within", mult = "all", nomatch = 0L)
#    chrs starts ends i.starts i.ends
# 1:    2      2    4        2      3
# 2:    3      3    5        3      4
# 3:    4      4    6        4      5
# 4:    5      5    7        5      6
# 5:    6      6    8        6      7
# 6:    7      7    9        7      8
# 7:    8      8   10        8      9
# 8:    9      9   11        9     10
# 9:   10     10   12       10     11
ADD COMMENTlink written 3 months ago by zx87546.5k

Perfect! This and the SQL query as provided in another response worked, but this one worked in less than a minute. Great!

ADD REPLYlink written 3 months ago by davidfarr0110

The speed was the main reason Matt Dowle (author) started developing his own faster version of dataframes.

ADD REPLYlink written 3 months ago by zx87546.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1149 users visited in the last hour