Question

Random Sampling And Comparing Objects N Times: R

0

Entering edit mode

10.1 years ago

hessjl ▴ 90

New to R.

Problem: I have created objects a and b, a contains 1,200,000 rows and b contains 5000 rows. I want to repeatedly sample 1000 rows from a and compare the items in a to those items in b to find out how many overlapping items exist.

All I know is how to create an object:

a <- c(1,2,3,....L)
b <- c(1,2,3,....L)

...and, randomly sample a:

c <- sample(a, 1000)

However, I don't know how to compare a and b to determine the number of overlapping items. My attempt at this with a simple if, then statement returned an error:

if (c==b) "TRUE" else "FALSE"
[1] "FALSE"
Warning messages:
1: In c == b :
  longer object length is not a multiple of shorter object length
2: In if (c == b) "TRUE" else "FALSE" :
  the condition has length > 1 and only the first element will be used

First problem, I need to know how to identify overlapping items between objects. Secondly, I need to repeat this multiple times (~10,000), perhaps through a for() loop?

Any insight will be much appreciated.

random r • 6.4k views

ADD COMMENT • link updated 9.2 years ago by Biostar 20 • written 10.1 years ago by hessjl ▴ 90

0

Entering edit mode

Are a and b just numbers or are they genomic positions (e.g. from a GRanges object)? The former case is quite simple:

table(c %in% b)

The latter case will depend on the object, though many objects allow intersects and comparisons. BTW, avoid for loops in R, they're really slow.

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

score 4 · Answer 1 · 2014-02-28

4

Entering edit mode

10.1 years ago

Jarretinha 3.4k

This is clearly a problem on sets. You just need to count over the intersection of the two sets. Try these functions:

sets {base}

Simple example:

a <- c(1:10)
b <- c(5:15)
a
[1]  1  2  3  4  5  6  7  8  9 10
b
[1]  5  6  7  8  9 10 11 12 13 14 15
intersect(a,b)
[1]  5  6  7  8  9 10
#Looping part
for(i in 1:10){
    c <- sample(a, 5)
    intersect(c, b)
}

Now, it's up to you adapt the code. I suggest you to read about apply functions. They are quite handy in situations like yours. If your elements are intervals of coordinates you could use findInterval.

ADD COMMENT • link 10.1 years ago by Jarretinha 3.4k

0

Entering edit mode

I think that's a fair solution for solving the problem related to overlapping elements, but there's a critical gap remaining. I intend to perform this "intersection" on only a sample of a items, yet need to do so over and over. I'm not concerned with the knowing the absolute intersection, only the intersection from smaller samples of a to the whole b.
How might I perform this iteratively with a simple loop, and ask to write the ouput into a text file formatted as follows:

Trial (n) Intersections (y)

[1] 500

[2] 154

[3] 10

[4] 95

.

[10000] 299

ADD REPLY • link 10.1 years ago by hessjl ▴ 90

score 4 · Answer 2 · 2014-02-28

4

Entering edit mode

10.1 years ago

zx8754 11k

It is helpful if you give us context, how is this related to bioinformatics, what are these numbers? Otherwise, it is just another programming question more suitable for stackoverflow.

Here is my try:

#dummy data
# a is a bucket of numbers
a <- c(1:100)
# b is a small selection of numbers
b <- round(runif(10,1,100))

# do 10 times: if we pick randomly 10 numbers from bucket
#              how many overlap with the b
output <- sapply(1:10,
                 function(TrialCount)
                   c(TrialCount,length(intersect(sample(a,10), b))))

# pretty output
output <- as.data.frame(t(output))
colnames(output) <- c("Trial(n)","Intersections(y)")

ADD COMMENT • link 10.1 years ago by zx8754 11k

0

Entering edit mode

The purpose is to test the proportion of mutations expected to be identified as cis-regulating elements from a random sample (i.e., a simple index of CREs expected to be encountered by chance search).
Thanks for your insight, I will give this dummy script a trial run.

ADD REPLY • link 10.1 years ago by hessjl ▴ 90