Question: Random Sampling And Comparing Objects N Times: R
0
6.6 years ago by
hessjl70
United States
hessjl70 wrote:

New to R.

Problem: I have created objects `a` and `b`, a contains 1,200,000 rows and `b` contains 5000 rows. I want to repeatedly sample 1000 rows from `a` and compare the items in `a` to those items in `b` to find out how many overlapping items exist.

All I know is how to create an object:

``````a <- c(1,2,3,....L)
b <- c(1,2,3,....L)
``````

...and, randomly sample `a`:

``````c <- sample(a, 1000)
``````

However, I don't know how to compare `a` and `b` to determine the number of overlapping items. My attempt at this with a simple if, then statement returned an error:

``````if (c==b) "TRUE" else "FALSE"
[1] "FALSE"
Warning messages:
1: In c == b :
longer object length is not a multiple of shorter object length
2: In if (c == b) "TRUE" else "FALSE" :
the condition has length > 1 and only the first element will be used
``````

First problem, I need to know how to identify overlapping items between objects. Secondly, I need to repeat this multiple times (~10,000), perhaps through a for() loop?

Any insight will be much appreciated.

R random • 4.2k views
modified 5.7 years ago by Biostar ♦♦ 20 • written 6.6 years ago by hessjl70

Are `a` and `b` just numbers or are they genomic positions (e.g. from a `GRanges` object)? The former case is quite simple:

``````table(c %in% b)
``````

The latter case will depend on the object, though many objects allow intersects and comparisons. BTW, avoid `for` loops in R, they're really slow.

4
6.6 years ago by
zx87549.7k
London
zx87549.7k wrote:

It is helpful if you give us context, how is this related to bioinformatics, what are these numbers? Otherwise, it is just another programming question more suitable for stackoverflow.

Here is my try:

``````#dummy data
# a is a bucket of numbers
a <- c(1:100)
# b is a small selection of numbers
b <- round(runif(10,1,100))

# do 10 times: if we pick randomly 10 numbers from bucket
#              how many overlap with the b
output <- sapply(1:10,
function(TrialCount)
c(TrialCount,length(intersect(sample(a,10), b))))

# pretty output
output <- as.data.frame(t(output))
colnames(output) <- c("Trial(n)","Intersections(y)")
``````

The purpose is to test the proportion of mutations expected to be identified as cis-regulating elements from a random sample (i.e., a simple index of CREs expected to be encountered by chance search).
Thanks for your insight, I will give this dummy script a trial run.

3
6.6 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

This is clearly a problem on sets. You just need to count over the intersection of the two sets. Try these functions:

sets {base}

Simple example:

``````a <- c(1:10)
b <- c(5:15)
a
[1]  1  2  3  4  5  6  7  8  9 10
b
[1]  5  6  7  8  9 10 11 12 13 14 15
intersect(a,b)
[1]  5  6  7  8  9 10
#Looping part
for(i in 1:10){
c <- sample(a, 5)
intersect(c, b)
}
``````

Now, it's up to you adapt the code. I suggest you to read about apply functions. They are quite handy in situations like yours. If your elements are intervals of coordinates you could use findInterval.

I think that's a fair solution for solving the problem related to overlapping elements, but there's a critical gap remaining. I intend to perform this "intersection" on only a sample of `a` items, yet need to do so over and over. I'm not concerned with the knowing the absolute intersection, only the intersection from smaller samples of `a` to the whole `b`.
How might I perform this iteratively with a simple loop, and ask to write the ouput into a text file formatted as follows:

Trial (n) Intersections (y)

[1] 500

[2] 154

[3] 10

[4] 95

.

.

.

[10000] 299