Question: Random Sampling And Comparing Objects N Times: R
0
gravatar for hessjl
5.5 years ago by
hessjl50
United States
hessjl50 wrote:

New to R.

Problem: I have created objects a and b, a contains 1,200,000 rows and b contains 5000 rows. I want to repeatedly sample 1000 rows from a and compare the items in a to those items in b to find out how many overlapping items exist.

All I know is how to create an object:

a <- c(1,2,3,....L)
b <- c(1,2,3,....L)

...and, randomly sample a:

c <- sample(a, 1000)

However, I don't know how to compare a and b to determine the number of overlapping items. My attempt at this with a simple if, then statement returned an error:

if (c==b) "TRUE" else "FALSE"
[1] "FALSE"
Warning messages:
1: In c == b :
  longer object length is not a multiple of shorter object length
2: In if (c == b) "TRUE" else "FALSE" :
  the condition has length > 1 and only the first element will be used

First problem, I need to know how to identify overlapping items between objects. Secondly, I need to repeat this multiple times (~10,000), perhaps through a for() loop?

Any insight will be much appreciated.

R random • 3.3k views
ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 5.5 years ago by hessjl50

Are a and b just numbers or are they genomic positions (e.g. from a GRanges object)? The former case is quite simple:

table(c %in% b)

The latter case will depend on the object, though many objects allow intersects and comparisons. BTW, avoid for loops in R, they're really slow.

ADD REPLYlink written 5.5 years ago by Devon Ryan91k
4
gravatar for zx8754
5.5 years ago by
zx87547.9k
London
zx87547.9k wrote:

It is helpful if you give us context, how is this related to bioinformatics, what are these numbers? Otherwise, it is just another programming question more suitable for stackoverflow.

Here is my try:

#dummy data
# a is a bucket of numbers
a <- c(1:100)
# b is a small selection of numbers
b <- round(runif(10,1,100))

# do 10 times: if we pick randomly 10 numbers from bucket
#              how many overlap with the b
output <- sapply(1:10,
                 function(TrialCount)
                   c(TrialCount,length(intersect(sample(a,10), b))))

# pretty output
output <- as.data.frame(t(output))
colnames(output) <- c("Trial(n)","Intersections(y)")
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by zx87547.9k

The purpose is to test the proportion of mutations expected to be identified as cis-regulating elements from a random sample (i.e., a simple index of CREs expected to be encountered by chance search).
Thanks for your insight, I will give this dummy script a trial run.

ADD REPLYlink written 5.5 years ago by hessjl50
3
gravatar for Jarretinha
5.5 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

This is clearly a problem on sets. You just need to count over the intersection of the two sets. Try these functions:

sets {base}

Simple example:

a <- c(1:10)
b <- c(5:15)
a
[1]  1  2  3  4  5  6  7  8  9 10
b
[1]  5  6  7  8  9 10 11 12 13 14 15
intersect(a,b)
[1]  5  6  7  8  9 10
#Looping part
for(i in 1:10){
    c <- sample(a, 5)
    intersect(c, b)
}

Now, it's up to you adapt the code. I suggest you to read about apply functions. They are quite handy in situations like yours. If your elements are intervals of coordinates you could use findInterval.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Jarretinha3.3k

I think that's a fair solution for solving the problem related to overlapping elements, but there's a critical gap remaining. I intend to perform this "intersection" on only a sample of a items, yet need to do so over and over. I'm not concerned with the knowing the absolute intersection, only the intersection from smaller samples of a to the whole b.
How might I perform this iteratively with a simple loop, and ask to write the ouput into a text file formatted as follows:

Trial (n) Intersections (y)

[1] 500

[2] 154

[3] 10

[4] 95

.

.

.

[10000] 299

ADD REPLYlink written 5.5 years ago by hessjl50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1671 users visited in the last hour