Question: R Help: For Loop Over Range Of Number And Calculate Average Using An If Statement
4
71490 wrote:

Hi guys,

I have a file (named DP.2L) which looks like this:

``````            CHROM        POS       SAMPLE_1
1            1168      47
1            1197      40
1            1202      45
``````

POS ranges from 1168 to 49359284. I generated a sequence vector containing so that I get bin sizes starting from 1168, to 49359284 by 10,000 steps:

``````            > seqWrapper <- function(lb, ub, by=1) {
+  s <- c()
+ if(!ub <= lb) s <- seq(lb,ub, by=by)
+ return(s)
+ }
> x<-seqWrapper(lb=1168, ub=49359284,by=10000)
``````

I want to sequentially calculate the average values of SAMPLE_1 if the value of x is less than or equal to the value of POS but more than the value of the next value of x. For example if x values are :1168 11168 21168 , then I want to first look at the first x value (1168) and compare it with the POS column. I want it to calculate the average of the corresponding rows in SAMPLE_1 if the value of its corresponding POS is less than or equal to the first values of x (1168). But for the next x value (11168), I want the average should only be calculated x is <= POS but > the first x value (1168).

I got as far as this but its not right....

``````            t<- for (i in x) { if (DP.2L<=x & DP.2L > x[i]) {(mean(DP.2L\$SAMPLE_1))} }
``````
R • 48k views
modified 4.5 years ago by Biostar ♦♦ 20 • written 6.4 years ago by 71490
1

Your analysis is not really 'sliding-window' but 'consecutive windows' because in sliding-windows are overlapping in my understanding, but according to your definition your regions do not overlap. I suggest to consider whether this is really what you want and if it is appropriate at all. I can just guess that you have data from a a tiling array or ChIP-chip experiment, but in this case for plotting, a real sliding window approach might be preferable.

Yes, Michael what I actually meant to say was a non-overlapping window. As I understood a sliding window can be overlapping or non-overlapping, because the term "sliding" refers to the analysis sliding across the data.

6
Michael Dondrup47k wrote:

In short, if you use a for loop in R, you are most likely doing it wrong.

Convert your data into a `Rle` object (run-length encoded data) and use IRanges `Views` to accomplish a more efficient solution of the naive implementation (corrected your approach, didn't test it though, because it is inefficient to uselessness for larger datasets).

Now let's take this implementation as a definition of what you want to accomplish:

``````    res = numeric();
for (i in 1:length(x)-1) {
tmp <- DP.2L[DP.2L\$POS<=x[i+1] & DP.2L\$POS > x[i], ];
res = c(res,(mean(tmp\$SAMPLE_1));
}
``````

The IRanges Views and RLE - classes provides a toolset to approach this most efficiently and elegantly:

``````library(IRanges)
## first create some dummy data
rle <- Rle(NA,lengths=49359284) # Rle-vector
## you'd use:
# ind <- DP.2L\$pos
ind <- seq(1, 49359284, by=10) # insert data at every 10. pos
rle[ind] <- rpois(length(ind), 40) # let data be poisson distributed around 40, as in your example ;)
# rle[ind] <- DP.2L\$SAMPLE_1
## now create all windows of size 10000 in one go, starting from position 1168 as required:
my.views <- successiveViews(rle, width=rep(10000, 49359284/10000), from=1168)
my.views
## ...
## voila, compute the means of all your windows
viewMeans(rle.views, na.rm=TRUE)
``````

Now, you can compare the performance to any other solution on your data.

4
zx87549.4k wrote:

It would help if you clarify the "why", the reason for doing what you are trying to do, there might be a better way.

Anyway, this example should help with your question:

``````#dummy data
df <- data.frame(CHROM=rep(1,100),
POS=1:100,
SAMPLE_1=round(runif(100,40,80)))
#create bins
bin <- seq(1,100,5)

#get mean per bin
sapply(1:(length(bin)-1),
function(i)
mean(df[ df\$POS>=bin[i] &
df\$POS < bin[i+1], "SAMPLE_1"]))
``````