R Help: For Loop Over Range Of Number And Calculate Average Using An If Statement
2
4
Entering edit mode
9.6 years ago
714 ▴ 110

Hi guys,

I have a file (named DP.2L) which looks like this:

            CHROM        POS       SAMPLE_1
1            1168      47
1            1197      40
1            1202      45


POS ranges from 1168 to 49359284. I generated a sequence vector containing so that I get bin sizes starting from 1168, to 49359284 by 10,000 steps:

            > seqWrapper <- function(lb, ub, by=1) {
+  s <- c()
+ if(!ub <= lb) s <- seq(lb,ub, by=by)
+ return(s)
+ }
> x<-seqWrapper(lb=1168, ub=49359284,by=10000)


I want to sequentially calculate the average values of SAMPLE_1 if the value of x is less than or equal to the value of POS but more than the value of the next value of x. For example if x values are :1168 11168 21168 , then I want to first look at the first x value (1168) and compare it with the POS column. I want it to calculate the average of the corresponding rows in SAMPLE_1 if the value of its corresponding POS is less than or equal to the first values of x (1168). But for the next x value (11168), I want the average should only be calculated x is <= POS but > the first x value (1168).

I got as far as this but its not right....

            t<- for (i in x) { if (DP.2L<=x & DP.2L > x[i]) {(mean(DP.2L$SAMPLE_1))} }  r • 55k views ADD COMMENT 1 Entering edit mode Your analysis is not really 'sliding-window' but 'consecutive windows' because in sliding-windows are overlapping in my understanding, but according to your definition your regions do not overlap. I suggest to consider whether this is really what you want and if it is appropriate at all. I can just guess that you have data from a a tiling array or ChIP-chip experiment, but in this case for plotting, a real sliding window approach might be preferable. ADD REPLY 0 Entering edit mode Yes, Michael what I actually meant to say was a non-overlapping window. As I understood a sliding window can be overlapping or non-overlapping, because the term "sliding" refers to the analysis sliding across the data. ADD REPLY 6 Entering edit mode 9.6 years ago Michael 53k In short, if you use a for loop in R, you are most likely doing it wrong. Convert your data into a Rle object (run-length encoded data) and use IRanges Views to accomplish a more efficient solution of the naive implementation (corrected your approach, didn't test it though, because it is inefficient to uselessness for larger datasets). Now let's take this implementation as a definition of what you want to accomplish:  res = numeric(); for (i in 1:length(x)-1) { tmp <- DP.2L[DP.2L$POS<=x[i+1] & DP.2L$POS > x[i], ]; res = c(res,(mean(tmp$SAMPLE_1));
}


The IRanges Views and RLE - classes provides a toolset to approach this most efficiently and elegantly:

library(IRanges)
## first create some dummy data
rle <- Rle(NA,lengths=49359284) # Rle-vector
## you'd use:
# ind <- DP.2L$pos ind <- seq(1, 49359284, by=10) # insert data at every 10. pos rle[ind] <- rpois(length(ind), 40) # let data be poisson distributed around 40, as in your example ;) ## ofc you'd use instead: # rle[ind] <- DP.2L$SAMPLE_1
## now create all windows of size 10000 in one go, starting from position 1168 as required:
my.views <- successiveViews(rle, width=rep(10000, 49359284/10000), from=1168)
my.views
## ...
## voila, compute the means of all your windows
viewMeans(rle.views, na.rm=TRUE)


Now, you can compare the performance to any other solution on your data.

4
Entering edit mode
9.6 years ago
zx8754 11k

It would help if you clarify the "why", the reason for doing what you are trying to do, there might be a better way.

Anyway, this example should help with your question:

#dummy data
df <- data.frame(CHROM=rep(1,100),
POS=1:100,
SAMPLE_1=round(runif(100,40,80)))
#create bins
bin <- seq(1,100,5)

#get mean per bin
sapply(1:(length(bin)-1),
function(i)
mean(df[ df$POS>=bin[i] & df$POS < bin[i+1], "SAMPLE_1"]))

0
Entering edit mode

Thanks so much! Sorry I forgot to mention that I am trying to perform a sliding window analysis using genomic data