Question: R Help: For Loop Over Range Of Number And Calculate Average Using An If Statement
4
gravatar for 714
6.4 years ago by
71490
71490 wrote:

Hi guys,

I have a file (named DP.2L) which looks like this:

            CHROM        POS       SAMPLE_1     
            1            1168      47
            1            1197      40
            1            1202      45

POS ranges from 1168 to 49359284. I generated a sequence vector containing so that I get bin sizes starting from 1168, to 49359284 by 10,000 steps:

            > seqWrapper <- function(lb, ub, by=1) {
            +  s <- c()
            + if(!ub <= lb) s <- seq(lb,ub, by=by)
            + return(s)
            + }
            > x<-seqWrapper(lb=1168, ub=49359284,by=10000)

I want to sequentially calculate the average values of SAMPLE_1 if the value of x is less than or equal to the value of POS but more than the value of the next value of x. For example if x values are :1168 11168 21168 , then I want to first look at the first x value (1168) and compare it with the POS column. I want it to calculate the average of the corresponding rows in SAMPLE_1 if the value of its corresponding POS is less than or equal to the first values of x (1168). But for the next x value (11168), I want the average should only be calculated x is <= POS but > the first x value (1168).

I got as far as this but its not right....

            t<- for (i in x) { if (DP.2L<=x & DP.2L > x[i]) {(mean(DP.2L$SAMPLE_1))} }
R • 48k views
ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 6.4 years ago by 71490
1

Your analysis is not really 'sliding-window' but 'consecutive windows' because in sliding-windows are overlapping in my understanding, but according to your definition your regions do not overlap. I suggest to consider whether this is really what you want and if it is appropriate at all. I can just guess that you have data from a a tiling array or ChIP-chip experiment, but in this case for plotting, a real sliding window approach might be preferable.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Michael Dondrup47k

Yes, Michael what I actually meant to say was a non-overlapping window. As I understood a sliding window can be overlapping or non-overlapping, because the term "sliding" refers to the analysis sliding across the data.

ADD REPLYlink written 6.4 years ago by 71490
6
gravatar for Michael Dondrup
6.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

In short, if you use a for loop in R, you are most likely doing it wrong.

Convert your data into a Rle object (run-length encoded data) and use IRanges Views to accomplish a more efficient solution of the naive implementation (corrected your approach, didn't test it though, because it is inefficient to uselessness for larger datasets).

Now let's take this implementation as a definition of what you want to accomplish:

    res = numeric();
    for (i in 1:length(x)-1) { 
      tmp <- DP.2L[DP.2L$POS<=x[i+1] & DP.2L$POS > x[i], ];
      res = c(res,(mean(tmp$SAMPLE_1)); 
    }

The IRanges Views and RLE - classes provides a toolset to approach this most efficiently and elegantly:

library(IRanges)
## first create some dummy data
rle <- Rle(NA,lengths=49359284) # Rle-vector
## you'd use:
# ind <- DP.2L$pos
ind <- seq(1, 49359284, by=10) # insert data at every 10. pos
rle[ind] <- rpois(length(ind), 40) # let data be poisson distributed around 40, as in your example ;) 
## ofc you'd use instead:
# rle[ind] <- DP.2L$SAMPLE_1
## now create all windows of size 10000 in one go, starting from position 1168 as required:
my.views <- successiveViews(rle, width=rep(10000, 49359284/10000), from=1168)
my.views
## ... 
## voila, compute the means of all your windows
viewMeans(rle.views, na.rm=TRUE)

Now, you can compare the performance to any other solution on your data.

see also here: https://stat.ethz.ch/pipermail/bioconductor/2010-September/035532.html

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Michael Dondrup47k
4
gravatar for zx8754
6.4 years ago by
zx87549.4k
London
zx87549.4k wrote:

It would help if you clarify the "why", the reason for doing what you are trying to do, there might be a better way.

Anyway, this example should help with your question:

#dummy data
df <- data.frame(CHROM=rep(1,100),
                 POS=1:100,
                 SAMPLE_1=round(runif(100,40,80)))
#create bins
bin <- seq(1,100,5)

#get mean per bin
sapply(1:(length(bin)-1),
       function(i)
         mean(df[ df$POS>=bin[i] &
                    df$POS < bin[i+1], "SAMPLE_1"]))
ADD COMMENTlink written 6.4 years ago by zx87549.4k

Thanks so much! Sorry I forgot to mention that I am trying to perform a sliding window analysis using genomic data

ADD REPLYlink written 6.4 years ago by 71490
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1582 users visited in the last hour