Question: Splitting a dataframe into overlapping groups of equal sizes
2.5 years ago by
user_g20 wrote:

Hello, I am looking for a way to split my data into groups, where each group spans the same window size that I define.

Chrom   Start   End
chr1    1       10
chr1    11      20
chr1    21      30
chr1    31      40

For example, if I want a window size of 20, then the groups would be: 1-20, 11-30, 21-40. Rows keep being added to the same group as long as the size of the group does not exceed 20.

I tried using the split function but couldn't get it to work this way. Is there a way around this?

irange grange genomicrange R • 857 views
written 2.5 years ago by user_g20

Some questions:

Do you have a dataframe or a GRanges object? Your example data looks like a dataframe but you mentioned GRanges.

Can the same row go into different groups?

Also, does each start value automatically open a new dataframe? For example, if you had a row c("chr1","16","25"), this would create a dataframe spanning 16 to 35. In that case you will have as many dataframes as rows...

What do you want to achieve after that splitting?

written 2.5 years ago by Bastien Hervé4.8k

I am alternating between data frames and GRanges to find the best way to achieve this, so if I could find a way to do this in GRanges then I will convert my data frame into a GRanges object.

Yes, the same row can be in another group.

Yes, that's true. I will end up with the same number of clusters as rows, but each cluster will be a set of rows in a data frame or GRanges object, not each row as an independent data frame.

I need these clusters to study them further in the next stage.

written 2.5 years ago by user_g20

Why not use a for loop over your dataframe and do your processing inside the loop?

Something like this:

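library(regioneR)  # assumed source of toGRanges(); the package was not named originally

df <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr1"),
                 start = c(1, 11, 21, 31),
                 end = c(10, 20, 30, 40))
window_size <- 20  # window size from the example in the question

for (row in seq_len(nrow(df))) {
    # Keep the rows whose start falls inside the window opened by the current row
    window_start <- df[row, "start"]
    df_cluster <- df[df$start >= window_start & df$start < window_start + window_size, ]
    ### Here you can process each cluster
    ### Create your GRanges
    my_GRange <- toGRanges(df_cluster)
}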

If you really need all your GRanges at the same time, you can create a list before the loop, append each GRanges object to it inside the loop, and use the list after the loop.
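
A minimal sketch of that list-based variant, assuming plain data frames are enough for the collection step (each element could be converted to a GRanges object afterwards). The name `clusters` and the value of `window_size` are illustrative, not taken from the original data.

```r
df <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr1"),
                 start = c(1, 11, 21, 31),
                 end   = c(10, 20, 30, 40))
window_size <- 20  # assumed value, taken from the example in the question

# Pre-allocate the list once, then fill one slot per row.
# This avoids growing the list with repeated appends.
clusters <- vector("list", nrow(df))
for (row in seq_len(nrow(df))) {
    window_start <- df$start[row]
    in_window <- df$start >= window_start &
                 df$start < window_start + window_size
    clusters[[row]] <- df[in_window, ]
}
```

After the loop, `clusters[[1]]` holds the rows 1-10 and 11-20, and the last element holds only the final row, since no later row exists to extend it.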

modified 2.5 years ago • written 2.5 years ago by Bastien Hervé4.8k

Hello, yes, I tried using a for loop, but it became very slow on large data. This is why I am looking for another way.

written 2.5 years ago by user_g20
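
One possible way to avoid the row-by-row loop is sketched below in base R. It is not from the thread: the helper `window_groups` and its output columns are hypothetical names, and it assumes the rows of one chromosome do not overlap, so that sorting by start also sorts the ends. Instead of copying rows into one data frame per group, it uses `findInterval()` to compute, for every row at once, the index of the last row that still fits inside the window.

```r
# Hypothetical helper: for each row, find the last row whose end still
# fits inside a window of `window_size` opened at that row's start.
# Assumes non-overlapping rows from a single chromosome.
window_groups <- function(d, window_size) {
    d <- d[order(d$start), ]
    idx <- seq_len(nrow(d))
    # Number of ends <= start + window_size - 1, i.e. index of the last fitting row
    last <- findInterval(d$start + window_size - 1, d$end)
    # A row longer than the window still forms a group on its own
    last <- pmax(last, idx)
    data.frame(chrom = d$chrom,
               group_start = d$start,
               group_end = d$end[last],
               first_row = idx,
               last_row = last)
}

df <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr1"),
                 start = c(1, 11, 21, 31),
                 end   = c(10, 20, 30, 40))

res <- window_groups(df, window_size = 20)
```

For the example data this gives the groups 1-20, 11-30, 21-40, plus a trailing 31-40 group for the last row. The rows of group i are rows `first_row[i]` to `last_row[i]` of the sorted data. With several chromosomes, the helper could be applied per chromosome, for example with `do.call(rbind, lapply(split(df, df$chrom), window_groups, window_size = 20))`.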