                    7.7 years ago
        mbyvcm
I posted this question on Stack Overflow, but it did not get a response. Since it is a bioinformatics query, perhaps this is a better forum: is there a Pythonic equivalent to the R ranges operation below?
In R (albeit long-winded):
Here is a test data.frame:
df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )
First I make a GRanges object:
gr <- GenomicRanges::GRanges(
  seqnames = df$CHR,
  ranges = IRanges::IRanges(start = df$START, end = df$STOP)
  )
Then I reduce the intervals, collapsing overlapping ones into a new GRanges object:
reduced <- reduce(gr)
Now I append a new column to the original data.frame that records which rows belong to the same contiguous 'chunk':
df$locus <- subjectHits(findOverlaps(gr, reduced))
Output:
> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3
How do I do this in Python?
In what structure is your data stored in Python?
The data is stored as a CSV file on disk. As a Python newbie, I guess I would load it as a pandas table, but I am open to suggestions.
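For what it is worth, the reduce-then-findOverlaps idea can be sketched in plain Python without any genomics library: sort the intervals by chromosome and start, sweep through them while tracking the furthest stop seen so far, and open a new locus whenever an interval starts beyond that point. This is only a minimal sketch under some assumptions: the function name `label_loci` and its `slack` parameter are made up for illustration, the CSV text is an in-memory stand-in for the file on disk, and strand is ignored. With `slack=0` only overlapping intervals are merged; GenomicRanges' `reduce` may also merge directly abutting ranges by default, which `slack=1` would approximate.

```python
import csv
import io

# Stand-in for the CSV on disk; column names follow the R example above.
CSV_TEXT = """CHR,START,STOP
1,100,150
1,200,350
1,300,400
2,100,500
2,400,450
"""


def label_loci(rows, slack=0):
    """Return one 1-based locus id per row, in the original row order.

    rows  : list of (chrom, start, stop) tuples with closed integer coordinates
    slack : hypothetical knob; 0 merges only overlapping intervals,
            1 also merges intervals that directly abut
    """
    # Visit rows sorted by chromosome, then start, remembering original positions.
    # Chromosomes are compared as given (here: strings), which affects only the
    # numbering order of loci, not which rows are grouped together.
    order = sorted(range(len(rows)), key=lambda i: (rows[i][0], rows[i][1]))
    loci = [0] * len(rows)
    locus = 0
    current_chrom = None
    furthest_stop = None
    for i in order:
        chrom, start, stop = rows[i]
        if chrom != current_chrom or start > furthest_stop + slack:
            # New chromosome, or a gap before this interval: open a new locus.
            locus += 1
            current_chrom = chrom
            furthest_stop = stop
        else:
            # Interval touches the running chunk: extend the chunk if needed.
            furthest_stop = max(furthest_stop, stop)
        loci[i] = locus
    return loci


reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = [(r["CHR"], int(r["START"]), int(r["STOP"])) for r in reader]
loci = label_loci(rows)
print(loci)  # [1, 2, 2, 3, 3]
```

If the table is loaded with pandas instead, the returned list should be assignable as a new column, e.g. `df["locus"] = label_loci(list(zip(df["CHR"], df["START"], df["STOP"])))`, since the ids come back in the original row order.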