Entering edit mode
6.8 years ago
christopher medway
▴
460
I posted this question on stackoverflow, but it did not get a response. It is a bioinformatics query, so perhaps this is a better forum: Is there are pythonic equivalent to the R ranges operation below?
In R (albeit longwinded):
Here is a test data.frame
df <- data.frame(
"CHR" = c(1,1,1,2,2),
"START" = c(100, 200, 300, 100, 400),
"STOP" = c(150,350,400,500,450)
)
First I make GRanges object:
gr <- GenomicRanges::GRanges(
seqnames = df$CHR,
ranges = IRanges(start = df$START, end = df$STOP)
)
Then I reduce the intervals to collapse into new granges object:
reduced <- reduce(gr)
Now append a new column to original dataframe which confirms which rows belong to the same contiguous 'chunk'.
subjectHits(findOverlaps(gr, reduced))
Output:
> df
CHR START STOP locus
1 1 100 150 1
2 1 200 350 2
3 1 300 400 2
4 2 100 500 3
5 2 400 450 3
How do I do this in Python?
in what structure is your data stored in python?
The data is stored as a CSV to disk. As a python newbie, I guess I would load as a pandas table, but I am open to suggestions.