Quick Programming Challenge: Calculate Common And Unique Regions From A List Of Chromosome Segments
2
6
Entering edit mode
12.9 years ago

I have a files with data like this for various chromosomes in bp

File 1:
24328000-29166946
25388351-27114603
22310186-25239677
28511024-29638159
23729632-26385029


Solution in Perl code: My code reports no common region in this sections

File 2:
35388351-55114603
37310186-52396773
38511024-48638500
32729632-48638502
17446360-51119526


Solution using Perl code: (as of now it reports only common region):

38511024 -  48638500


Looking for a solution to find the unique and common regions in Perl or other languages.

programming • 3.9k views
2
Entering edit mode

Are you looking for exact matches between the 2 files? if so, use a set.

Are you looking for overlaps? if so try this in bx-python and see and see Istvan's examples in this thread.

1
Entering edit mode

why don't you write a doctest to show how the program should be called, and which outputs are expected for a certain input? http://docs.python.org/library/doctest.html

0
Entering edit mode

Thanks. The links looks interesting. I will check them. I moved the code to pastebin. I am not looking for exact match between files. I have various files like 1, 2...I am trying to identify the common regions and unique regions among different segments in a given file (one file at a time).

0
Entering edit mode

example of doctest: http://pastebin.com/s5knZ9VR is it correct?

0
Entering edit mode

@giovanni : thanks for the suggestion. You think we need a doctest here ? It is such a simple problem. I have many input files and I just showed two random input files that I have. I am looking for a general solution that can report unique and common region - if they are available. Looks like it is not a quick challenge, should I change the title of my question ?

0
Entering edit mode

The terms of this problem should be a bit better specified. I am unsure of what you mean by unique and common. Could you redefine the problem in terms of overlap of the intervals?

0
Entering edit mode

I don't understand the question. What is the 'common' region ? if no segments overlap what sould be the result (all unique ?) ? ... if all segments overlap but one ?, if A overlap B and B overlap C but C doesn't overlap A ? etc...

0
Entering edit mode

@Khader: Is it correct for 2 segments (lets say 1-5 and 3-7), that common segment is 3-5, and uniques are 1-3 on the 1st segment, and 5-7 on the 2nd?

0
Entering edit mode

@Istavan Thanks for the suggestion. Input file is a set of chromosome intervals. Output 1 : I need to find the overlap of intervals in bp between the different intervals. Output 2 :Report unique regions. Second part i have not implemented in my code yet.

0
Entering edit mode

@Pierre : If no overlap - all are unique. If there are partial overlap(thanks for pointing this) report as unique 1, unique 2 etc.

0
Entering edit mode

@Yuri : that's exactly what am looking for - but things get a bit complex when there is partial overlap as pointed by Pierre.

0
Entering edit mode

@khader: a doctest would be just a way to show what is the expected output, what is the solution you expect from an answer. Moreover, I would simplify the input files that you provide: instead of 24328000-29166946 I would use 10000-20000, which is a lot easier to read and thus easier to test. Have a look at the doctest I posted, there are three cases that you should define (e.g. are two segments overlapping when one begins on the same base where the other ends?)

0
Entering edit mode

let's see if later today I can answer this..

5
Entering edit mode
12.9 years ago
Casbon ★ 3.2k
import string, sys
data = map(string.split, sys.stdin)
starts = sorted(map(lambda x: int(x[0]), data))
ends = sorted(map(lambda x: int(x[-1]), data))
borders = sorted(starts + ends)

for (left, right) in zip(borders, borders[1:]):
covering_count = len([x for x in starts if x<= left]) - len([x for x in ends if x<= left])
print left, right, covering_count

$python so.py < input 22310186 23729632 1 23729632 24328000 2 24328000 25239677 3 25239677 25388351 2 25388351 26385029 3 26385029 27114603 2 27114603 28511024 1 28511024 29166946 2 29166946 29638159 1  ADD COMMENT 3 Entering edit mode maybe you should just type out what the answer should be for this example ADD REPLY 0 Entering edit mode Your code works, but am afraid the output is not the way that I expected. I am looking for output in 2 levels : Unique segment (if any) among the 5 regions in bp Commmon region (if any) among the 5 regions in bp ADD REPLY 0 Entering edit mode Added my current code to the question. It reports no common region, so i have added another example. ADD REPLY 0 Entering edit mode @Jeremy : Any interesting solution via R or any BioC modules that can do this ? ADD REPLY 0 Entering edit mode P.S. I made a small change to input file after @Casbon provided his solution. ADD REPLY 5 Entering edit mode 12.9 years ago Here is my interpretation in R library("IRanges") mytab<-read.table("file2.txt",sep="-",col.names=c("s","e")) myrange<-IRanges(start=mytab$s,end=mytab\$e)

#take whatever depth is given and print ranges with that depth
printRangesByDepth<-function(d){
pos<-which(coverage(myrange)==d)
write.table(as.data.frame(reduce(IRanges(pos,pos)))[,c(1,2)],row.names=FALSE,col.names=FALSE,sep="-")}

#which locations are common to all, i.e. have depth equal to sum of all ranges
cat("Common:\n")
printRangesByDepth(length(myrange))

#which locations are unique? depth of one
cat("Unique:\n")
printRangesByDepth(1)


file1:

Common:
Unique:
22310186-23729631
27114604-28511023
29166947-29638159


file2:

Common:
38511024-48638500
Unique:
17446360-32729631
52396774-55114603

0
Entering edit mode

Awesome ! I didn't know about IRanges. Thanks. (Please, Can I vote it up twice :) ?)