I have a list of splice junctions without further gene models. For simplicity all junctions in a set are on the same chromosome. I want to know the minimum number of transcripts that could explain the full set of junctions. A single tranascript is allowed to use any number of junctions so long as they don't overlap. For simplicity, the length of gaps between junctions ('exons') is not important at this stage.
For a non-overlapping set of junctions such as:-
chr start end
chr1 200 300
chr1 350 500
chr1 550 700
a single transcript using all junctions is the minimum (and the answer). I need an algorithm to calculate the minimum number of transcripts for overlapping sets of splice junctions.
e.g. for this set I think the answer should be 3
chr start end
chr1 200 500
chr1 200 300
chr1 250 500
chr1 350 500
For this set I think the answer should be 4
chr start end
chr1 200 500
chr1 200 300
chr1 250 500
chr1 350 500
chr1 350 600
chr1 550 700
I'm going to continue working on it, but if someone out there can get there first...
UPDATE & WARNING: I found a solution (below) for these simple cases. However, the general problem appears very close to the Set Cover Problem, which is NP-complete. This means that, not only does it become very hard to make the calculation, it is very hard to approximate how long the calculation will take for a particular set of data. I have added some examples of performance testing on relatively small sets of data to my TEST code to show this.
This might be simliar to the Set Cover Problem: http://en.wikipedia.org/wiki/Set_cover_problem