Question

Calculate Minimum Number Of Transcripts Required To Explain A Detected Set Of Splice Junctions.

0

Entering edit mode

10.7 years ago

Dave Gerrard ▴ 190

I have a list of splice junctions without further gene models. For simplicity all junctions in a set are on the same chromosome. I want to know the minimum number of transcripts that could explain the full set of junctions. A single tranascript is allowed to use any number of junctions so long as they don't overlap. For simplicity, the length of gaps between junctions ('exons') is not important at this stage.

For a non-overlapping set of junctions such as:-

chr start end

chr1   200 300

chr1   350 500

chr1   550 700

a single transcript using all junctions is the minimum (and the answer). I need an algorithm to calculate the minimum number of transcripts for overlapping sets of splice junctions.

e.g. for this set I think the answer should be 3

chr start end

chr1   200 500

chr1   200 300

chr1   250 500

chr1   350 500

For this set I think the answer should be 4

chr start end

chr1   200 500

chr1   200 300

chr1   250 500

chr1   350 500

chr1   350 600

chr1   550 700

I'm going to continue working on it, but if someone out there can get there first...

UPDATE & WARNING: I found a solution (below) for these simple cases. However, the general problem appears very close to the Set Cover Problem, which is NP-complete. This means that, not only does it become very hard to make the calculation, it is very hard to approximate how long the calculation will take for a particular set of data. I have added some examples of performance testing on relatively small sets of data to my TEST code to show this.

rnaseq splicing algorithm • 3.3k views

ADD COMMENT • link 10.6 years ago by Dave Gerrard ▴ 190

1

Entering edit mode

This might be simliar to the Set Cover Problem: http://en.wikipedia.org/wiki/Set_cover_problem

ADD REPLY • link 10.7 years ago by Dave Gerrard ▴ 190

score 0 · Answer 1 · 2013-09-09

My own answer. This was harder than I expected but I don't know yet if it needed to be and I suspect that someone can do it better.

I firstly calculated a matrix to describe which junctions overlapped each other (i.e. were incompatible). This was easy.

I then wrote a recursion function to find all paths using sets of compatible junctions. This was not easy (for me).

I wrapped the above in a master function, which was also necessary to clean up redundant paths. In future this will be extended so that I can put limits on the spacing of gaps (e.g. a minimum exon size).

The resulttant R code is at https://github.com/davetgerrard/utilsGerrardDT/blob/master/minPathNumber.R. Here is an example of how to run the main function on a set of defined junctions:-

source("minPathNumber.R")

my.juncs.simple <- data.frame(chr="chr1", start=c(200,300,400,500), end=c(250,350,450,550))
my.juncs <- data.frame(chr="chr1", start=c(200,200,250,350), end=c(500,300,500,500))
my.juncs.complex <- data.frame(chr="chr1", start=c(200,200,250,350,350,550), end=c(500,300,500,500,600,700))

minPathNumber(my.juncs.simple)        # should be 1

minPathNumber(my.juncs)        # should be 3

minPathNumber(my.juncs.complex)        # should be 4