Calculate Minimum Number Of Transcripts Required To Explain A Detected Set Of Splice Junctions.
1
0
Entering edit mode
10.7 years ago
Dave Gerrard ▴ 190

I have a list of splice junctions without further gene models. For simplicity all junctions in a set are on the same chromosome. I want to know the minimum number of transcripts that could explain the full set of junctions. A single tranascript is allowed to use any number of junctions so long as they don't overlap. For simplicity, the length of gaps between junctions ('exons') is not important at this stage.

For a non-overlapping set of junctions such as:-

chr start end

chr1   200 300

chr1   350 500

chr1   550 700

a single transcript using all junctions is the minimum (and the answer). I need an algorithm to calculate the minimum number of transcripts for overlapping sets of splice junctions.

e.g. for this set I think the answer should be 3

chr start end

chr1   200 500

chr1   200 300

chr1   250 500

chr1   350 500

For this set I think the answer should be 4

chr start end

chr1   200 500

chr1   200 300

chr1   250 500

chr1   350 500

chr1   350 600

chr1   550 700

I'm going to continue working on it, but if someone out there can get there first...

UPDATE & WARNING: I found a solution (below) for these simple cases. However, the general problem appears very close to the Set Cover Problem, which is NP-complete. This means that, not only does it become very hard to make the calculation, it is very hard to approximate how long the calculation will take for a particular set of data. I have added some examples of performance testing on relatively small sets of data to my TEST code to show this.

rnaseq splicing algorithm • 3.3k views
ADD COMMENT
1
Entering edit mode

This might be simliar to the Set Cover Problem: http://en.wikipedia.org/wiki/Set_cover_problem

ADD REPLY
0
Entering edit mode
10.7 years ago
Dave Gerrard ▴ 190

My own answer. This was harder than I expected but I don't know yet if it needed to be and I suspect that someone can do it better.

I firstly calculated a matrix to describe which junctions overlapped each other (i.e. were incompatible). This was easy.

I then wrote a recursion function to find all paths using sets of compatible junctions. This was not easy (for me).

I wrapped the above in a master function, which was also necessary to clean up redundant paths. In future this will be extended so that I can put limits on the spacing of gaps (e.g. a minimum exon size).

The resulttant R code is at https://github.com/davetgerrard/utilsGerrardDT/blob/master/minPathNumber.R. Here is an example of how to run the main function on a set of defined junctions:-

source("minPathNumber.R")

my.juncs.simple <- data.frame(chr="chr1", start=c(200,300,400,500), end=c(250,350,450,550))
my.juncs <- data.frame(chr="chr1", start=c(200,200,250,350), end=c(500,300,500,500))
my.juncs.complex <- data.frame(chr="chr1", start=c(200,200,250,350,350,550), end=c(500,300,500,500,600,700))

minPathNumber(my.juncs.simple)        # should be 1

minPathNumber(my.juncs)        # should be 3

minPathNumber(my.juncs.complex)        # should be 4
ADD COMMENT

Login before adding your answer.

Traffic: 1423 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6