Hey Guys

I have about million points per chromosome that are basically putative start and end regions of transcripts.

sample data here : http://pastebin.com/CqrtZpbE

My goal is to cluster them with some unsupervised learning algorithm like DBSCAN (open to suggestions) which can handle big datasets in O(nlogn) time if possible, treating each start/end point as X,Y coordinates. So in each tuple the first point could be treated as X and other as Y coordinate making this a 2 dimensional data.

Any ideas ? I need to make sure I dont overshoot the memory during distance calculation.

What I would like to get out is set of clusters with number of points in each which will give me a putative set of transcript (may be fragmented) from these points.

Keeping the biology aside I am looking for a scalable method of clustering points(x,y) in a 2 dimensional space.

============================================================

data snapshots :

A. I have added data visual snapshot from IGV (sequencing reads on a browser) to give everyone more clarity. If you could open the picture from the following link and follow my description below http://picpaste.com/pics/sense_1-9nCST8qa.1334264280.png

The green reads depict the 5 prime start and orange points to the 3-prime site shown by each read pair. I have drawn few possible groups which we would like to cluster(group) together. By this I dont mean to propose using cluster algo to do this but somehow extract the possible putative transcripts. The snapshot is for demo sake. It does contains PCR duplicates and putative transcripts are approximate to give you guys an idea. The reason start and end points have to be grouped together is they combinedly demarcate a putative transcript boundary. Grouping start and end points separately will result in loss of connectivity and for example in the picture transcript 2 and 3 boundaries will be difficult to ascertain.

Following is just one way I tried to cluster the same region show above.

B. Done using DBSCAN. I reduced the number of points by removing the PCR duplicates which anyways doesnt convey any biological meaning. So only uniq start and end points (start,end) were used for clustering.

http://picpaste.com/pics/sample_clustering_result-IMhAXNDS.1334266188.png

The x axis is the 5prime start of the transcript and Y axis is the 3 prime. From the plot for it seems like we have 7 clusters (7 possible putative transcripts) out of which one in the top corner is most prevalent.

We can then filter and test based on how many uniq points(start,end) are present in the each cluster. Basically each uniq point is a uniq clone sequenced for that region.

I hope this is not too much information and in case you have questions please feel free to ask questions. I appreciate your help until now.

==============================================================

Thanks! -Abh

Also, please explain the biological question that you had in mind.

Are you sure you need to cluster interval data? It is sort of 1-dimensional isn't it? IMHO clustering only makes sense on multi dimensional data. Please explain why you think clustering is an option.

@Michael - clustering can be done on 1-d data too, but yes - not sure what biological question it serves in this case.

@Michael : each start,end in the input data is putative transcript start and end coordinates per chromosome per strand. These are coming from a sequencing experiment. The goal is to cluster these into putative transcript start and end sites to give us the boundaries of transcripts. We can treat start,end as a X,Y coordinate on a chromosome scale and cluster the denser spots that way. makes sense ??

@Michael : I have also updated the question with a link to the sample data

Why not just convert your coordinates into a fake coverage file and filter for the highest covered region? I don't think separating start/end into two dimensions is a good ideal.

@DK : we are interested in finding the possible trasncripts start and end regions and not just look at the highest coverage ones. I am already filtering out any point with coverage < 3x before compiling a list of these points.

I don't see what clustering could do for you here.

So, what do your tuples represent, specifically? Are these putative exons or transcript start/end points? I see where you are going here, but we need more information.

@Sean : these are start/end points for a transcript. Since the transcripts can be fragmented (bias in the lib prep) or due to alternative splicing there can be many different transcripts of varied lengths / gene. We want to cluster these points together and see how they look. I would just take these as X and Y coordinates(x=start, y=end) and cluster in 2d. Not sure what would be a good scalable approach. I tried DBSCAN but I am hitting a memory bottleneck in python. Trying the same algo in R now

Nope, that's not a good idea IMO. You shouldn't choose a method before defining your problem sufficiently, and I don't believe you should insist on using clustering,when you are unable to describe benefits and have been advised against. In addition your problem is not 2dimensional but 1 dimensional (intervals) thus to be able to provide a good answer, you should first define your problem in detail and sufficiently, and not do the second step before the first.

@Michael : thanks for following up. May be I dint do a good job explaining the problem but as far as I can I think I have made the rationale behind clustering clear. We want to group together close (how close we can see based on few trials) start and end positions to distinguish different isoforms of a gene as each start and end position points to a putative transcript start and end site. Please let me if this is still not clear ..Aprreciate everyone's comments so far..thanks

There is a more easy way, I suppose. Think about your transcript as intervals. Intervals can be easily treated in the IRanges package in R. Also, intervals can be easily binned, and distance between them is easily inferred from position, overlaps can be computed and intervals joined. that is also why clustering start,end vectors doesn't make sense, because they are dependent.

So, please try to answer the following question: given a set of intervals of potential transcripts (from which tool,pipeline btw?), which formal criteria can be given to annotate a transcript in a certain location. If you can answer this, we can maybe make a script to solve the task.

@Micheal: The putative start and end of transcripts are obtained from a sequencing pipeline post mapping the reads to the reference genome. Read 1/Read 2 is coming either from 5 or 3 prime end and we can bin the apriori to take the direction into account. Also there could be many start and end points exactly same due to high coverage or may be bias in library construction (PCR etc). From this data we want to find out the possible boundaries of the transcripts and then compliment it with RNA-Seq.

So, your start and ends represent potential exons, it seems. And you want to cluster the exons into potential transcripts? Are you using an aligner that knows about splicing? If not, you should consider doing so. Then, your junction reads define the "connectedness" between exons.

Hi Sean : the start and end point denote the transcript 5 prime and 3 prime sites so the starting point at which the transcript start and the end point of transcription. It only demarcartes the boundary of a putative transcript and not care about the exons present.

This sounds like SAGE, CAGE, or smth similar.

@Sean, @Michael : I have attached data snapshots with some description. I hope this helps clarifies the problem. Thanks