I have about million points per chromosome that are basically putative start and end regions of transcripts.
sample data here : http://pastebin.com/CqrtZpbE
My goal is to cluster them with some unsupervised learning algorithm like DBSCAN (open to suggestions) which can handle big datasets in O(nlogn) time if possible, treating each start/end point as X,Y coordinates. So in each tuple the first point could be treated as X and other as Y coordinate making this a 2 dimensional data.
Any ideas ? I need to make sure I dont overshoot the memory during distance calculation.
What I would like to get out is set of clusters with number of points in each which will give me a putative set of transcript (may be fragmented) from these points.
Keeping the biology aside I am looking for a scalable method of clustering points(x,y) in a 2 dimensional space.
data snapshots :
A. I have added data visual snapshot from IGV (sequencing reads on a browser) to give everyone more clarity. If you could open the picture from the following link and follow my description below http://picpaste.com/pics/sense_1-9nCST8qa.1334264280.png
The green reads depict the 5 prime start and orange points to the 3-prime site shown by each read pair. I have drawn few possible groups which we would like to cluster(group) together. By this I dont mean to propose using cluster algo to do this but somehow extract the possible putative transcripts. The snapshot is for demo sake. It does contains PCR duplicates and putative transcripts are approximate to give you guys an idea. The reason start and end points have to be grouped together is they combinedly demarcate a putative transcript boundary. Grouping start and end points separately will result in loss of connectivity and for example in the picture transcript 2 and 3 boundaries will be difficult to ascertain.
Following is just one way I tried to cluster the same region show above.
B. Done using DBSCAN. I reduced the number of points by removing the PCR duplicates which anyways doesnt convey any biological meaning. So only uniq start and end points (start,end) were used for clustering.
The x axis is the 5prime start of the transcript and Y axis is the 3 prime. From the plot for it seems like we have 7 clusters (7 possible putative transcripts) out of which one in the top corner is most prevalent.
We can then filter and test based on how many uniq points(start,end) are present in the each cluster. Basically each uniq point is a uniq clone sequenced for that region.
I hope this is not too much information and in case you have questions please feel free to ask questions. I appreciate your help until now.