Clustering Million Points : Transcript Start And End Sites
3
3
Entering edit mode
12.1 years ago
Abhi ★ 1.6k

Hey Guys

I have about million points per chromosome that are basically putative start and end regions of transcripts.

sample data here : http://pastebin.com/CqrtZpbE

My goal is to cluster them with some unsupervised learning algorithm like DBSCAN (open to suggestions) which can handle big datasets in O(nlogn) time if possible, treating each start/end point as X,Y coordinates. So in each tuple the first point could be treated as X and other as Y coordinate making this a 2 dimensional data.

Any ideas ? I need to make sure I dont overshoot the memory during distance calculation.

What I would like to get out is set of clusters with number of points in each which will give me a putative set of transcript (may be fragmented) from these points.

Keeping the biology aside I am looking for a scalable method of clustering points(x,y) in a 2 dimensional space.

============================================================

data snapshots :

A. I have added data visual snapshot from IGV (sequencing reads on a browser) to give everyone more clarity. If you could open the picture from the following link and follow my description below http://picpaste.com/pics/sense_1-9nCST8qa.1334264280.png

The green reads depict the 5 prime start and orange points to the 3-prime site shown by each read pair. I have drawn few possible groups which we would like to cluster(group) together. By this I dont mean to propose using cluster algo to do this but somehow extract the possible putative transcripts. The snapshot is for demo sake. It does contains PCR duplicates and putative transcripts are approximate to give you guys an idea. The reason start and end points have to be grouped together is they combinedly demarcate a putative transcript boundary. Grouping start and end points separately will result in loss of connectivity and for example in the picture transcript 2 and 3 boundaries will be difficult to ascertain.

Following is just one way I tried to cluster the same region show above.

B. Done using DBSCAN. I reduced the number of points by removing the PCR duplicates which anyways doesnt convey any biological meaning. So only uniq start and end points (start,end) were used for clustering.

http://picpaste.com/pics/sample_clustering_result-IMhAXNDS.1334266188.png

The x axis is the 5prime start of the transcript and Y axis is the 3 prime. From the plot for it seems like we have 7 clusters (7 possible putative transcripts) out of which one in the top corner is most prevalent.

We can then filter and test based on how many uniq points(start,end) are present in the each cluster. Basically each uniq point is a uniq clone sequenced for that region.

I hope this is not too much information and in case you have questions please feel free to ask questions. I appreciate your help until now.

==============================================================

Thanks! -Abh

clustering • 5.0k views
ADD COMMENT
2
Entering edit mode

Also, please explain the biological question that you had in mind.

ADD REPLY
1
Entering edit mode

Are you sure you need to cluster interval data? It is sort of 1-dimensional isn't it? IMHO clustering only makes sense on multi dimensional data. Please explain why you think clustering is an option.

ADD REPLY
0
Entering edit mode

@Michael - clustering can be done on 1-d data too, but yes - not sure what biological question it serves in this case.

ADD REPLY
0
Entering edit mode

@Michael : each start,end in the input data is putative transcript start and end coordinates per chromosome per strand. These are coming from a sequencing experiment. The goal is to cluster these into putative transcript start and end sites to give us the boundaries of transcripts. We can treat start,end as a X,Y coordinate on a chromosome scale and cluster the denser spots that way. makes sense ??

ADD REPLY
0
Entering edit mode

@Michael : I have also updated the question with a link to the sample data

ADD REPLY
0
Entering edit mode

Why not just convert your coordinates into a fake coverage file and filter for the highest covered region? I don't think separating start/end into two dimensions is a good ideal.

ADD REPLY
0
Entering edit mode

@DK : we are interested in finding the possible trasncripts start and end regions and not just look at the highest coverage ones. I am already filtering out any point with coverage < 3x before compiling a list of these points.

ADD REPLY
0
Entering edit mode

I don't see what clustering could do for you here.

ADD REPLY
0
Entering edit mode

So, what do your tuples represent, specifically? Are these putative exons or transcript start/end points? I see where you are going here, but we need more information.

ADD REPLY
0
Entering edit mode

@Sean : these are start/end points for a transcript. Since the transcripts can be fragmented (bias in the lib prep) or due to alternative splicing there can be many different transcripts of varied lengths / gene. We want to cluster these points together and see how they look. I would just take these as X and Y coordinates(x=start, y=end) and cluster in 2d. Not sure what would be a good scalable approach. I tried DBSCAN but I am hitting a memory bottleneck in python. Trying the same algo in R now

ADD REPLY
0
Entering edit mode

Nope, that's not a good idea IMO. You shouldn't choose a method before defining your problem sufficiently, and I don't believe you should insist on using clustering,when you are unable to describe benefits and have been advised against. In addition your problem is not 2dimensional but 1 dimensional (intervals) thus to be able to provide a good answer, you should first define your problem in detail and sufficiently, and not do the second step before the first.

ADD REPLY
0
Entering edit mode

@Michael : thanks for following up. May be I dint do a good job explaining the problem but as far as I can I think I have made the rationale behind clustering clear. We want to group together close (how close we can see based on few trials) start and end positions to distinguish different isoforms of a gene as each start and end position points to a putative transcript start and end site. Please let me if this is still not clear ..Aprreciate everyone's comments so far..thanks

ADD REPLY
0
Entering edit mode

There is a more easy way, I suppose. Think about your transcript as intervals. Intervals can be easily treated in the IRanges package in R. Also, intervals can be easily binned, and distance between them is easily inferred from position, overlaps can be computed and intervals joined. that is also why clustering start,end vectors doesn't make sense, because they are dependent.

ADD REPLY
0
Entering edit mode

So, please try to answer the following question: given a set of intervals of potential transcripts (from which tool,pipeline btw?), which formal criteria can be given to annotate a transcript in a certain location. If you can answer this, we can maybe make a script to solve the task.

ADD REPLY
0
Entering edit mode

@Micheal: The putative start and end of transcripts are obtained from a sequencing pipeline post mapping the reads to the reference genome. Read 1/Read 2 is coming either from 5 or 3 prime end and we can bin the apriori to take the direction into account. Also there could be many start and end points exactly same due to high coverage or may be bias in library construction (PCR etc). From this data we want to find out the possible boundaries of the transcripts and then compliment it with RNA-Seq.

ADD REPLY
0
Entering edit mode

So, your start and ends represent potential exons, it seems. And you want to cluster the exons into potential transcripts? Are you using an aligner that knows about splicing? If not, you should consider doing so. Then, your junction reads define the "connectedness" between exons.

ADD REPLY
0
Entering edit mode

Hi Sean : the start and end point denote the transcript 5 prime and 3 prime sites so the starting point at which the transcript start and the end point of transcription. It only demarcartes the boundary of a putative transcript and not care about the exons present.

ADD REPLY
0
Entering edit mode

This sounds like SAGE, CAGE, or smth similar.

ADD REPLY
0
Entering edit mode

@Sean, @Michael : I have attached data snapshots with some description. I hope this helps clarifies the problem. Thanks

ADD REPLY
1
Entering edit mode
12.1 years ago

You might take a look at this post that uses the bx-python ClusterTrees.

ADD COMMENT
1
Entering edit mode

While this could certainly be useful I find this example way overcomplicated for the little it does, like reading the data into an interval tree structure.

ADD REPLY
0
Entering edit mode

I agree, Michael. The ClusterTree class could probably be implemented in R/Bioconductor easily using a combination of coverage and findOverlaps after loading the data into an appropriate Ranges object. I answered this way to redirect @Abhi to what I thought was the correct approach; the implementation details were unimportant (though having an implementation is always a good thing).

ADD REPLY
0
Entering edit mode

Thanks Sean..interesting coincidence. I came across bx-python yesterday and I think the methods available could be useful. I will test it you and let you guys know if this works. -A

ADD REPLY
1
Entering edit mode
12.1 years ago
g1ddriver ▴ 20

If your goal is to find consensus starting and ending points of some regions there is no need to process all of the points at the same time. In simple words, the points at one end of the chromosome don't carry any information for the regions at the other end of the chromosome. I don't see any need for 2-dimensions here either, you have clearly a 1-dimensional case.

The simplest way to solve the scalability problem is to slide a window across the chromosome and apply your DBSCAN-based clustering only on points within the window for starting (X) and ending (Y) points separately. The size of the window should be picked such that RAM constraints are met. Eventually, you will get a set of consensus starting and ending points (cluster centers). You can match X and Y consensus points afterwords based on the input pairs (X_i,Y_i), where X_i belongs to cluster X and Y_i cluster Y, if you need.

ADD COMMENT
0
Entering edit mode

@g1ddriver : thanks for your input. Few comments. I plan to cluster by chromosome and also the start and end points are dependent. Please allow me to explain. Each fragment is represented in the dataset by its start and end point which inherently builds a relation between them. They come from a single biological transcript fragment so clustering start and end separately will not make sense.

-Abhi

ADD REPLY
0
Entering edit mode

Abhi, please listen to advice people give you, you seem to insist you have to 'cluster', because..... yes only you possibly know the reason.

ADD REPLY
0
Entering edit mode

Hi Michael

Absolutely dint mean to ignore the advice given here although I believe it does seems like it. I guess I have also over used the term clustering. Once I get back to my station I will post a snapshot of the data on a browser to give everyone a visual. -A

ADD REPLY
0
Entering edit mode

Yes, let's try to get back to getting a good understanding of the problem first. Then we can figure out how best to solve it technically.

ADD REPLY
0
Entering edit mode

I have attached data snapshots with some description. I hope this helps clarifies the problem. Thanks

ADD REPLY
0
Entering edit mode

Hi Guys .. just thought I will remind you . I have uploaded the actual data snapshots on the main question above. I hope it will give everyone some clarity. Thanks

ADD REPLY
0
Entering edit mode
12.1 years ago

If you can covert your problem into a MapReduce data structure, you could use Apache Mahout for your clustering task. See clustering under quickstart for a short introduction.

ADD COMMENT

Login before adding your answer.

Traffic: 2810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6