Question

TSS Distance vs. ChIP-seq Tag Density K-means Clustering


10.9 years ago

kanwarjag ★ 1.2k

I am trying to perform k-means clustering on TSS distance vs. ChIP-seq tag density. My aim is to generate a heat map like the one shown in Fig. 2E of http://www.ncbi.nlm.nih.gov/pubmed/18992931

I generated tag densities with HOMER for 1000 bp on either side of each TSS (2 kb total) in 50 bp bins. This gave me a matrix that I load into Cluster 3 to perform k-means clustering; I can also use other tools for the clustering. However, all of them freeze and complain about memory. I am not very good with command-line tools. Having said that, I think one solution is to reduce the data in the matrix generated by HOMER. I have tried the filter tools in Cluster 3 but failed to reduce the data. Could someone suggest how I can reduce the data before performing k-means clustering? My reads are 50 bp and this factor is tightly bound around the TSS, so I selected 1 kb on either side of the TSS.

Thanks

map chipseq • 4.3k views

updated 10.7 years ago by Biostar 20 • written 10.9 years ago by kanwarjag ★ 1.2k


What are the specs of the machine you are running the software on? If all the software complains about memory, an easy fix is to add memory ;-) There are many solutions around for clustering big data. 2 kb split into 50 bp bins gives 40 regions. What is the number of TSSs in your matrix?

10.9 years ago by David ▴ 740


(See A: Why does the Homer tool find TSS sites for so many (41,478) genes?) HOMER reports 41,478 TSSs, so the matrix is 41,478 rows × 43 columns when I use 1000 bp around the TSS with 50 bp bins. I am using Windows 7, 64-bit, with 12 GB RAM and an i7 CPU. I also have access to an iMac.

updated 4.3 years ago by Ram 43k • written 10.9 years ago by kanwarjag ★ 1.2k

Ram · Answer 1 · 2013-05-13

k-means algorithms are very efficient, and there should be no problem clustering your data. I just ran it in R on a fake data set of the same size and it took only a few seconds.

> mat <- matrix(rnorm(n=43*41000), ncol=43)
# dimensions
> dim(mat)
[1] 41000    43
> r <- kmeans(mat, centers=3)
Warning message:
did not converge in 10 iterations
> r
K-means clustering with 3 clusters of sizes 13868, 13492, 13640

Cluster means:
[...]
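
The warning in the session above means `kmeans()` stopped at its default cap of 10 iterations. Below is a minimal sketch on the same kind of simulated matrix that raises `iter.max` and tries several random starts with `nstart`. The values 100 and 3 are illustrative assumptions, not settings from the original run. It also prints the in-memory size of the matrix, which is on the order of 13 MB for 41,000 x 43 doubles, far below the 12 GB mentioned in the question.

```r
# Simulated stand-in for the HOMER matrix: 41,000 TSS rows x 43 bin columns.
set.seed(1)
mat <- matrix(rnorm(n = 43 * 41000), ncol = 43)

# In-memory size of the matrix in megabytes (8 bytes per double plus a header).
size_mb <- as.numeric(object.size(mat)) / 1024^2
print(round(size_mb, 1))

# Allow more iterations and try several random starts;
# kmeans() returns the start with the lowest total within-cluster sum of squares.
# iter.max = 100 and nstart = 3 are illustrative values (assumptions).
fit <- kmeans(mat, centers = 3, iter.max = 100, nstart = 3)

# Cluster sizes, and the number of iterations used by the returned run.
print(fit$size)
print(fit$iter)
```

On pure noise like this the clusters carry no meaning; the point is only that a matrix of this size fits comfortably in memory in R.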

Anyway, my main concern here is that the protein in your ChIP-seq data appears to bind almost every TSS in the genome. Maybe you are dealing with DNase data, or the peak-calling algorithm had a problem.
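
One way to check the concern above, and to reduce the matrix as the question asks, is to drop TSS rows with little or no signal before clustering. The sketch below uses simulated counts rather than real HOMER output. The matrix dimensions, the peak shape, and the cutoff `min_tags` are all hypothetical choices for illustration, so a real cutoff would have to come from the distribution of row sums in your own data.

```r
set.seed(42)

# Hypothetical stand-in for a TSS x bin tag-count matrix (not real HOMER output):
# 2,000 TSS rows x 40 bins of low background counts.
n_tss <- 2000
n_bins <- 40
mat <- matrix(rpois(n_tss * n_bins, lambda = 0.2), nrow = n_tss)

# Give the first 500 rows a peak centred on the TSS (assumed shape).
peak <- round(10 * exp(-((seq_len(n_bins) - 20.5)^2) / 18))
mat[1:500, ] <- mat[1:500, ] + matrix(peak, nrow = 500, ncol = n_bins, byrow = TRUE)

# Keep only rows whose total tag count reaches an arbitrary cutoff (assumption).
min_tags <- 30
keep <- rowSums(mat) >= min_tags
filtered <- mat[keep, , drop = FALSE]
print(table(keep))

# Cluster the reduced matrix, then order rows by cluster for a heat map.
fit <- kmeans(filtered, centers = 3, iter.max = 100, nstart = 5)
ord <- order(fit$cluster)
heat <- filtered[ord, , drop = FALSE]

# Uncomment to draw a basic heat map with base graphics:
# image(t(heat[nrow(heat):1, ]), axes = FALSE)
```

If nearly every row survives a reasonable cutoff on real data, that supports the suspicion that the signal is not specific to a subset of TSSs.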