K-Means Clustering
10.6 years ago
Maria ▴ 10

I want to run k-means clustering on 5 ChIP-seq samples (a time series). I counted the number of tags in 1 kb windows genome-wide for each sample, so my input looks like this:

chr1  343000   344000   23    43  5   78    45
.
.
.

I would like to find the regions that gain the histone mark signal faster or slower. I have never done clustering and have some basic questions. What should my input be? If I provide a numeric matrix of tag counts (5 columns), how can I keep the coordinates during clustering?

PS: It would be a great help if somebody could show me a step-by-step tutorial on this kind of analysis.

10.6 years ago

The input would be the 5 columns of counts, or any similar metric that you want to use. You can keep the coordinates either by making them the row names (so "chr1:343000-344000" for the first row) or by passing only the count columns of the data frame to kmeans. The output from the kmeans function (such as $cluster) is in the same order as the input, so you don't have to worry about things getting rearranged. There are a number of nice tutorials on the web, such as this one here. BTW, you might try using something like seqMINER, which can do the clustering for you (though I've never used it).
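To make the "same order as the input" point concrete, here is a minimal sketch, written in Python rather than R and using only the standard library. Only the first region comes from the question; the other rows are made-up toy values. The small `kmeans` helper, its parameters (`k`, `n_iter`, `seed`) and the choice of k = 2 are assumptions for illustration, not what R's kmeans does internally.

```python
import random

def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means (illustrative helper, not a library API).

    Returns one cluster index per input row, in the same order as `rows`.
    """
    rng = random.Random(seed)
    centers = rng.sample(rows, k)          # pick k rows as starting centers
    assign = [None] * len(rows)
    for _ in range(n_iter):
        # Assign every row to its nearest center (squared Euclidean distance).
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in rows
        ]
        if new_assign == assign:           # nothing moved: converged
            break
        assign = new_assign
        # Recompute each center as the column-wise mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assign) if a == c]
            if members:                    # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# (chrom, start, end, counts for the 5 time points)
regions = [
    ("chr1", 343000, 344000, [23, 43, 5, 78, 45]),   # row from the question
    ("chr1", 344000, 345000, [2, 4, 9, 30, 60]),     # made-up toy row
    ("chr1", 345000, 346000, [50, 55, 60, 58, 61]),  # made-up toy row
    ("chr1", 346000, 347000, [1, 3, 8, 28, 55]),     # made-up toy row
]

# Coordinates become labels; only the numeric columns go into the clustering.
labels = [f"{chrom}:{start}-{end}" for chrom, start, end, _ in regions]
matrix = [counts for _, _, _, counts in regions]

clusters = kmeans(matrix, k=2)

# Output order matches input order, so labels and clusters line up one to one.
cluster_by_region = dict(zip(labels, clusters))
for label, cluster in cluster_by_region.items():
    print(label, cluster)
```

The coordinates never enter the distance calculation; they are reattached afterwards purely by position, which is the same reason row names or a subsetted data frame work in R.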


seqMINER will do the job for you. You provide a BED file of coordinates for the regions (X-axis) and a BED file of mapped reads for each condition, so five in your case. You can change the bin length in the options. It is recommended that the reads be normalised within seqMINER using the linear-normalisation setting.
