Question: Clustering of CNV genomic coordinates, take 2
16 months ago by
United States
Sakti370 wrote:

Dear Biostars,

After searching the internet for quite a while, I have yet to find an easy solution for clustering human genomic coordinates. This post asked the same question a couple of years ago, but there was no answer on how to cluster a BED file, graph it in IGV (or any of your favorite genome browsers), and make it look like this figure.

Here's the breakdown of the problem at hand:

Data type: Human CNV data detected by both array and sequencing. Output from these analyses is a .bed file with the CNV positions, similar to this:

chr    start    end    cnv_id    sample_name    sample_category

Clustering type: anything goes, from k-means to any other unsupervised method.

Question: Are there samples that preferentially cluster together because they share very similar CNV positions? Is this clustering of CNVs meaningful given the sample category (i.e. sick vs normal)?

I have read about CNVTools, which to my understanding needs probe intensities; I could never get iCluster to work; IGVTools doesn't have a clustering function; I'm unsure whether seqMINER or any other TSS/ChIP clustering algorithm will work with longer stretches of DNA sequence; and everything I have read about clustering methods in R revolves around single genes/values and not genomic coordinates.

That is why I appeal to the Biostars wisdom once more. I'd be grateful if someone could recommend a solution to this problem.



modified 16 months ago by Sean Davis25k • written 16 months ago by Sakti370

What data are you trying to cluster? What is the assay and what is the question you want to answer? Are you dealing with copy number data, or something else? Sequence-based, or array?

modified 16 months ago • written 16 months ago by Sean Davis25k

Hi Sean, thanks for commenting. I have updated the post with the answers to your questions.

written 16 months ago by Sakti370
16 months ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

I don't know of a general approach to these types of data, and you seem to be asking multiple questions of your data. That said, one approach you might find useful is to define a set of genomic "bins" across the genome and then build a SAMPLE x BIN matrix. Each cell of the matrix is TRUE (or 1) if the sample has a CNV that overlaps that genomic bin. Tools like bedtools or GenomicRanges might help with that task.
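For illustration only, here is a minimal pure-Python sketch of the binning idea. It assumes fixed-width bins, the six-column layout from the question (chr, start, end, cnv_id, sample_name, sample_category) and BED-style 0-based, half-open coordinates. The 1 Mb bin width and the names `read_bed` and `build_matrix` are my own placeholders, not part of bedtools or GenomicRanges.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed bin width in bp; tune to your CNV size distribution


def read_bed(lines):
    """Yield (chrom, start, end, sample) from whitespace-delimited BED-like lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, short lines and header rows (non-numeric start column).
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        yield fields[0], int(fields[1]), int(fields[2]), fields[4]


def build_matrix(records, bin_size=BIN_SIZE):
    """Return (samples, bins, matrix); matrix[i][j] is 1 if sample i has a CNV in bin j."""
    hits = defaultdict(set)  # sample -> set of (chrom, bin_index)
    for chrom, start, end, sample in records:
        # Half-open interval: the last covered base is end - 1.
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            hits[sample].add((chrom, b))
    samples = sorted(hits)
    # Only bins touched by at least one CNV are kept as columns.
    bins = sorted(set().union(*hits.values()))
    matrix = [[1 if b in hits[s] else 0 for b in bins] for s in samples]
    return samples, bins, matrix


# Toy input, made up for this example.
example = """chr start end cnv_id sample_name sample_category
chr1 100 2500000 cnv1 S1 sick
chr1 1500000 1800000 cnv2 S2 normal
chr2 0 500000 cnv3 S1 sick
""".splitlines()

samples, bins, matrix = build_matrix(read_bed(example))
for name, row in zip(samples, matrix):
    print(name, row)
```

Bins that no sample touches are dropped here, which keeps the matrix small but means the columns are not a full tiling of the genome. For real data, a window-based overlap with bedtools or GenomicRanges would likely be faster and less error-prone than this loop.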

From there, more standard matrix-based approaches are available for clustering and statistical testing.
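As a rough sketch of what "matrix-based approaches" could look like on such a binary matrix, the snippet below computes a Jaccard distance between two samples (one possible input to a clustering method) and a two-sided Fisher's exact test for a single bin against the sample category. The toy matrix, the category labels and the choice of Jaccard and Fisher are my assumptions for the example, not the only reasonable options.

```python
from math import comb


def jaccard_distance(a, b):
    """1 - |shared bins| / |bins hit in either sample| for two 0/1 rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Toy SAMPLE x BIN matrix and categories, made up for this example.
matrix = {"S1": [1, 1, 0], "S2": [1, 1, 0], "S3": [0, 0, 1], "S4": [0, 1, 1]}
category = {"S1": "sick", "S2": "sick", "S3": "normal", "S4": "normal"}

d_same = jaccard_distance(matrix["S1"], matrix["S2"])
d_diff = jaccard_distance(matrix["S1"], matrix["S3"])

# Test bin 0: rows are sick/normal, columns are CNV present/absent.
bin_idx = 0
sick = [matrix[s][bin_idx] for s in matrix if category[s] == "sick"]
normal = [matrix[s][bin_idx] for s in matrix if category[s] == "normal"]
p_value = fisher_exact_two_sided(
    sum(sick), len(sick) - sum(sick), sum(normal), len(normal) - sum(normal)
)
print(d_same, d_diff, p_value)
```

The full matrix of pairwise distances could then go into a hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`, if SciPy is available). When testing many bins, the p-values would need a multiple-testing correction.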

written 16 months ago by Sean Davis25k

Thanks a lot Sean! I was pondering the genomic bins solution, which seems to be what will work in the end for my data. Thanks!!

written 16 months ago by Sakti370
