Question

MuSiC BMR Clustering... Why?

1

11.1 years ago

Louis Letourneau ▴ 820

First off, I'm not an expert statistician, so please bear with me. I've been using MuSiC in my cancer analysis, and I was wondering what the point is of clustering samples with similar background mutation rates (BMRs) together at the calc-bmr stage.

The default is to cluster all BMRs together; I would have expected the default to be the opposite: keep all samples individual.

Is there a statistical reason to cluster, or is it just to make computation faster? Or, rephrased, which would make the most sense if computation time is not an issue: to cluster or not?

It is quite a bit faster to cluster everything together.

Thanks

music clustering cancer • 2.3k views

ADD COMMENT • link updated 11.1 years ago by Cyriac Kandoth 6.0k • written 11.1 years ago by Louis Letourneau ▴ 820

score 3 · Answer 1 · 2013-04-04

3

11.1 years ago

Cyriac Kandoth 6.0k

Yes, the computation is much faster with samples clustered. But much more importantly, MuSiC is primarily meant for cancer, where somatic mutations are usually very sparse, so we don't have enough of them to measure a decent BMR per sample. However, if you are dealing with ultra-mutated tumors like those seen in Melanoma, Colorectal, Endometrial, etc., then you might get decent results with separate clusters of samples, or with all samples individually. A large list of germline variants (preferably rare variants) might also be a good test case, though I've never tried that.

ADD COMMENT • link 11.1 years ago by Cyriac Kandoth 6.0k

1

It makes a lot of sense. But then the question is: how many clusters should you use? I guess you need to run the BMR part twice: first with each sample individually, then, after inspecting the BMR distribution, again with what you think is an acceptable cluster count. It would be nice to have MuSiC plot the BMR distribution of the samples to see whether clusters appear.

ADD REPLY • link 11.1 years ago by Louis Letourneau ▴ 820

0

Yeah, that's a good idea. But you don't really need to run the calc-bmr step for each sample individually. calc-covg outputs a file called total_covgs, which has the number of covered bps per sample. Use these as your denominators, and per-sample mutation counts from the MAF as your numerators, to measure mutations per Mbp for each sample. total_covgs also lists the number of covered bps at AT, CG, and CpG sites (per the reference sequence), if you're feeling adventurous.
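The ratio described above can be sketched in a few lines of Python. This is only an illustration, not part of MuSiC: the function name and the example numbers are made up, and it assumes you have already read the per-sample covered bps (from total_covgs) and the per-sample mutation counts (from the MAF) into two dictionaries keyed by sample name.

```python
def mutations_per_mbp(mutation_counts, covered_bps):
    """Return mutations per Mbp for every sample with nonzero coverage.

    mutation_counts: dict of sample name -> number of mutations in the MAF
    covered_bps:     dict of sample name -> covered bps (e.g. from total_covgs)
    """
    rates = {}
    for sample, bps in covered_bps.items():
        if bps <= 0:
            continue  # no coverage, so the rate is undefined for this sample
        # Samples with coverage but no mutations get a rate of 0.
        rates[sample] = mutation_counts.get(sample, 0) * 1e6 / bps
    return rates


# Hypothetical numbers, for illustration only.
covered = {"sample_A": 30_000_000, "sample_B": 25_000_000}
mutations = {"sample_A": 60, "sample_B": 500}
rates = mutations_per_mbp(mutations, covered)
```

Plotting the values in `rates` as a histogram should show whether the samples fall into distinct groups.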

We have a ton of visualizations we want to implement, and a per-sample mutation frequency plot is at the top of the list.

ADD REPLY • link 11.1 years ago by Cyriac Kandoth 6.0k

0

You can also ignore coverage and simply count the number of variants per sample in the MAF. Something like this: cut -f 16 tcga.maf | sort | uniq -c | sort -rn. Dump that into R and plot the distribution. If there are a few outliers, you might want to either exclude them or use the BMR clustering feature with --bmr-groups. If all samples have comparable variant counts, then there is no real reason to use the clustering feature.
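For anyone who would rather stay out of the shell and R, here is a rough Python sketch of the same count, plus a crude outlier check. It assumes the sample barcode sits in column 16 of a tab-delimited MAF (the same column the cut command uses). The function names and the "more than 5 times the median" cutoff are arbitrary choices for illustration, not anything MuSiC does. Unlike the shell one-liner, it skips the header line.

```python
from collections import Counter
import statistics


def variants_per_sample(maf_lines, sample_col=15):
    """Count variants per sample; sample_col is 0-based (15 = column 16)."""
    counts = Counter()
    for line in maf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= sample_col:
            continue  # malformed or truncated line
        sample = fields[sample_col]
        if sample == "Tumor_Sample_Barcode":
            continue  # header row
        counts[sample] += 1
    return counts


def flag_outliers(counts, fold=5.0):
    """Return samples whose variant count exceeds fold x the median count."""
    if not counts:
        return []
    cutoff = fold * statistics.median(counts.values())
    return sorted(s for s, c in counts.items() if c > cutoff)
```

Usage would be something like `with open("tcga.maf") as fh: counts = variants_per_sample(fh)`, followed by `flag_outliers(counts)`. The samples it returns are candidates for exclusion or for a separate BMR group.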

ADD REPLY • link 11.0 years ago by Cyriac Kandoth 6.0k