First off, I'm not an expert statistician, so bear with me please. I've been using MuSiC in my cancer analysis and I was wondering what was the point of clustering samples with similar BMRs together at the calc-bmr stage?
The default is to cluster all BMRs together. I would have expected the default to be the opposite: keep all samples separate.
Is there a statistical reason to cluster, or is it just to make computation faster? Or, rephrased: if computation time is not an issue, which makes the most sense, to cluster or not?
It is quite a bit faster to cluster everything together.
Thanks
That makes a lot of sense. But then the question is, how many clusters should you use? I guess you need to run the bmr part twice: once with each sample treated individually, then inspect the bmr distribution, then rerun it with what you think is an acceptable cluster count. It would be nice to have MuSiC plot the bmr distributions of the samples to see if clusters appear.
Yea, that's a good idea. But you don't really need to run the calc-bmr step for each sample individually. calc-covg will output a file called total_covgs which has the number of covered bps per-sample. Use these as your denominators, and per-sample mutation counts from the MAF as your numerators, to measure mutations per Mbp for each sample. total_covgs also lists the number of covered bps at AT, CG, CpG sites (per the reference sequence), if you're feeling adventurous.
We have a ton of visualizations we want to implement, and a per-sample mutation frequency plot is on the top of the list.
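If you'd rather script the mutations-per-Mbp calculation than do it by hand, here is a minimal Python sketch. It is not part of MuSiC. The function names are made up for illustration, and the total_covgs layout is an assumption (tab-separated, sample name in the first column, total covered bps in the second, comment lines starting with #), so check your own file before trusting the numbers. The sample ID is taken from MAF column 16.

```python
from collections import Counter


def count_mutations(maf_lines, sample_col=15):
    """Count MAF records per sample (column 16, i.e. zero-based index 15)."""
    counts = Counter()
    for line in maf_lines:
        row = line.rstrip("\n").split("\t")
        # Skip short/blank lines, '#' comments, and the MAF header row.
        if len(row) <= sample_col or row[0].startswith("#") or row[0] == "Hugo_Symbol":
            continue
        counts[row[sample_col]] += 1
    return counts


def read_covered_bps(covg_lines):
    """ASSUMED layout: sample name <TAB> covered bps <TAB> anything else."""
    covered = {}
    for line in covg_lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 2 or row[0].startswith("#"):
            continue
        covered[row[0]] = int(row[1])
    return covered


def mutations_per_mbp(counts, covered):
    """Mutations per million covered bps, for every sample with coverage."""
    return {
        sample: counts.get(sample, 0) / (bps / 1e6)
        for sample, bps in covered.items()
        if bps > 0
    }
```

To use it, pass open file handles (or lists of lines) for the MAF and for total_covgs to the first two functions, then feed both results to mutations_per_mbp. Samples with coverage but no MAF records come out as 0.0 rather than being dropped.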
You can also ignore coverage and simply enumerate the number of variants per-sample in the MAF. Something like this:
cut -f 16 tcga.maf | sort | uniq -c | sort -rn
Dump that into R, and plot the distribution. If there are a few outliers, you might want to either exclude them, or use the BMR clustering feature with --bmr-groups. If all samples have comparable variant counts, then there is no real reason to use the clustering feature.
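If you want a quick first pass before plotting, the sketch below parses the output of that one-liner and flags samples whose variant count sits far from the rest. To be clear, this is not how MuSiC groups samples. The outlier rule (log10 counts more than k scaled median absolute deviations from the median, with k = 3) is just one common heuristic I am assuming here, and the function names are hypothetical. It also assumes the header line shows up in the counts as Tumor_Sample_Barcode and skips it.

```python
import math
from statistics import median


def parse_uniq_counts(lines, header_name="Tumor_Sample_Barcode"):
    """Parse `uniq -c` output ('   123 SAMPLE') into {sample: count}."""
    counts = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        sample = parts[1].strip()
        if sample == header_name:  # the MAF header row, counted once by uniq
            continue
        counts[sample] = int(parts[0])
    return counts


def flag_outliers(counts, k=3.0):
    """Return samples whose log10 count is > k scaled MADs from the median.

    Heuristic only: 1.4826 * MAD approximates one standard deviation
    when the bulk of the data is roughly normal.
    """
    logs = {s: math.log10(c) for s, c in counts.items() if c > 0}
    if not logs:
        return []
    center = median(logs.values())
    mad = median(abs(v - center) for v in logs.values())
    if mad == 0:
        return []  # counts too uniform for this rule to say anything
    scale = 1.4826 * mad
    return sorted(s for s, v in logs.items() if abs(v - center) > k * scale)
```

Save the one-liner's output to a file and pass its lines to parse_uniq_counts, then call flag_outliers on the result. An empty list is consistent with comparable variant counts across samples. Anything it does flag is a candidate to look at in the plot before deciding between excluding the sample and using --bmr-groups.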