Are there recommended steps if MuSiC reports too many significantly mutated genes
2
5
6.5 years ago
Collin ▴ 1000

I tried testing MuSiC (v0.4) but got substantially more significantly mutated genes than expected. For my own pan-cancer dataset it returned thousands of genes, and for a test set of published ovarian somatic mutations it returned ~350. I used the default settings for these runs with calc-wig-covg, calc-bmr, and smg (so I didn't need the BAM files). I obtained the ovarian MAF file from synapse (https://www.synapse.org/#!Synapse:syn1729383), coverage wig files from firehose (recommended on this post), and the recommended ROI file (here). Is there anything I'm missing, or are there parameter tweaks or changes so that MuSiC reports a more reasonable number of significantly mutated genes?

I saw a couple of parameters that might be helpful. One was the --bmr-groups option in genome music bmr calc-bmr, which appears to group samples into a certain number of similarly mutated groups. Is there a recommended way to set up the number of BMR groups? Another was the --bmr-modifier-file option in genome music smg as a multiplication factor for the background mutation rate for certain genes. Is there a standard/recommended BMR modifier file?

music • 2.2k views
4
6.5 years ago

MuSiC's SMG test is very sensitive to false-positive mutations, especially recurrent ones like germline calls or alignment artifacts. So use super strict filters on your VCFs with this tool, and annotate with vcf2maf for common_variant tags based on ExAC.
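As a rough illustration of the common_variant filtering step, here is a minimal Python sketch that drops MAF rows tagged that way. It assumes a vcf2maf-style MAF where the tag is stored in a FILTER column, possibly alongside other tags separated by semicolons or commas. The function name and the tiny inline MAF are made up for the example, so check the column name and separator against your own file.

```python
import csv

def drop_common_variants(maf_text, tag="common_variant"):
    """Return MAF rows (as dicts) whose FILTER column does not contain `tag`.

    Assumption: the tag sits in a column named FILTER, alone or joined to
    other tags with ';' or ','. Adjust if your MAF is laid out differently.
    """
    # Skip blank lines and '#' comment lines such as '#version 2.4'.
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    kept = []
    for row in reader:
        tags = (row.get("FILTER") or "").replace(",", ";").split(";")
        if tag not in tags:
            kept.append(row)
    return kept

# Hypothetical three-mutation MAF, for illustration only.
example_maf = "\n".join([
    "#version 2.4",
    "Hugo_Symbol\tTumor_Sample_Barcode\tFILTER",
    "TP53\tS1\tPASS",
    "TTN\tS1\tcommon_variant",
    "MUC16\tS2\tPASS;common_variant",
])
kept_rows = drop_common_variants(example_maf)
print([r["Hugo_Symbol"] for r in kept_rows])
```

Only the TP53 row survives here, since the other two carry the tag.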

Genes in genomic regions that are somehow protected from somatic alteration are inevitably mutated in hypermutated tumor types like endometrial, melanoma, colorectal, etc. Since we're testing for "genes that are mutated more often than the background mutation rate (BMR)", such genes show up as significantly mutated genes (SMGs). So another quick fix is to simply exclude hypermutated samples from the MAF, say anything with more than 200 mutations. Also use a stricter --max-fdr for cancer types with higher overall BMRs. The 20% default was meant for AML, the TCGA cohort with the lowest BMRs.
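The hypermutator exclusion could be sketched in Python roughly like this. It assumes each mutation is a dict carrying the standard MAF Tumor_Sample_Barcode column. The function name is hypothetical, and the 200 cutoff is just the example threshold from above, not a tuned value.

```python
from collections import Counter

def drop_hypermutated(rows, max_mutations=200):
    """Remove every mutation from samples with more than `max_mutations` calls.

    Returns the filtered rows plus the per-sample counts, so you can see
    which samples were dropped.
    """
    counts = Counter(r["Tumor_Sample_Barcode"] for r in rows)
    keep = {sample for sample, n in counts.items() if n <= max_mutations}
    filtered = [r for r in rows if r["Tumor_Sample_Barcode"] in keep]
    return filtered, counts

# Made-up data: S1 has 3 mutations, S2 has 250 and gets excluded.
rows = ([{"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "S1"}] * 3
        + [{"Hugo_Symbol": "TTN", "Tumor_Sample_Barcode": "S2"}] * 250)
filtered, counts = drop_hypermutated(rows, max_mutations=200)
dropped = sorted(s for s, n in counts.items() if n > 200)
print(len(filtered), dropped)
```

Counting is done on the rows you pass in, so apply your other variant filters first if you want the cutoff to refer to the cleaned mutation list.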

If your results don't get better with the steps above, then read on below, to understand what's going on under the hood.

MuSiC's calc-bmr categorizes BMRs by transitions/transversions at CpG, CG, and AT sites, plus indels, and smg then tests each gene's categorized mutation rates against them. But we know these vary a lot between cancer types. There's a mutation spectrum from smoking, another from UV, another associated with altered genes like POLE or the mismatch repair genes, etc. So it makes sense to build different BMRs for different subsets of your pan-cancer dataset. A simple solution is to run MuSiC separately per cancer type. Or you could split samples by exposure type (e.g. smoking/UV), histologic subtype (e.g. lobular/ductal), mutation signatures (e.g. Alexandrov et al.), etc. The compromise is that each category/subcohort still needs enough mutations to build a decent, representative BMR. There are some arguments in MuSiC you can use to control such compromises. See the documentation for --bmr-groups in calc-bmr, and for --bmr-modifier-file in the smg test.
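To make the "split into subcohorts, but keep enough mutations per subcohort" idea concrete, here is a small Python sketch. The sample-to-group mapping, the function names, and the min_mutations value are all hypothetical. In practice you would build the mapping from your own clinical annotations or signature assignments, then write one MAF per group and run MuSiC on each.

```python
from collections import defaultdict

def split_by_group(rows, sample_to_group):
    """Partition MAF rows by a per-sample label (cancer type, exposure, ...)."""
    groups = defaultdict(list)
    for r in rows:
        label = sample_to_group.get(r["Tumor_Sample_Barcode"], "unassigned")
        groups[label].append(r)
    return dict(groups)

def underpowered_groups(groups, min_mutations):
    """List groups with too few mutations to estimate a BMR comfortably.

    `min_mutations` is an arbitrary illustrative cutoff, not a MuSiC setting.
    """
    return sorted(g for g, rs in groups.items() if len(rs) < min_mutations)

# Made-up example: two ovarian samples, one melanoma sample, one unlabeled.
rows = [
    {"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "OV-1"},
    {"Hugo_Symbol": "BRCA1", "Tumor_Sample_Barcode": "OV-2"},
    {"Hugo_Symbol": "BRAF", "Tumor_Sample_Barcode": "SKCM-1"},
    {"Hugo_Symbol": "TTN", "Tumor_Sample_Barcode": "X-1"},
]
sample_to_group = {"OV-1": "ovarian", "OV-2": "ovarian", "SKCM-1": "melanoma"}
groups = split_by_group(rows, sample_to_group)
small = underpowered_groups(groups, min_mutations=2)
print({g: len(rs) for g, rs in groups.items()}, small)
```

Groups flagged as underpowered are candidates for merging with a similar subcohort rather than being run on their own.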

P.S. I'm surprised it took 4 years for this question to come up on biostars! I know collaboration can be a bottleneck, but it doesn't hurt to ask. Kudos Collin.

0

Thank you for your informative response. Yes, my pan-cancer hypermutator filter was 1000 mutations, which is higher than your 200. Actually, the one parameter I did change was max FDR, to 0.1 (a common choice in cancer sequencing studies), but it looks like, for the reasons above, the expected number of false positives implied by the estimated FDR (false discovery rate) should be taken with a "grain of salt". I did filter out variants with read mappability warnings, but didn't filter on allele frequency. Do you have an intuition for how often actual germline variants pollute the called somatic variants in published studies? Unfortunately, since I'm doing a comparative analysis of methods, it only makes sense to move the FDR threshold together for all methods and evaluate the same set of pan-cancer mutations (as a whole, instead of broken up). I'm well aware that estimating mutation rates is a fickle thing (even in human evolution). In some respects, accurately understanding the uncertainty of mutation rate estimates is perhaps almost more important than getting a single good point estimate.

0

Correct. The FDR can't be used in the traditional sense, because the regional differences in BMRs add too much noise in the range/rank of per-gene p-values... or something like that.

ExAC is fairly recent, so a lot of previously generated somatic mutation lists, like those from TCGA/ICGC, did not have a decent panel-of-normals for germline filtering. Any kind of uneven coverage or allele-specific amplification bias can make a germline variant look like it's somatic.

0

From the MuSiC paper, it seems the convolution test is the most preferred, but an SMG is called when it is significant in at least 2 of 3 tests. In your opinion, do you think using solely the most conservative p-value method might be reasonable in my scenario (typically FCPT)? I also plan to do some parameter testing on the bmr groups.

0

Yea, you could try only FCPT, but it has horrible sensitivity. If you're comparing methods... then you can report results separately for "MuSiC FCPT", "MuSiC CT", "MuSiC 2of3"... something like that
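For reporting the variants side by side, something like the following Python sketch would do. It assumes you have already parsed the per-gene FDR for each of the three tests out of the smg output into a dict. The gene names and FDR values below are invented, and treating "FDR <= max_fdr" as significant is my assumption about the cutoff.

```python
TESTS = ("FCPT", "LRT", "CT")

def call_smgs(fdr_by_gene, max_fdr=0.2):
    """Return the SMG set for each single test and for the 2-of-3 rule."""
    calls = {name: set() for name in TESTS}
    calls["2of3"] = set()
    for gene, fdrs in fdr_by_gene.items():
        passed = [t for t in TESTS if fdrs[t] <= max_fdr]
        for t in passed:
            calls[t].add(gene)
        # A gene is a 2-of-3 SMG when at least two tests call it significant.
        if len(passed) >= 2:
            calls["2of3"].add(gene)
    return calls

# Invented FDR values, for illustration only.
fdr_by_gene = {
    "GENE_A": {"FCPT": 0.01, "LRT": 0.001, "CT": 0.002},
    "GENE_B": {"FCPT": 0.60, "LRT": 0.05, "CT": 0.08},
    "GENE_C": {"FCPT": 0.90, "LRT": 0.15, "CT": 0.40},
}
calls = call_smgs(fdr_by_gene, max_fdr=0.2)
summary = {name: sorted(genes) for name, genes in calls.items()}
print(summary)
```

In this toy input GENE_B passes LRT and CT but not FCPT, so it is an SMG under 2-of-3 and CT yet not under FCPT alone, which is the kind of difference the separate reports would expose.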

0

I asked because FCPT on the ovarian test data (where the synapse entry seems to basically match the suggested mutation filters) reports ~50 genes, down from ~350 for 2 of 3. The convolution test and LRT reported 355 and 495, respectively. So since both LRT and CT seem to be driving up the number of significant genes, the 2 of 3 scenario still reports many. I'm not against reporting "2of3" or "CT" etc., I'm just trying to see if I can get the "best" parameterization that is consistent across all comparison evaluations.

0
6.5 years ago
H.Hasani ▴ 990

Hi,

I see that you did not get an answer till now.

I'm facing a similar problem with MutSigCV, so I'm handling it both before and after using the tool, i.e. I filter my input in advance before calling, then filter the results afterwards. Moreover, I'm using other tools as well, so that I can compare the results too!

I do not know if that's the kind of answer you were hoping for, but I would be more than happy to hear new opinions!

Bests,

0

I'm trying to get both to run well, and hopefully compare them. Yes, I had a similar problem with MutSigCV when I used the recommended exome full coverage file. Since my data comprised several published studies that only report coding mutations, I had to mark all the "noncoding" coverage as zero. That resulted in a reasonable number of significantly mutated genes. So I know it's possible, even with my current data, for a significantly mutated gene method to report a reasonable result. But that doesn't seem to be the explanation for MuSiC, since I used the exact coverage information from firehose for testing on the ovarian data. Biostars is the official forum for MuSiC, so hopefully some of the MuSiC developers might respond.