Hi,
I am trying to use MuSiC to analyse mutation rates in novel, non-coding genes. I can run the relevant MuSiC commands successfully and the coverage statistics look correct, but the results show no mutations in any gene (which I know isn't true). My guess is that there is a formatting issue with the .maf file containing the somatic mutations, which is causing the output of "bmr calc-bmr" to be inaccurate.
Here are the first few lines of my .maf file:
#version 2.3
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID
Unknown 0 genome.wustl.edu GRCh37-lite 1 322115 322115 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 328193 328193 + Targeted_Region SNP A A G NA NA TCGA-E2-A15K TCGA-E2-A15K A A NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 384901 384901 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 390657 390657 + Targeted_Region SNP A A G NA NA TCGA-E2-A15K TCGA-E2-A15K A A NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Unknown 0 genome.wustl.edu GRCh37-lite 1 404577 404577 + Targeted_Region SNP G A G NA NA TCGA-E2-A15K TCGA-E2-A15K G G NA NA NA NA Unknown Unknown Somatic PhaseI WGS No NA NA Illumina f289e8b7-68db-48b9-8dcc-1349269eb54b c24945be-a051-4797-b7e6-09b32396f354
Here are the music commands that I am using:
genome music bmr calc-covg --bam-list /path/to/bam.list --output-dir /path/to/output_folder --reference-sequence /path/to/GRCh37-lite.fa --roi-file /path/to/gene_coordinates.bed
genome music bmr calc-bmr --bam-list /tcga/users/cdwarden/wgs/BRCA/MuSiC/bam.list --maf-file /path/to/somatic.maf --output-dir /path/to/output_folder --reference-sequence /path/to/GRCh37-lite.fa --roi-file /path/to/gene_coordinates.bed
genome music smg --gene-mr-file /path/to/gene_mrs --output-file /path/to/smgs
I have also tried adding the transcript ID to the first mutation in the .maf file (so that I would expect to see one mutation in the smgs_detailed file), but that gene still is reported to have 0 mutations.
Can you please help me troubleshoot this issue?
Thanks,
Charles
I think it's because the Hugo_Symbol values are "Unknown" in your MAF file.
I changed the transcript ID for the first mutation to match the corresponding gene, and that gene was still reported to not have any mutations. Also, I used "Unknown" (instead of NA, etc.) because that is what I thought the .maf format required for such genes.
Is there something else that should be changed besides "Unknown"?
I used this program a while back, and my understanding is that the gene names in the MAF file must match the gene names in the ROI file that you pass to calc-covg. Also, it will skip silent variants (based on the Variant_Classification column) unless you tell it not to. In your example, I see that most of the variants have Variant_Classification set to Unknown, which might be one reason.
This is correct. The Hugo_Symbol needs to be properly defined. These calls seem to be annotated incorrectly as Targeted_Region, which is something that MuSiC skips as intergenic. Considering that the MAF says WGS, these might be legitimately intergenic calls. Check in a genome browser.
Yes - I want to characterize mutation rates in ncRNAs (most of which will not be covered in exome designs, and many of which are novel).
What would you recommend for the Variant_Classification and Variant_Type, in this situation?
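Both points raised above (Hugo_Symbol values must match names in the ROI file, and some Variant_Classification values are skipped) can be checked before rerunning MuSiC. The following is only a rough sketch, not part of MuSiC: it assumes the MAF is tab-delimited, and that the ROI file is tab-delimited with the gene name in the fourth column. The function names are made up for illustration.

```python
import csv
import sys
from collections import Counter


def read_roi_names(roi_lines):
    """Return the set of gene names found in column 4 of the ROI lines."""
    names = set()
    for row in csv.reader(roi_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Assumption: columns are chrom, start, stop, gene name.
        if len(row) >= 4 and not row[0].startswith("#"):
            names.add(row[3])
    return names


def summarize_maf(maf_lines, roi_names):
    """Tally Variant_Classification values and Hugo_Symbols absent from the ROI file."""
    classes = Counter()
    unmatched = Counter()
    # Drop comment lines such as "#version 2.3" so the header row comes first.
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        classes[record["Variant_Classification"]] += 1
        if record["Hugo_Symbol"] not in roi_names:
            unmatched[record["Hugo_Symbol"]] += 1
    return classes, unmatched


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_maf.py gene_coordinates.bed somatic.maf
    with open(sys.argv[1]) as roi_handle:
        roi_names = read_roi_names(roi_handle)
    with open(sys.argv[2]) as maf_handle:
        classes, unmatched = summarize_maf(maf_handle, roi_names)
    print("Variant_Classification counts:", dict(classes))
    print("Hugo_Symbols not in ROI file:", dict(unmatched))
```

If every row shows up under "Unknown" in the second tally, or every row is Targeted_Region in the first, that would be consistent with the zero mutation counts described in this thread.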
You can refer to the documentation here. When you run music bmr calc-bmr, enable the option --noskip-non-coding. You'll still need to annotate each variant with a symbol that it can match back to a region in your ROI file. MAF format is not as detailed in distinguishing between ncRNA types. Variant_Classification will always say RNA. But name the genes differently using annotators like VEP, and you should be fine. Have you tried the maf2maf tool?
Thank you very much!
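For novel ncRNAs that an annotator such as VEP may not know about, one possible workaround is to label variants directly from the ROI coordinates. This is a minimal sketch and not a replacement for VEP or maf2maf. It assumes a tab-delimited MAF, a tab-delimited ROI file with chrom, start, stop and gene name in columns 1-4, 1-based inclusive coordinates in both files, matching chromosome names (e.g. "1" in both, not "chr1" in one), and that every ROI is an ncRNA so that RNA is an appropriate Variant_Classification. All function names are hypothetical.

```python
def load_rois(roi_lines):
    """Parse ROI lines into {chrom: [(start, stop, name), ...]}."""
    rois = {}
    for line in roi_lines:
        fields = line.rstrip("\r\n").split("\t")
        if line.startswith("#") or len(fields) < 4:
            continue
        chrom, start, stop, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        rois.setdefault(chrom, []).append((start, stop, name))
    return rois


def assign_symbol(chrom, start, stop, rois):
    """Return the name of the first ROI overlapping chrom:start-stop, else None."""
    # Linear scan: fine for a sketch, slow for very large ROI sets.
    for roi_start, roi_stop, name in rois.get(chrom, []):
        if start <= roi_stop and stop >= roi_start:
            return name
    return None


def annotate_maf(maf_lines, rois, classification="RNA"):
    """Yield MAF lines, relabelling variants that fall inside an ROI."""
    header = None
    for line in maf_lines:
        if line.startswith("#") or not line.strip():
            yield line  # pass comments and blank lines through unchanged
            continue
        fields = line.rstrip("\r\n").split("\t")
        if header is None:
            # First non-comment line is the column header.
            header = {name: i for i, name in enumerate(fields)}
            yield line
            continue
        name = assign_symbol(
            fields[header["Chromosome"]],
            int(fields[header["Start_Position"]]),
            int(fields[header["End_Position"]]),
            rois,
        )
        if name is not None:
            fields[header["Hugo_Symbol"]] = name
            fields[header["Variant_Classification"]] = classification
        yield "\t".join(fields) + "\n"
```

Variants outside every ROI are left untouched, so they keep whatever symbol and classification they had. The rewritten lines can be written to a new file (for example with `out.writelines(annotate_maf(maf_handle, rois))`) and passed to calc-bmr in place of the original MAF.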
I have also been wondering how to prioritize such intergenic/intronic SNVs.