problems with MAF for MutSigCV (vcf2maf)
2
1
Entering edit mode
8.3 years ago
A. Domingues ★ 2.6k

I am trying to run MutSIgCV and got stuck with this error:

MutSigCV allsamples.md.tc.ir.br.pr.ug.dbsnp.vep.maf \
"$anno"exome_full192.coverage.txt \ "$anno"gene.covariates.txt \
my_results \
"$anno"mutation_type_dictionary_file.txt \ "$anno"chr_files_hg19

======================================
MutSigCV
v1.4

(c) Mike Lawrence and Gaddy Getz
Broad Institute of MIT and Harvard
======================================

MutSigCV: PREPROCESS
--------------------
Error using MutSigCV>MutSig_preprocess (line 246)
MutSig is not applicable to single patients.\n

Error in MutSigCV (line 184)


I suspect this is because I did not create the maf file properly. A run-down of the experiment:

• variants were called using the GATK pipeline (unified genotyper)
• annotated with VEP
• vcf was converted to maf using the tool vcf2maf

I am not working with cancer data.

Questions:

• is the issue due to the maf?
• If yes, how can it be prepared to include patient information?

A few of the vcf (header removed, this is a test so only chr14 is present):

##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Psio10_gDNA Psio11_gDNA Psio12_gDNA Psio1_gDNA Psio2_gDNA Psio3_gDNA Psio4_gDNA Psio5_gDNA Psio6_gDNA Psio7_gDNA Psio8_gDNA Psio9_gDNA
chr14 62290315 . G A 27.75 LowQual AC=4;AF=0.250;AN=16;BaseCounts=4,0,9,0;BaseQRankSum=-1.271;DP=13;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=3;MLEAF=0.188;MQ=50.39;MQ0=0;MQRankSum=0.480;QD=13.88;ReadPosRankSum=-1.733;CSQ=intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,69 0/0:2,0:2:6:0,6,68 ./. ./. ./. ./. 1/1:0,1:1:3:32,3,0 0/0:1,0:1:3:0,3,32 1/1:0,1:1:3:33,3,0 0/0:1,1:2:3:0,3,33 0/0:2,0:2:3:0,3,33 0/0:1,0:1:3:0,3,33
chr14 62292953 . A T 10.92 LowQual AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,1;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=10.92;CSQ=intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AD:DP:GQ:PL ./. ./. 1/1:0,1:1:3:33,3,0 ./. ./. ./. ./. ./. ./. ./. ./. ./.
chr14 62293825 . G T 20.64 LowQual AC=2;AF=0.167;AN=12;BaseCounts=0,0,9,2;BaseQRankSum=-1.296;DP=11;Dels=0.00;FS=0.000;HaplotypeScore=0.1662;MLEAC=2;MLEAF=0.167;MQ=60.00;MQ0=0;MQRankSum=0.354;QD=4.13;ReadPosRankSum=0.354;CSQ=downstream_gene_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL ./. 0/0:.:2,0:2:6:0,6,68 ./. ./. ./. ./. 0/0:.:1,0:1:3:0,3,33 0/1:0.670:2,1:3:26:26,0,39 ./. 0/0:.:2,0:2:6:0,6,68 0/1:0.500:1,1:2:27:27,0,27 0/0:.:1,0:1:3:0,3,33
chr14 62294744 . G A 28.05 LowQual AC=3;AF=0.300;AN=10;BaseCounts=2,1,5,0;BaseQRankSum=-0.198;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=3;MLEAF=0.300;MQ=57.05;MQ0=0;MQRankSum=0.198;QD=9.35;ReadPosRankSum=0.198;CSQ=downstream_gene_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL ./. ./. 1/1:.:0,1:1:3:33,3,0 ./. 0/0:.:1,0:1:3:0,3,33 0/1:0.500:1,1:2:26:26,0,26 ./. 0/0:.:1,0:1:3:0,3,32 0/0:.:2,0:2:3:0,3,45 ./. ./. ./.
chr14 62299005 . A C 144.99 . AC=1;AF=0.042;AN=24;BaseCounts=301,8,0,0;BaseQRankSum=-4.723;DP=309;Dels=0.00;FS=2.199;HaplotypeScore=0.9147;InbreedingCoeff=-0.0435;MLEAC=1;MLEAF=0.042;MQ=59.70;MQ0=0;MQRankSum=0.508;QD=7.63;ReadPosRankSum=-1.488;CSQ=non_coding_exon_variant&nc_transcript_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|1/1||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL 0/0:.:26,0:26:78:0,78,892 0/0:.:29,0:29:81:0,81,927 0/0:.:17,0:17:45:0,45,528 0/0:.:28,0:28:75:0,75,861 0/0:.:25,0:25:69:0,69,777 0/0:.:35,0:35:99:0,102,1127 0/0:.:38,0:38:99:0,102,1149 0/1:0.580:11,8:19:99:180,0,294 0/0:.:24,0:24:66:0,66,764 0/0:.:33,0:33:93:0,93,1065 0/0:.:16,0:16:45:0,45,502 0/0:.:19,0:19:54:0,54,603

vcf2maf snp vcf MutSigCV gatk • 9.1k views
10
Entering edit mode
8.3 years ago

vcf2maf doesn't support multi-sample VCFs, like the sample you've shown. When --tumor-id is not specified, the resulting MAF will name all the samples as TUMOR... which is why MutSig thinks you have a single patient. Even though you're not working with cancer, it is perfectly fine to set --tumor-id to your sample ID, as long as it matches the corresponding column header in the VCF e.g. Psio10_gDNA. But you will first need to split your VCF into per-sample VCFs, run vcf2maf on each, concatenate the resulting MAFs, and feed that into MutSig. Here is a bit of code to help out:

# Download and unpack VCFtools. You don't need to compile it, because you'll only need the Perl utils for now:
tar -zxf vcftools_0.1.12b.tar.gz

# Use VCFtools' vcf-query to make a list of all the sample IDs in the multisample VCF:
perl -I vcftools_0.1.12b/perl vcftools_0.1.12b/perl/vcf-query --list-columns allsamples.vcf > sample_ids

# For each sample ID, run vcf-subset to create per-sample VCFs in a subfolder:
mkdir vcf2maf
cat sample_ids | perl -ne 'chomp; print cat allsamples.vcf | perl -I vcftools_0.1.12b/perl vcftools_0.1.12b/perl/vcf-subset --exclude-ref --columns $_ > vcf2maf/$_.vcf'

# For each VCF, run vcf2maf with the --tumor-id specified, to create per-sample MAFs into the subfolder:
cat sample_ids | perl -ne 'chomp; print perl vcf2maf.pl --input-vcf vcf2maf/$_.vcf --output-maf vcf2maf/$_.vep.maf --tumor-id \$_'

# Concatenate the per-sample MAFs together, making sure that the MAF header is not duplicated:
cat vcf2maf/*.maf | egrep "^#|^Hugo_Symbol" | head -2 > allsamples.vep.maf
cat vcf2maf/*.maf | egrep -v "^#|^Hugo_Symbol" >> allsamples.vep.maf


In addition to MutSig, you can also try the SMG test from the MuSiC suite of tools developed at my former lab. It is not cancer specific, and can also find significantly altered regulatory regions, based on how you define your regions of interest.

2
Entering edit mode

Thanks for the reply Cyriac. I was going to post your email an an answer tomorrow, but since you have done it I can't give you the rep of an accepted answer :)

One thing that I noticed while doing the conversion: For the purposes of running MutSigCV, vcf2maf needs to be run with the option --input-vcf, and vep (annotations version 74) must in the path. I was running a previously annotated vcf (using --input-vep) and the annotations were lost during the conversion.

As for Music, I will definitely give a go.

1
Entering edit mode

Thanks for the upvote! VEP can be run with many different options, and I haven't tested vcf2maf with all the different combinations that you can feed into --input-vep. It is always safest to let vcf2maf run VEP itself, with the runtime parameters it uses.

0
Entering edit mode

Hello Cyriac! Thank you so much for your work and such detailed explanation, another upvote;) I am not sure if that's correct to ask here but it's related to the topic and particularly to the @fridaymeetssunday last comment. I am also trying to convert my vcf's to maf for MutSigCV. And the problem is: my data are in vcf 4.1 format and apparently VEP only accept vcf 4.0. Didn't you come across such cases? I was trying different combinations like annotating with snpEff both via vcf2maf and separately but the annotation info is always lost in the conversion. And the only thing I could do with VEP is to input data in VEP default format but then the output is not vcf either... Thank you!

0
Entering edit mode

Can you post a new question on Biostars with the problematic VCF line? There is not much different between VCF 4.0 and 4.1, except maybe in the support for structural variants. There's only so much VEP can do with structural variants.

0
Entering edit mode

Hello, can I annotate my vcf with VEP by hand first then vcf2maf skip annotation? Because I can run VEP annotate, but vcf2maf.pl will fail at running VEP annotation stage. Maybe the reason is using different perl but I was under same env and PERL5LIB.

1
Entering edit mode
8.3 years ago
poisonAlien ★ 3.1k

I am not sure how MutSig would be helpful to you, given you are not working on Cancer data. Also error clearly states that MutSig is not applicable to single patients. MutSig works on somatic variations from multiple samples to identify possible drivers, so you need to have multiple patients/samples. If you have but not included them in maf, you can do so by adding patient/sample name to the column Tumor_Sample_Barcode.

0
Entering edit mode

@poisonAlien you are right I am not working with cancer samples, but I am working with a complex disease and I have data for several patients. So when looking for possible drivers of a disease, whether cancer or not, it should be applicable (if I understood the paper correctly).

And thanks for the tip regarding the Tumor_Sample_Barcode column.