Question: problems with MAF for MutSigCV (vcf2maf)
1
gravatar for A. Domingues
3.3 years ago by
A. Domingues1.5k
Mainz, Germany
A. Domingues1.5k wrote:

I am trying to run MutSIgCV and got stuck with this error:
###
MutSigCV allsamples.md.tc.ir.br.pr.ug.dbsnp.vep.maf \
"$anno"exome_full192.coverage.txt \
"$anno"gene.covariates.txt \
my_results \
"$anno"mutation_type_dictionary_file.txt \
"$anno"chr_files_hg19

======================================
MutSigCV
v1.4

(c) Mike Lawrence and Gaddy Getz
Broad Institute of MIT and Harvard
======================================

MutSigCV: PREPROCESS
--------------------
Loading mutation_file...
Error using MutSigCV>MutSig_preprocess (line 246)
MutSig is not applicable to single patients.\n

Error in MutSigCV (line 184)
######

I suspect this is because I did not create the maf file properly. A run-down of the experiment:
- variants were called using the GATK pipeline (unified genotyper)
- annotated with VEP
- vcf was converted to maf using the tool vcf2maf (https://github.com/ckandoth/vcf2maf)

I am not working with cancer data.

Questions:

  • is the issue due to the maf?
  • If yes, how can it be prepared to include patient information?

A few of the vfc (header removed, this is a test so only chr14 is present):

##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Psio10_gDNA Psio11_gDNA Psio12_gDNA Psio1_gDNA Psio2_gDNA Psio3_gDNA Psio4_gDNA Psio5_gDNA Psio6_gDNA Psio7_gDNA Psio8_gDNA Psio9_gDNA
chr14 62290315 . G A 27.75 LowQual AC=4;AF=0.250;AN=16;BaseCounts=4,0,9,0;BaseQRankSum=-1.271;DP=13;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=3;MLEAF=0.188;MQ=50.39;MQ0=0;MQRankSum=0.480;QD=13.88;ReadPosRankSum=-1.733;CSQ=intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,69 0/0:2,0:2:6:0,6,68 ./. ./. ./. ./. 1/1:0,1:1:3:32,3,0 0/0:1,0:1:3:0,3,32 1/1:0,1:1:3:33,3,0 0/0:1,1:2:3:0,3,33 0/0:2,0:2:3:0,3,33 0/0:1,0:1:3:0,3,33
chr14 62292953 . A T 10.92 LowQual AC=2;AF=1.00;AN=2;BaseCounts=0,0,0,1;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=10.92;CSQ=intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AD:DP:GQ:PL ./. ./. 1/1:0,1:1:3:33,3,0 ./. ./. ./. ./. ./. ./. ./. ./. ./.
chr14 62293825 . G T 20.64 LowQual AC=2;AF=0.167;AN=12;BaseCounts=0,0,9,2;BaseQRankSum=-1.296;DP=11;Dels=0.00;FS=0.000;HaplotypeScore=0.1662;MLEAC=2;MLEAF=0.167;MQ=60.00;MQ0=0;MQRankSum=0.354;QD=4.13;ReadPosRankSum=0.354;CSQ=downstream_gene_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL ./. 0/0:.:2,0:2:6:0,6,68 ./. ./. ./. ./. 0/0:.:1,0:1:3:0,3,33 0/1:0.670:2,1:3:26:26,0,39 ./. 0/0:.:2,0:2:6:0,6,68 0/1:0.500:1,1:2:27:27,0,27 0/0:.:1,0:1:3:0,3,33
chr14 62294744 . G A 28.05 LowQual AC=3;AF=0.300;AN=10;BaseCounts=2,1,5,0;BaseQRankSum=-0.198;DP=8;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=3;MLEAF=0.300;MQ=57.05;MQ0=0;MQRankSum=0.198;QD=9.35;ReadPosRankSum=0.198;CSQ=downstream_gene_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL ./. ./. 1/1:.:0,1:1:3:33,3,0 ./. 0/0:.:1,0:1:3:0,3,33 0/1:0.500:1,1:2:26:26,0,26 ./. 0/0:.:1,0:1:3:0,3,32 0/0:.:2,0:2:3:0,3,45 ./. ./. ./.
chr14 62299005 . A C 144.99 . AC=1;AF=0.042;AN=24;BaseCounts=301,8,0,0;BaseQRankSum=-4.723;DP=309;Dels=0.00;FS=2.199;HaplotypeScore=0.9147;InbreedingCoeff=-0.0435;MLEAC=1;MLEAF=0.042;MQ=59.70;MQ0=0;MQRankSum=0.508;QD=7.63;ReadPosRankSum=-1.488;CSQ=non_coding_exon_variant&nc_transcript_variant|||ENSG00000258956|COX4I1P1|ENST00000554239|1/1||||processed_pseudogene,intron_variant&nc_transcript_variant|||ENSG00000258882|CTD-2277K2.1|ENST00000554138|||||lincRNA GT:AB:AD:DP:GQ:PL 0/0:.:26,0:26:78:0,78,892 0/0:.:29,0:29:81:0,81,927 0/0:.:17,0:17:45:0,45,528 0/0:.:28,0:28:75:0,75,861 0/0:.:25,0:25:69:0,69,777 0/0:.:35,0:35:99:0,102,1127 0/0:.:38,0:38:99:0,102,1149 0/1:0.580:11,8:19:99:180,0,294 0/0:.:24,0:24:66:0,66,764 0/0:.:33,0:33:93:0,93,1065 0/0:.:16,0:16:45:0,45,502 0/0:.:19,0:19:54:0,54,603

mutsigcv vcf2maf snp gatk vcf • 3.8k views
ADD COMMENTlink modified 3.3 years ago by Cyriac Kandoth4.7k • written 3.3 years ago by A. Domingues1.5k
5
gravatar for Cyriac Kandoth
3.3 years ago by
Cyriac Kandoth4.7k
Memorial Sloan Kettering, New York, USA
Cyriac Kandoth4.7k wrote:

vcf2maf doesn't support multi-sample VCFs, like the sample you've shown. When --tumor-id is not specified, the resulting MAF will name all the samples as TUMOR... which is why MutSig thinks you have a single patient. Even though you're not working with cancer, it is perfectly fine to set --tumor-id to your sample ID, as long as it matches the corresponding column header in the VCF e.g. Psio10_gDNA. But you will first need to split your VCF into per-sample VCFs, run vcf2maf on each, concatenate the resulting MAFs, and feed that into MutSig. Here is a bit of code to help out:

# Download and unpack VCFtools. You don't need to compile it, because you'll only need the Perl utils for now:

wget http://downloads.sourceforge.net/project/vcftools/vcftools_0.1.12b.tar.gz
tar -zxf vcftools_0.1.12b.tar.gz

# Use VCFtools' vcf-query to make a list of all the sample IDs in the multisample VCF:

perl -I vcftools_0.1.12b/perl vcftools_0.1.12b/perl/vcf-query --list-columns allsamples.vcf > sample_ids

# For each sample ID, run vcf-subset to create per-sample VCFs in a subfolder:

mkdir vcf2maf
cat sample_ids | perl -ne 'chomp; print `cat allsamples.vcf | perl -I vcftools_0.1.12b/perl vcftools_0.1.12b/perl/vcf-subset --exclude-ref --columns $_ > vcf2maf/$_.vcf`'

# For each VCF, run vcf2maf with the --tumor-id specified, to create per-sample MAFs into the subfolder:

cat sample_ids | perl -ne 'chomp; print `perl vcf2maf.pl --input-vcf vcf2maf/$_.vcf --output-maf vcf2maf/$_.vep.maf --tumor-id $_`'

# Concatenate the per-sample MAFs together, making sure that the MAF header is not duplicated:

cat vcf2maf/*.maf | egrep "^#|^Hugo_Symbol" | head -2 > allsamples.vep.maf
cat vcf2maf/*.maf | egrep -v "^#|^Hugo_Symbol" >> allsamples.vep.maf

In addition to MutSig, you can also try the SMG test from the MuSiC suite of tools developed at my former lab. It is not cancer specific, and can also find significantly altered regulatory regions, based on how you define your regions of interest.

ADD COMMENTlink modified 17 months ago • written 3.3 years ago by Cyriac Kandoth4.7k
2

Thanks for the reply Cyriac. I was going to post your email an an answer tomorrow, but since you have done it I can't give you the rep of an accepted answer :)

One thing that I noticed while doing the conversion: For the purposes of running MutSigCV, vcf2maf needs to be run with the option --input-vcf, and vep (annotations version 74) must in the path. I was running a previously annotated vcf (using --input-vep) and the annotations were lost during the conversion.

As for Music, I will definitely give a go.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by A. Domingues1.5k
1

Thanks for the upvote! VEP can be run with many different options, and I haven't tested vcf2maf with all the different combinations that you can feed into --input-vep. It is always safest to let vcf2maf run VEP itself, with the runtime parameters it uses.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Cyriac Kandoth4.7k

Hello Cyriac! Thank you so much for your work and such detailed explanation, another upvote;) I am not sure if that's correct to ask here but it's related to the topic and particularly to the @ fridaymeetssunday last comment. I am also trying to convert my vcf's to maf for MutSigCV. And the problem is: my data are in vcf 4.1 format and apparently VEP only accept vcf 4.0. Didn't you come across such cases? I was trying different combinations like annotating with snpEff both via vcf2maf and separately but the annotation info is always lost in the conversion. And the only thing I could do with VEP is to input data in VEP default format but then the output is not vcf either... Thank you! 

ADD REPLYlink written 2.6 years ago by Eugenie0

Can you post a new question on Biostars with the problematic VCF line? There is not much different between VCF 4.0 and 4.1, except maybe in the support for structural variants. There's only so much VEP can do with structural variants.

ADD REPLYlink written 2.6 years ago by Cyriac Kandoth4.7k
1
gravatar for poisonAlien
3.3 years ago by
poisonAlien2.5k
Asgard
poisonAlien2.5k wrote:

I am not sure how MutSig would be helpful to you, given you are not working on Cancer data. Also error clearly states that  MutSig is not applicable to single patients. MutSig works on somatic variations from multiple samples to identify possible drivers, so you need to have multiple patients/samples. If you have but not included them in maf, you can do so by adding patient/sample name to the column 'Tumor_Sample_Barcode'.

ADD COMMENTlink written 3.3 years ago by poisonAlien2.5k

@poisionAlien you are right I am not working with cancer samples, but I am working with a complex disease and I have data for several patients. So when looking for possible drivers of a disease, whether cancer or not, it should be applicable (if I understood the paper correctly).

And thanks for the tip regarding the "Tumor_Sample_Barcode" column.

 

ADD REPLYlink written 3.3 years ago by A. Domingues1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 607 users visited in the last hour