Question: How to summarize variant data in multiple VCF files?
Jokhe wrote, 3.9 years ago:

I have analyzed normal-tumor samples with VarScan2, and I now have annotated variant data in VCF format:

chr1    27107650    .   TA  T   .   PASS    DP=543;SS=1;SSC=3;GPV=1.5919E-53;SPV=4.6338E-1;ANNOVAR_DATE=2016-02-01;Func.refGene=UTR3;Gene.refGene=ARID1A;GeneDetail.refGene=NM_006015:c.*404delA,NM_139135:c.*404delA;ExonicFunc.refGene=.;AAChange.refGene=.;snp142=rs533673675;CLINSIG=.;CLNDBN=.;CLNACC=.;CLNDSDB=.;CLNDSDBID=.;cosmic70=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ALLELE_END    GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:56:30:16:34.78%:30,0,16,0 0/1:.:487:232:135:36.78%:194,38,91,44
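For reference, the semicolon-delimited INFO field and the colon-delimited sample columns in a record like the one above can be pulled apart with a few lines of code. This is a minimal Python sketch (the values are copied from the record above, with the INFO string shortened for readability):

```python
# Minimal sketch: parse an ANNOVAR-annotated INFO field and one VarScan2
# sample column from the record above into dictionaries.
info = ("DP=543;SS=1;SSC=3;Func.refGene=UTR3;Gene.refGene=ARID1A;"
        "ExonicFunc.refGene=.;snp142=rs533673675;ALLELE_END")
info_dict = {}
for kv in info.split(";"):
    if "=" in kv:                      # skip flag-style entries like ALLELE_END
        key, value = kv.split("=", 1)
        info_dict[key] = value

# The FORMAT keys line up with the colon-separated values in each sample column
fmt = "GT:GQ:DP:RD:AD:FREQ:DP4"
tumor = "0/1:.:487:232:135:36.78%:194,38,91,44"
tumor_dict = dict(zip(fmt.split(":"), tumor.split(":")))

print(info_dict["Gene.refGene"])  # ARID1A
print(tumor_dict["FREQ"])         # 36.78%
```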

I would like to visualize my results with a coMut plot. To do this I need to somehow summarize my variant data. I have successfully generated coMut plots by summarizing the data by hand in the following format:

Patient Gene   Effect  ...
A       APC    synonymous
A       BRCA1  synonymous
B       BRCA2  frameshift deletion
B       KEAP1  nonsynonymous
C       MDM2   NA
C       PALB2  NA

However, summarizing the data by hand takes a huge amount of time, so I am now looking for ways to do this with command-line tools. Do you have any suggestions for how this could be done, or are there tools that can generate the desired output files (or files in a very similar format)?

I am going to do the visualization with the ggplot2 package in R, so my only requirement (or wish) for the output format is that it should be easy to handle in R.

Thank you in advance!


Not a complete answer, but by chance I saw something very similar to what you ask for, made by this tool: (scroll down a bit). That can probably give you some pointers and ideas on how to go forward.

— written 3.9 years ago by WouterDeCoster
Jorge Amigo (Santiago de Compostela, Spain) wrote, 3.9 years ago:

I'm not sure what summary rationale you have in mind, especially considering that you seem to be dealing with multi-sample VCF files (both normal and tumor). Perhaps you would like to filter for novel tumor variants first, and then pick the annotation by genetic-function precedence. If that is the case, you would need to code something like this:

echo -e "Patient\tGene\tEffect" > summary.txt
for file in *.vcf; do
perl -ne '
BEGIN { @funcs = ("frameshift insertion", "frameshift deletion", "frameshift block substitution", "stopgain", "stoploss", "nonframeshift insertion", "nonframeshift deletion", "nonframeshift block substitution", "nonsynonymous SNV", "synonymous SNV", "unknown", ".") }
# capture gene, exonic function, normal genotype and tumor genotype
if (/;Gene\.refGene=([^;]+).+;ExonicFunc\.refGene=([^;]+).+\t(\d.\d):\S+\t(\d.\d):/) {
  next if $3 eq $4; # remove this if you do not want to filter by novel tumor variants
  $geneFunc{$1}{$2} = 1;
} END {
  # for each gene, report only the highest-precedence function observed
  foreach $gene (sort keys %geneFunc) {
    foreach $func (@funcs) {
      if ( exists $geneFunc{$gene}{$func} ) {
        print "$gene\t$func\n"; last;
}}}}' $file | sed "s/^/$file\t/" >> summary.txt
done
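If a Perl one-liner is hard to maintain, the same precedence logic can be sketched in Python. `summarize_vcf` here is a hypothetical helper (not from any library), and it assumes the normal sample column comes before the tumor sample column, as in the example record in the question:

```python
# Sketch: keep tumor-novel variants and report, per gene, only the
# highest-precedence ANNOVAR exonic function (mirrors the Perl above).

# Precedence order, most to least severe (same as the Perl @funcs array)
FUNCS = [
    "frameshift insertion", "frameshift deletion",
    "frameshift block substitution", "stopgain", "stoploss",
    "nonframeshift insertion", "nonframeshift deletion",
    "nonframeshift block substitution", "nonsynonymous SNV",
    "synonymous SNV", "unknown", ".",
]

def summarize_vcf(lines):
    gene_funcs = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        gene = info.get("Gene.refGene", ".")
        func = info.get("ExonicFunc.refGene", ".")
        normal_gt = fields[9].split(":")[0]   # assumes normal is column 10
        tumor_gt = fields[10].split(":")[0]   # and tumor is column 11
        if normal_gt == tumor_gt:             # keep tumor-novel genotypes only
            continue
        gene_funcs.setdefault(gene, set()).add(func)
    rows = []
    for gene in sorted(gene_funcs):
        for func in FUNCS:                    # first hit = highest precedence
            if func in gene_funcs[gene]:
                rows.append((gene, func))
                break
    return rows
```

Run it once per patient file and prepend the filename (or patient ID) to each row to get the Patient/Gene/Effect table the question asks for.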
lincoln.harris (Chan Zuckerberg Biohub, San Francisco) wrote, 9 months ago:

If you want to summarize VCF entries at the gene level and predict the peptide-level consequences of the SNPs and indels found in your sample, we have a tool called cerebra that's been designed for this task.

It takes in any number of VCF files and produces a summary table that encodes the gene, Ensembl translation IDs, and the amino-acid changes associated with each variant.
