Question: How to summarize variant data in multiple VCF files?
Jokhe wrote, 3.9 years ago:

I have analyzed normal-tumor samples with VarScan2, and I now have annotated variant data in VCF format:

chr1    27107650    .   TA  T   .   PASS    DP=543;SS=1;SSC=3;GPV=1.5919E-53;SPV=4.6338E-1;ANNOVAR_DATE=2016-02-01;Func.refGene=UTR3;Gene.refGene=ARID1A;GeneDetail.refGene=NM_006015:c.*404delA,NM_139135:c.*404delA;ExonicFunc.refGene=.;AAChange.refGene=.;snp142=rs533673675;CLINSIG=.;CLNDBN=.;CLNACC=.;CLNDSDB=.;CLNDSDBID=.;cosmic70=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ALLELE_END    GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:56:30:16:34.78%:30,0,16,0 0/1:.:487:232:135:36.78%:194,38,91,44
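For reference, the semicolon-delimited INFO field and the colon-delimited sample columns in a record like the one above can be pulled apart with a few lines of code. This is a minimal Python sketch (the values are copied from the record above, with the INFO string shortened for readability):

```python
# Minimal sketch: parse an ANNOVAR-annotated INFO field and one VarScan2
# sample column from the record above into dictionaries.
info = ("DP=543;SS=1;SSC=3;Func.refGene=UTR3;Gene.refGene=ARID1A;"
        "ExonicFunc.refGene=.;snp142=rs533673675;ALLELE_END")
info_dict = {}
for kv in info.split(";"):
    if "=" in kv:                      # skip flag-style entries like ALLELE_END
        key, value = kv.split("=", 1)
        info_dict[key] = value

# The FORMAT keys line up with the colon-separated values in each sample column
fmt = "GT:GQ:DP:RD:AD:FREQ:DP4"
tumor = "0/1:.:487:232:135:36.78%:194,38,91,44"
tumor_dict = dict(zip(fmt.split(":"), tumor.split(":")))

print(info_dict["Gene.refGene"])  # ARID1A
print(tumor_dict["FREQ"])         # 36.78%
```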

I would like to visualize my results with a coMut plot. To do this I need to somehow summarize my variant data. I have successfully generated coMut plots by summarizing the data by hand in the following format:

Patient Gene   Effect  ...
A       APC    synonymous
A       BRCA1  synonymous
B       BRCA2  frameshift deletion
B       KEAP1  nonsynonymous
C       MDM2   NA
C       PALB2  NA

However, summarizing the data by hand takes a huge amount of time, so I am now looking for ways to do this with command-line tools. Do you have any suggestions for how this could be done, or are there tools that can generate the desired output files (or files in a very similar format)?

I am going to do the visualization with the ggplot2 package in R, so my only requirement (or wish) for the output format is that it should be easy to handle in R.

Thank you in advance!


Not a complete answer, but by chance I saw something very similar to what you ask for, made by this tool: (scroll down a bit). That can probably give you some pointers and ideas on how to go forward.

— written 3.9 years ago by WouterDeCoster
Jorge Amigo (Santiago de Compostela, Spain) wrote, 3.9 years ago:

I'm not sure what summary rationale you have in mind, especially considering that you seem to be dealing with multi-sample VCF files (both normal and tumor). Perhaps you would like to filter for novel tumor variants first, and then pick the annotation by genetic-function precedence. If that is the case, you would need to code something like this:

echo -e "Patient\tGene\tEffect" > summary.txt
for file in *.vcf; do
perl -ne '
BEGIN { @funcs = ("frameshift insertion", "frameshift deletion", "frameshift block substitution", "stopgain", "stoploss", "nonframeshift insertion", "nonframeshift deletion", "nonframeshift block substitution", "nonsynonymous SNV", "synonymous SNV", "unknown", ".") }
# capture gene, exonic function, normal genotype and tumor genotype
if (/;Gene\.refGene=([^;]+).+;ExonicFunc\.refGene=([^;]+).+\t(\d.\d):\S+\t(\d.\d):/) {
  next if $3 eq $4; # remove this if you do not want to filter by novel tumor variants
  $geneFunc{$1}{$2} = 1;
} END {
  # for each gene, report only the highest-precedence function observed
  foreach $gene (sort keys %geneFunc) {
    foreach $func (@funcs) {
      if ( exists $geneFunc{$gene}{$func} ) {
        print "$gene\t$func\n"; last;
}}}}' $file | sed "s/^/$file\t/" >> summary.txt
done
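If a Perl one-liner is hard to maintain, the same precedence logic can be sketched in Python. `summarize_vcf` here is a hypothetical helper (not from any library), and it assumes the normal sample column comes before the tumor sample column, as in the example record in the question:

```python
# Sketch: keep tumor-novel variants and report, per gene, only the
# highest-precedence ANNOVAR exonic function (mirrors the Perl above).

# Precedence order, most to least severe (same as the Perl @funcs array)
FUNCS = [
    "frameshift insertion", "frameshift deletion",
    "frameshift block substitution", "stopgain", "stoploss",
    "nonframeshift insertion", "nonframeshift deletion",
    "nonframeshift block substitution", "nonsynonymous SNV",
    "synonymous SNV", "unknown", ".",
]

def summarize_vcf(lines):
    gene_funcs = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        gene = info.get("Gene.refGene", ".")
        func = info.get("ExonicFunc.refGene", ".")
        normal_gt = fields[9].split(":")[0]   # assumes normal is column 10
        tumor_gt = fields[10].split(":")[0]   # and tumor is column 11
        if normal_gt == tumor_gt:             # keep tumor-novel genotypes only
            continue
        gene_funcs.setdefault(gene, set()).add(func)
    rows = []
    for gene in sorted(gene_funcs):
        for func in FUNCS:                    # first hit = highest precedence
            if func in gene_funcs[gene]:
                rows.append((gene, func))
                break
    return rows
```

Run it once per patient file and prepend the filename (or patient ID) to each row to get the Patient/Gene/Effect table the question asks for.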
lincoln.harris (Chan Zuckerberg Biohub, San Francisco) wrote, 9 months ago:

If you want to summarize VCF entries at the gene level and predict the peptide-level consequences of the SNPs and indels found in your sample, we have a tool called cerebra that's been designed for this task.

It takes in any number of VCF files and produces a summary table that encodes the gene, Ensembl translation IDs, and the amino-acid changes associated with each variant.
