Question: How to summarize variant data in multiple VCF files?
Jokhe110 wrote:

I have analyzed normal-tumor samples with VarScan2 and now have annotated variant data in VCF format:

##fileformat=VCFv4.1
##source=VarScan2
 ...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
chr1    27107650    .   TA  T   .   PASS    DP=543;SS=1;SSC=3;GPV=1.5919E-53;SPV=4.6338E-1;ANNOVAR_DATE=2016-02-01;Func.refGene=UTR3;Gene.refGene=ARID1A;GeneDetail.refGene=NM_006015:c.*404delA,NM_139135:c.*404delA;ExonicFunc.refGene=.;AAChange.refGene=.;snp142=rs533673675;CLINSIG=.;CLNDBN=.;CLNACC=.;CLNDSDB=.;CLNDSDBID=.;cosmic70=.;ExAC_ALL=.;ExAC_AFR=.;ExAC_AMR=.;ExAC_EAS=.;ExAC_FIN=.;ExAC_NFE=.;ExAC_OTH=.;ExAC_SAS=.;ALLELE_END    GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:56:30:16:34.78%:30,0,16,0 0/1:.:487:232:135:36.78%:194,38,91,44
...

I would like to visualize my results with a coMut plot. To do this I need to summarize my variant data somehow. I have successfully generated coMut plots by summarizing the data by hand in the following format:

Patient Gene   Effect  ...
A       APC    synonymous
A       BRCA1  synonymous
B       BRCA2  frameshift deletion
B       KEAP1  nonsynonymous
C       MDM2   NA
C       PALB2  NA
...

However, summarizing the data by hand takes a huge amount of time, so I am now looking for ways to do this with command line tools. Do you have any suggestions on how this could be done, or are there tools that can generate output files in this format or a very similar one?

I am going to do the visualization with the ggplot2 package (R), so my only requirement (or wish) is that the output format should be easy to handle in R.

Thank you in advance!

modified 3.1 years ago by Jorge Amigo11k • written 3.1 years ago by Jokhe110

Not a complete answer, but by chance I saw something very similar to what you ask for made by this tool: https://github.com/griffithlab/GenVisR#waterfall-mutation-overview-graphic (scroll down a bit). Probably that can give you some pointers and ideas on how to go forward.

modified 3.1 years ago • written 3.1 years ago by WouterDeCoster42k
Jorge Amigo11k wrote:

I'm not sure what summary rationale you have in mind, especially considering that you seem to be dealing with multi-sample VCF files (both normal and tumor). Maybe you would like to filter first for novel tumor variants, and then parse the annotation by genetic function precedence. If that is the case, then you would need to code something similar to this:

echo -e "Patient\tGene\tEffect" > summary.txt
for file in *.vcf; do
perl -ne '
# exonic functions ordered from most to least severe
BEGIN { @funcs = ("frameshift insertion", "frameshift deletion", "frameshift block substitution", "stopgain", "stoploss", "nonframeshift insertion", "nonframeshift deletion", "nonframeshift block substitution", "nonsynonymous SNV", "synonymous SNV", "unknown", ".") }
# capture the gene, its exonic function, and the NORMAL and TUMOR genotypes
if (/;Gene\.refGene=([^;]+).*;ExonicFunc\.refGene=([^;]+).+\t(\d.\d):\S+\t(\d.\d):/) {
  next if $3 eq $4; # remove this if you do not want to filter by novel tumor variants
  ($gene, $func) = ($1, $2);
  $func =~ tr/_/ /; # in case the annotation writes spaces as underscores
  $geneFunc{$gene}{$func} = 1;
} END {
  # print only the most severe function seen for each gene
  foreach $gene (sort keys %geneFunc) {
    foreach $func (@funcs) {
      if ( exists $geneFunc{$gene}{$func} ) {
        print "$gene\t$func\n"; last;
}}}}' "$file" | sed "s/^/$file\t/" >> summary.txt
done
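
For comparison, here is a minimal Python sketch of the same idea: skip variants whose genotype is identical in normal and tumor, then keep the most severe exonic function per gene. It is not part of the original one-liner. The function names `summarize_vcf_lines` and `write_summary` are hypothetical, and it assumes that NORMAL and TUMOR are the 10th and 11th columns, that GT is the first FORMAT field (as in the example line in the question), and that underscores in `ExonicFunc.refGene` stand for spaces.

```python
from pathlib import Path

# ANNOVAR exonic functions, most to least severe (same order as the Perl list).
PRECEDENCE = [
    "frameshift insertion", "frameshift deletion", "frameshift block substitution",
    "stopgain", "stoploss",
    "nonframeshift insertion", "nonframeshift deletion",
    "nonframeshift block substitution",
    "nonsynonymous SNV", "synonymous SNV", "unknown", ".",
]
RANK = {name: i for i, name in enumerate(PRECEDENCE)}


def summarize_vcf_lines(lines, tumor_only=True):
    """Return {gene: most severe effect} for the lines of one VCF file."""
    best = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # assumption: NORMAL and TUMOR are columns 10 and 11
        # INFO is a ';'-separated list of key=value pairs (flags are ignored)
        info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
        normal_gt = cols[9].split(":")[0]   # assumption: GT is the first field
        tumor_gt = cols[10].split(":")[0]
        if tumor_only and normal_gt == tumor_gt:
            continue  # same genotype in both samples, not a novel tumor variant
        gene = info.get("Gene.refGene")
        # assumption: underscores in the annotation stand for spaces
        effect = info.get("ExonicFunc.refGene", ".").replace("_", " ")
        if gene is None or effect not in RANK:
            continue
        if gene not in best or RANK[effect] < RANK[best[gene]]:
            best[gene] = effect
    return best


def write_summary(vcf_dir, out_path):
    """Write a Patient/Gene/Effect table, one block of rows per *.vcf file."""
    with open(out_path, "w") as out:
        out.write("Patient\tGene\tEffect\n")
        for vcf in sorted(Path(vcf_dir).glob("*.vcf")):
            with open(vcf) as handle:
                best = summarize_vcf_lines(handle)
            for gene in sorted(best):
                # the file name without extension serves as the patient label
                out.write(f"{vcf.stem}\t{gene}\t{best[gene]}\n")
```

Calling `write_summary(".", "summary.txt")` in the directory holding the VCF files would produce a tab-separated table that `read.delim("summary.txt")` can load in R.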
written 3.1 years ago by Jorge Amigo11k