Tool: VCF-simplify: a VCF simplification tool.
3 months ago, kirannbishwa01 (United States) wrote:

VCF-simplify.v2

A Python parser that simplifies a VCF file into a table-like format. https://github.com/everestial/VCF-simplify


There are several tools available to manipulate and alter VCF files. But a simple and comprehensive tool that produces the plain output empirical biologists need is still missing.

This tool takes a sorted VCF file and reports a simplified table of the INFO and FORMAT fields for each SAMPLE of interest. By default (with minimal options), all INFO and FORMAT fields for all SAMPLEs are simplified. The fields can be narrowed down further with command-line options. See the examples given below.

The output table can be written in either "long" or "wide" format, which makes it simple to mine the data by sample or by position. The output can be filtered further downstream with awk, or loaded into R and used with tidyr and dplyr, where columns can be accessed by matching names, prefixes, or suffixes.
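For readers who stay in Python rather than R, the prefix-based column access described above can be sketched with the standard library alone. This is only an illustration: the helper name `columns_with_prefix` is hypothetical, and the small table is a subset of the columns from the wide example output further down in this post.

```python
import csv
import io

# A subset of columns from the wide example output in this post (tab-separated).
WIDE_TABLE = (
    "CHROM\tPOS\tREF\tALT\tFILTER\tMA605_GT\tMA605_PG\tms01e_GT\tms01e_PG\n"
    "2\t15881018\tG\tA,C\tPASS\tG/G\t0/0\t./.\t./.\n"
    "2\t15881224\tT\tG\tPASS\tT/T\t0/0\t./.\t./.\n"
)


def columns_with_prefix(rows, prefix, keep=("CHROM", "POS")):
    """Keep the `keep` columns plus every column whose name starts with `prefix`."""
    return [
        {name: value for name, value in row.items()
         if name in keep or name.startswith(prefix)}
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(WIDE_TABLE), delimiter="\t"))
ma605 = columns_with_prefix(rows, "MA605_")
for row in ma605:
    print(row)
```

The same idea works for suffixes by swapping `startswith` for `endswith`, for example to collect the `_GT` column of every sample.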

Prerequisites:

Python packages and modules:


Usage (using the given input test data):

Call for available options

python3 vcf_simplify-v2.py --help


If no options are provided, all INFO and FORMAT fields are reported for all SAMPLEs

python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt


Report wide output and "GT" as nucleotide bases

python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode wide --samples MA605,ms01e --gtbase yes

Expected output

CHROM   POS REF ALT FILTER  AF  AN  BaseQRankSum    ClippingRankSum MA605_PI    MA605_GT    MA605_PG    ms01e_PI    ms01e_GT    ms01e_PG
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 .   G/G 0/0 .   ./. ./.
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 .   A/A 0/0 .   ./. .
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 .   C/C 0/0 .   ./. .
2   15881156    A   G   PASS    0.5 6   None    None    .   A/A 0/0 .   ./. .
2   15881224    T   G   PASS    0.036   12  1.75    0.0 .   T/T 0/0 .   ./. ./.
2   15881229    C   G   PASS    0.308   10  None    None    .   C/C 0/0 .   ./. ./.
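To make the "GT as nucleotide bases" column above concrete, here is a minimal sketch of how a numeric genotype can be mapped onto REF and ALT alleles. It is not taken from the VCF-simplify source; the function name `gt_to_bases` is hypothetical, and the mapping simply follows the VCF convention that allele index 0 is REF and indices 1, 2, ... are the comma-separated ALT alleles.

```python
import re


def gt_to_bases(gt, ref, alt):
    """Translate a numeric GT such as '0/1' into bases such as 'G/A'.

    Index 0 is REF; indices 1, 2, ... are the comma-separated ALT alleles.
    Phase separators ('/' or '|') and missing alleles ('.') are kept as-is.
    """
    alleles = [ref] + alt.split(",")
    # The capturing group keeps the separators in the split result.
    parts = re.split(r"([/|])", gt)
    return "".join(
        alleles[int(part)] if part.isdigit() else part
        for part in parts
    )


print(gt_to_bases("0/0", "G", "A,C"))
print(gt_to_bases("1|2", "G", "A,C"))
print(gt_to_bases("./.", "G", "A,C"))
```

With the first record above (REF G, ALT A,C), a GT of 0/0 becomes G/G, which matches the MA605_GT column in the expected output.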


Report simplified output (all available fields) for samples MA605 and ms01e

python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --samples MA605,ms01e

Expected output

CHROM   POS ID  REF ALT QUAL    FILTER  AF  BaseQRankSum    ClippingRankSum DP  DS  END ExcessHet   FS  HaplotypeScore  InbreedingCoeff MLEAC   MLEAF   MQ  MQRankSum   QD  RAW_MQ  ReadPosRankSum  SOR set SF  AC  AN  MA605_AD    MA605_DP    MA605_GQ    MA605_GT    MA605_MIN_DP    MA605_PGT   MA605_PID   MA605_PL    MA605_RGQ   MA605_SB    MA605_PG    MA605_PB    MA605_PI    MA605_PM    MA605_PW    MA605_PC    ms01e_AD    ms01e_DP    ms01e_GQ    ms01e_GT    ms01e_MIN_DP    ms01e_PGT   ms01e_PID   ms01e_PL    ms01e_RGQ   ms01e_SB    ms01e_PG    ms01e_PB    ms01e_PI    ms01e_PM    ms01e_PW    ms01e_PC
2   15881018    .   G   A,C 5082.45 PASS    1.0 -0.771  0.0 902 None    None    0.005   0.0 None    0.8 12,1    0.462,0.038 60.29   0.0 33.99   None    0.26    0.657   HignConfSNPs    0,1,2,3,4,5,6   2,0 8   3,0,0   3   9   0/0 None    None    None    0,9,112,9,112,112   None    None    0/0 .   .   .   0/0 .   0,0 0   .   ./. None    None    None    0,0,0,.,.,. None    None    ./. .   .   .   ./. .
2   15881080    .   A   G   4336.44 PASS    0.458   -0.732  0.0 729 None    None    0.01    0.0 None    0.826   11  0.458   60.0    0.0 34.24   None    -0.414  0.496   HignConfSNPs    4,5,6   0   6   5,0 5   15  0/0 None    None    None    0,15,181    None    None    0/0 .   .   .   0/0 .   .   .   .   .   None    None    None    .   None    None    .   .   .   .   .   .
2   15881106    .   C   CA  33.32   PASS    0.042   0.253   0.0 654 None    None    3.01    0.0 None    -0.047  1   0.042   60.0    0.0 6.66    None    0.253   0.223   HignConfSNPs    4,5,6   0   6   6,0 6   18  0/0 None    None    None    0,18,206    None    None    0/0 .   .   .   0/0 .   .   .   .   .   None    None    None    .   None    None    .   .   .   .   .   .


Report simplified output in "long" format

python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode long --samples MA605,ms01e

Expected output

CHROM   POS REF ALT FILTER  AF  AN  BaseQRankSum    ClippingRankSum SAMPLE  PI  GT  PG
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 MA605   .   0/0 0/0
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 ms01e   .   ./. ./.
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 MA605   .   0/0 0/0
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 ms01e   .   .   .
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 MA605   .   0/0 0/0
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 ms01e   .   .   .
2   15881156    A   G   PASS    0.5 6   None    None    MA605   .   0/0 0/0
2   15881156    A   G   PASS    0.5 6   None    None    ms01e   .   .   .
2   15881224    T   G   PASS    0.036   12  1.75    0.0 MA605   .   0/0 0/0
2   15881224    T   G   PASS    0.036   12  1.75    0.0 ms01e   .   ./. ./.
2   15881229    C   G   PASS    0.308   10  None    None    MA605   .   0/0 0/0
2   15881229    C   G   PASS    0.308   10  None    None    ms01e   .   ./. ./.
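One reason the long format is convenient is that per-sample summaries reduce to a simple group-by on the SAMPLE column. The sketch below counts called genotypes per sample using only the standard library. The function name `called_per_sample` is hypothetical, the table is a subset of the rows and columns shown above, and treating a genotype as "called" when it contains at least one allele index is an assumption made for this example.

```python
import csv
import io
from collections import Counter

# A subset of rows and columns from the long example output above.
LONG_TABLE = (
    "CHROM\tPOS\tSAMPLE\tGT\n"
    "2\t15881018\tMA605\t0/0\n"
    "2\t15881018\tms01e\t./.\n"
    "2\t15881080\tMA605\t0/0\n"
    "2\t15881080\tms01e\t.\n"
)


def called_per_sample(rows):
    """Count, per sample, the rows whose GT contains at least one allele index."""
    counts = Counter()
    for row in rows:
        counts[row["SAMPLE"]] += 0  # make sure every sample gets an entry
        if any(ch.isdigit() for ch in row["GT"]):
            counts[row["SAMPLE"]] += 1
    return dict(counts)


rows = list(csv.DictReader(io.StringIO(LONG_TABLE), delimiter="\t"))
summary = called_per_sample(rows)
print(summary)
```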


Upcoming features:

  • Ability to add genotype bases for fields other than "GT".
  • Write the table back to a VCF file.

    Citation: Giri, B.K. (2018). VCF-simplify: A VCF simplification tool.


How does the tool handle SVs? In particular, what happens with variants reported as a) symbolic alleles or b) in breakend notation?

written 3 months ago by d-cameron

I haven't dealt with that directly. My assumption is that it should report them as they are.

This tool is meant to simplify the VCF output in the simplest way possible, so the simplification is limited to the structure of the output. It doesn't interpret fields or tags.

Interpreted extraction of specific tags or fields should be handled with pyvcf or cyvcf by writing custom methods. The only field that is converted into an interpretable value is "GT".

See the provided examples.

Hope it helps!

written 3 months ago by kirannbishwa01

Looks nice for users with minimal programming experience. Have you tested it on a wide range of VCFs?

In addition, for large multi-sample VCFs, you may consider adding some functions that can do what I have done here:

written 3 months ago by Kevin Blighe

@Kevin: These look like good add-on methods for the tool. The implementation shouldn't take long.

I will have to check whether GATK already has a method that does "A" and adds the result to the INFO field, so that I am not reinventing the wheel. "B" looks like an extensive version of "B".

Thanks,

written 3 months ago by kirannbishwa01

I had some time to think over your question. Symbolic variants can only be mined if cyvcf2 has a built-in method for them, and I don't think it has one, because a symbolic allele needs a method that interprets the symbol (a deletion overlapping the variant, an inversion, etc.). Symbolic variants only make sense in relation to alignment data (SAM, BAM) and cannot be interpreted from the REF and ALT alleles alone, so cyvcf2 cannot extract them directly. Hence, VCF-simplify is limited in its ability to interpret symbolic variants.

Hopefully this kind of limitation will change in the future.
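For anyone who wants to flag such records before or after simplifying, the ALT column can at least be classified by its syntax without interpreting it. The sketch below is not part of VCF-simplify; the function name `classify_alt` and the category labels are hypothetical. It relies only on the notation defined in the VCF specification: symbolic alleles are written in angle brackets, paired breakends contain square brackets, and single breakends start or end with a dot.

```python
import re

BASES_ONLY = re.compile(r"^[ACGTNacgtn]+$")


def classify_alt(allele):
    """Classify a single ALT allele by its notation, without interpreting it."""
    if allele == ".":
        return "missing"
    if allele == "*":
        return "spanning_deletion"
    if allele.startswith("<") and allele.endswith(">"):
        return "symbolic"          # e.g. <DEL>, <INV>
    if "[" in allele or "]" in allele:
        return "breakend"          # e.g. G]17:198982]
    if allele.startswith(".") or allele.endswith("."):
        return "single_breakend"   # e.g. .A or G.
    if BASES_ONLY.match(allele):
        return "sequence"
    return "other"


def classify_alt_field(alt_field):
    """Classify every allele in a comma-separated ALT column value."""
    return [classify_alt(allele) for allele in alt_field.split(",")]


print(classify_alt_field("A,C"))
print(classify_alt_field("<DEL>"))
print(classify_alt_field("G]17:198982]"))
```

Rows whose ALT is not plain sequence could then be routed to a structural-variant-aware tool while the rest go through the table as usual.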

written 3 months ago by kirannbishwa01