Tool: VCF-simplify: a VCF simplification tool.
3.3 years ago
kirannbishwa01 ★ 1.3k

# VCF-simplify.v2

A Python parser that simplifies a VCF file into a table-like format. https://github.com/everestial/VCF-simplify

There are several tools available to manipulate and alter VCF files. But a simple, comprehensive tool that produces the simplest output required by empirical biologists is still missing.

This tool takes a sorted VCF file and reports a simplified table of the INFO and FORMAT fields for each SAMPLE of interest. By default (no extra options), all INFO and FORMAT fields are simplified for all samples. The fields can be narrowed down with the command-line options. See the examples given below.

The output table can be created in both "long" and "wide" formats, which makes mining the data by sample versus position simple. The output can be filtered further downstream with awk, or loaded into R and used with tidyr and dplyr, where columns can be accessed by matching names, prefixes, or suffixes.
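
For readers who prefer to stay in Python, the same kind of column selection by prefix or suffix can be sketched with the standard library alone. This is only an illustration: `select_columns` is a hypothetical helper (not part of VCF-simplify), and the table is a reduced subset of the columns from the wide-format example further down.

```python
import csv
import io

# A reduced, tab-separated subset of the wide-format example output.
wide_table = (
    "CHROM\tPOS\tREF\tALT\tMA605_GT\tMA605_PG\tms01e_GT\tms01e_PG\n"
    "2\t15881018\tG\tA,C\tG/G\t0/0\t./.\t./.\n"
    "2\t15881224\tT\tG\tT/T\t0/0\t./.\t./.\n"
)


def select_columns(rows, prefix=None, suffix=None, keep=("CHROM", "POS")):
    """Keep the `keep` columns plus any column matching the prefix or suffix."""
    selected = []
    for row in rows:
        selected.append({
            name: value
            for name, value in row.items()
            if name in keep
            or (prefix and name.startswith(prefix))
            or (suffix and name.endswith(suffix))
        })
    return selected


rows = list(csv.DictReader(io.StringIO(wide_table), delimiter="\t"))
ma605 = select_columns(rows, prefix="MA605_")   # all fields of one sample
gt_only = select_columns(rows, suffix="_GT")    # one field across samples
```

Selecting by `SAMPLE_` prefix gives a per-sample view, while selecting by `_FIELD` suffix gives a per-field view across samples. This is the practical benefit of the `SAMPLE_FIELD` column naming in wide mode.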

# Prerequisites:

Python packages and modules:

## Usage (using the given input test data):

### Call for available options

python3 vcf_simplify-v2.py --help


### If no options are provided, all INFO and FORMAT fields are reported for all samples

python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt


### Report wide output and "GT" as nucleotide bases

python3 vcf_simplify-v2.py --vcf input_test.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode wide --samples MA605,ms01e  --gtbase yes


### Expected output

CHROM   POS REF ALT FILTER  AF  AN  BaseQRankSum    ClippingRankSum MA605_PI    MA605_GT    MA605_PG    ms01e_PI    ms01e_GT    ms01e_PG
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 .   G/G 0/0 .   ./. ./.
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 .   A/A 0/0 .   ./. .
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 .   C/C 0/0 .   ./. .
2   15881156    A   G   PASS    0.5 6   None    None    .   A/A 0/0 .   ./. .
2   15881224    T   G   PASS    0.036   12  1.75    0.0 .   T/T 0/0 .   ./. ./.
2   15881229    C   G   PASS    0.308   10  None    None    .   C/C 0/0 .   ./. ./.
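
To make the "GT as nucleotide bases" idea concrete, here is a minimal sketch of how a numeric genotype can be translated using the REF and ALT columns. It is not taken from the VCF-simplify source; `gt_to_bases` is a hypothetical name, and the sketch only assumes the standard VCF convention that allele index 0 is REF and indices 1, 2, ... follow the comma-separated ALT list.

```python
import re


def gt_to_bases(gt, ref, alt):
    """Translate a numeric GT such as '0/1' into bases such as 'G/A'."""
    alleles = [ref] + alt.split(",")
    # Split on the separators but keep them: '/' is unphased, '|' is phased.
    parts = re.split(r"([/|])", gt)
    translated = []
    for part in parts:
        if part.isdigit():
            translated.append(alleles[int(part)])
        else:
            # A separator, or a missing allele written as '.'
            translated.append(part)
    return "".join(translated)


print(gt_to_bases("0/0", "G", "A,C"))  # G/G
print(gt_to_bases("1|2", "G", "A,C"))  # A|C
print(gt_to_bases("./.", "G", "A,C"))  # ./.
```

Missing genotypes pass through unchanged, which matches the `./.` values seen for ms01e in the table above.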


### Report simplified output (all available fields) for samples MA605 and ms01e

python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --samples MA605,ms01e


### Expected output

CHROM   POS ID  REF ALT QUAL    FILTER  AF  BaseQRankSum    ClippingRankSum DP  DS  END ExcessHet   FS  HaplotypeScore  InbreedingCoeff MLEAC   MLEAF   MQ  MQRankSum   QD  RAW_MQ  ReadPosRankSum  SOR set SF  AC  AN  MA605_AD    MA605_DP    MA605_GQ    MA605_GT    MA605_MIN_DP    MA605_PGT   MA605_PID   MA605_PL    MA605_RGQ   MA605_SB    MA605_PG    MA605_PB    MA605_PI    MA605_PM    MA605_PW    MA605_PC    ms01e_AD    ms01e_DP    ms01e_GQ    ms01e_GT    ms01e_MIN_DP    ms01e_PGT   ms01e_PID   ms01e_PL    ms01e_RGQ   ms01e_SB    ms01e_PG    ms01e_PB    ms01e_PI    ms01e_PM    ms01e_PW    ms01e_PC
2   15881018    .   G   A,C 5082.45 PASS    1.0 -0.771  0.0 902 None    None    0.005   0.0 None    0.8 12,1    0.462,0.038 60.29   0.0 33.99   None    0.26    0.657   HignConfSNPs    0,1,2,3,4,5,6   2,0 8   3,0,0   3   9   0/0 None    None    None    0,9,112,9,112,112   None    None    0/0 .   .   .   0/0 .   0,0 0   .   ./. None    None    None    0,0,0,.,.,. None    None    ./. .   .   .   ./. .
2   15881080    .   A   G   4336.44 PASS    0.458   -0.732  0.0 729 None    None    0.01    0.0 None    0.826   11  0.458   60.0    0.0 34.24   None    -0.414  0.496   HignConfSNPs    4,5,6   0   6   5,0 5   15  0/0 None    None    None    0,15,181    None    None    0/0 .   .   .   0/0 .   .   .   .   .   None    None    None    .   None    None    .   .   .   .   .   .
2   15881106    .   C   CA  33.32   PASS    0.042   0.253   0.0 654 None    None    3.01    0.0 None    -0.047  1   0.042   60.0    0.0 6.66    None    0.253   0.223   HignConfSNPs    4,5,6   0   6   6,0 6   18  0/0 None    None    None    0,18,206    None    None    0/0 .   .   .   0/0 .   .   .   .   .   None    None    None    .   None    None    .   .   .   .   .   .


### Report simplified output in "long" format

python3 vcf_simplify-v2.py --vcf F1.phased_variants.Final02.vcf --out simplified_vcf.txt --infos AF,AN,BaseQRankSum,ClippingRankSum --formats PI,GT,PG --pre_header CHROM,POS,REF,ALT,FILTER --mode long --samples MA605,ms01e


### Expected output

CHROM   POS REF ALT FILTER  AF  AN  BaseQRankSum    ClippingRankSum SAMPLE  PI  GT  PG
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 MA605   .   0/0 0/0
2   15881018    G   A,C PASS    1.0 8   -0.771  0.0 ms01e   .   ./. ./.
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 MA605   .   0/0 0/0
2   15881080    A   G   PASS    0.458   6   -0.732  0.0 ms01e   .   .   .
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 MA605   .   0/0 0/0
2   15881106    C   CA  PASS    0.042   6   0.253   0.0 ms01e   .   .   .
2   15881156    A   G   PASS    0.5 6   None    None    MA605   .   0/0 0/0
2   15881156    A   G   PASS    0.5 6   None    None    ms01e   .   .   .
2   15881224    T   G   PASS    0.036   12  1.75    0.0 MA605   .   0/0 0/0
2   15881224    T   G   PASS    0.036   12  1.75    0.0 ms01e   .   ./. ./.
2   15881229    C   G   PASS    0.308   10  None    None    MA605   .   0/0 0/0
2   15881229    C   G   PASS    0.308   10  None    None    ms01e   .   ./. ./.
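
The relationship between the two modes can be sketched as a small reshaping step: each wide row becomes one long row per sample, with the `SAMPLE_` prefix moved into a `SAMPLE` column. This is an illustration only, not the tool's own code; `wide_to_long` is a hypothetical helper and it assumes sample names do not clash once `_` is appended.

```python
def wide_to_long(row, samples, pre_header):
    """Split one wide row (a dict) into one dict per sample."""
    shared = {name: row[name] for name in pre_header}
    long_rows = []
    for sample in samples:
        prefix = sample + "_"
        record = dict(shared)
        record["SAMPLE"] = sample
        for name, value in row.items():
            if name.startswith(prefix):
                # 'MA605_GT' becomes 'GT' in the row for sample MA605
                record[name[len(prefix):]] = value
        long_rows.append(record)
    return long_rows


# First record of the example, restricted to a few columns.
wide_row = {
    "CHROM": "2", "POS": "15881018", "REF": "G", "ALT": "A,C",
    "MA605_PI": ".", "MA605_GT": "0/0", "MA605_PG": "0/0",
    "ms01e_PI": ".", "ms01e_GT": "./.", "ms01e_PG": "./.",
}
long_rows = wide_to_long(wide_row, ["MA605", "ms01e"],
                         ["CHROM", "POS", "REF", "ALT"])
```

The long layout repeats the site-level columns for every sample, which is convenient for grouping by sample in downstream tools.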


## Upcoming features:

• Ability to add genotype bases for fields other than "GT".
• Write the table back to a VCF file.

Citation: Giri, B.K. (2018). VCF-simplify: A VCF simplification tool.


How does the tool handle SVs? In particular, what happens with variants reported as a) symbolic alleles or b) in breakend notation?


I haven't dealt with that directly. My assumption is that it should report them as they are.

This tool is meant to simplify the VCF output in the simplest possible way, so the simplification is limited to the structure of the output. It doesn't interpret fields or tags.

Interpreted extraction of specific tags or fields should be handled with pyvcf or cyvcf by writing custom methods. The only field that is converted into an interpretable value is "GT".

See the provided examples.

Hope it helps!


Looks nice for users with minimal programming experience. Have you tested it on a wide range of VCFs?

In addition, for large multi-sample VCFs, you may consider adding some functions that can do what I have done here:


@Kevin: These look like good add-on methods for the tool. The implementation shouldn't take long.

I will have to check whether there is already a GATK method that does "A" and adds the result to the INFO field, so that I am not reinventing the wheel. "B" looks like an extended version of "A".

Thanks,


I had some time to think over your question. Symbolic variants can only be mined if cyvcf2 has a method built in for them, and I think it has none, because a symbolic allele needs a method to interpret the symbol (be it a deletion overlapping the variant, an inversion, etc.). Symbolic variants only make sense in relation to alignment data (SAM, BAM) and cannot be interpreted from the REF and ALT alleles alone; therefore they cannot be extracted by cyvcf2 directly. Hence, VCF-simplify is limited in its ability to interpret symbolic variants.

I hope this type of issue will change in the future.
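
Even without interpreting them, such alleles can at least be flagged in the simplified table so they can be filtered or handled separately. The sketch below is not part of VCF-simplify; `classify_alt` is a hypothetical helper that relies only on the notation described in the VCF specification (symbolic alleles in angle brackets, breakends containing square brackets, single breakends with a leading or trailing dot).

```python
def classify_alt(alt):
    """Label a single ALT allele by its notation, without interpreting it."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"           # e.g. <DEL>, <INV>
    if "[" in alt or "]" in alt:
        return "breakend"           # e.g. G]17:198982]
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return "single_breakend"    # e.g. G. or .G
    if alt == "*":
        return "spanning_deletion"
    if alt == ".":
        return "missing"
    return "sequence"               # plain bases such as A or CA


# The ALT column may hold several comma-separated alleles.
labels = [classify_alt(a) for a in "A,<DEL>".split(",")]
```

Splitting the ALT column on commas first matters here, because a multi-allelic site can mix plain and symbolic alleles.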

23 months ago
kirannbishwa01 ★ 1.3k

VCFSIMPLIFY has been updated with new features: https://github.com/everestial/VCF-Simplify/releases/tag/v3.0.1 , https://github.com/everestial/VCF-Simplify

• External dependencies have been removed, so Python 3.6 or above is all that is needed.
• An option to cythonize has been added to optimize run time.
• More control is provided to include, exclude, or extract VCF metadata and record information.

Thanks,


Hi, it is great to have a simple way to convert a VCF to text output. Does the current version of VCFSIMPLIFY have an option to write zygosity (HOM or HET) into a column?


It's already doing that with the 0/0, 0/1 etc. You'll probably need to use a simple awk to process the GT columns and output HOM or HET as you see fit.


@qaedi This tool doesn't support this feature at the moment. As suggested by RamRS, you may write a simple Python or awk script. I do, however, plan to add that feature to a VCF parser I am currently building, and also to VCF SIMPLIFY. It will be some time before it rolls out, though.
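
In the meantime, a minimal Python sketch of such a script could look like the following. It is an assumption-laden illustration, not a feature of the tool: `zygosity` is a hypothetical function, it treats any genotype with a missing allele as unknown, and it labels a haploid call as HOM.

```python
import re


def zygosity(gt):
    """Return 'HOM', 'HET', or '.' for a GT value such as '0/1' or 'G/A'."""
    alleles = re.split(r"[/|]", gt)
    # Any missing allele ('.' or empty) makes the zygosity unknown.
    if any(allele in (".", "") for allele in alleles):
        return "."
    return "HOM" if len(set(alleles)) == 1 else "HET"


for gt in ("0/0", "0|1", "1/1", "./.", "G/A"):
    print(gt, zygosity(gt))
```

Because it only compares the alleles with each other, the same function works on numeric GT values and on GT values written as bases.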