Faster filtering of VCFs using split-vep instead of filter_vep
4.1 years ago
tacrolimus ▴ 140

Dear Biostars,

I am trying to filter a multi-sample VCF that has been annotated with VEP to obtain a set of rare, likely deleterious calls. However, the file is very large, and filter_vep takes a very long time (more than 5 days per chromosome in an HPC environment). I have been told that the bcftools plugin split-vep performs better for this, and I was wondering what an equivalent query would look like, as I have been struggling to write one.

For example:

filter_vep -i my.vcf -o my_filtered.vep --filter "(MAX_AF < 0.01 or not MAX_AF) and (CADD_PHRED gte 20 or not CADD_PHRED)"

Could this be reproduced using split-vep? I would want to output the entire VCF line (ideally with the header) so that the result remains a valid VCF file.

Many thanks!

VEP bcftools split-vep vcf • 2.2k views

Hey omid.alavijeh,

Could you please show the header of the VCF file and the first few variants?

fin swimmer


Hi @finswimmer,

I work in an airlock environment, so I can't bring data out, but it looks like this (taken from another site, but basically the same).

head my.vcf
##fileformat=VCFv4.0
##VEP="v91" time="2018-01-04 23:07:28" cache="/home/davetang/.vep/homo_sapiens/91_GRCh37" ensembl-variation=91.c78d8b4 ensembl-io=91.923d668 ensembl=91.18ee742 ensembl-funcgen=91.4681d69 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO 
21      26960070        rs116645811     G       A       .       .       CSQ=A|missense_variant|MODERATE|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|10/11||||1043|1001|334|T/M|aCg/aTg|||-1||HGNC|14027,A|intron_variant|MODIFIER|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding||9/9||||||||||-1||HGNC|14027,A|upstream_gene_variant|MODIFIER|LINC00515|ENSG00000260583|Transcript|ENST00000567517|antisense|||||||||||4432|-1||HGNC|16019
21      26965148        rs1135638       G       A       .       .      CSQ=A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||939|897|299|G|ggC/ggT|||-1||HGNC|14027,A|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||876|867|289|G|ggC/ggT|||-1|cds_end_NF|HGNC|14027
21      26965172        rs10576 T       C       .       .       CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||915|873|291|P|ccA/ccG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||852|843|281|P|ccA/ccG|||-1|cds_end_NF|HGNC|14027
21      26965205        rs1057885       T       C       .       .      CSQ=C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|8/11||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|8/10||||882|840|280|V|gtA/gtG|||-1||HGNC|14027,C|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|8/8||||819|810|270|V|gtA/gtG|||-1|cds_end_NF|HGNC|14027
21      26976144        rs116331755     A       G       .       .      CSQ=G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000307301|protein_coding|3/11||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000352957|protein_coding|3/10||||426|384|128|L|ctT/ctC|||-1||HGNC|14027,G|synonymous_variant|LOW|MRPL39|ENSG00000154719|Transcript|ENST00000419219|protein_coding|3/8||||393|384|128|L|ctT/ctC|||-1|cds_end_NF|HGNC|14027
21      26976222        rs7278168       C       T       .       .      CSQ=T|synonymous_variant|LOW|MRPL39|
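
For reference, the logic of the filter in the question can be written out per CSQ entry. This is only a minimal Python sketch of what the expression means, not a replacement for filter_vep or split-vep, and it will not be fast on a large file. It assumes that MAX_AF and CADD_PHRED are present as CSQ sub-fields (they are not in the excerpt above, so the header in the demo is hypothetical), that an empty sub-field means "missing", and that a record is kept when at least one consequence passes. The function names are made up for illustration.

```python
# Sketch only: spells out the filter logic for a single VCF record.

def csq_fields(header_line):
    """Extract CSQ sub-field names from the ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip('">').split("|")


def parse_csq(info, fields):
    """Return one dict per transcript consequence found in an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(fields, c.split("|"))) for c in entry[4:].split(",")]
    return []


def passes(csq, max_af=0.01, min_cadd=20.0):
    """(MAX_AF < max_af or missing) and (CADD_PHRED >= min_cadd or missing)."""
    af = csq.get("MAX_AF", "")
    cadd = csq.get("CADD_PHRED", "")
    af_ok = af == "" or float(af) < max_af
    cadd_ok = cadd == "" or float(cadd) >= min_cadd
    return af_ok and cadd_ok


def keep_record(info, fields):
    """Assumption: keep the whole VCF line if any consequence passes."""
    return any(passes(c) for c in parse_csq(info, fields))


if __name__ == "__main__":
    # Hypothetical, shortened header; a real one lists many more sub-fields.
    header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
              'annotations from Ensembl VEP. Format: '
              'Allele|Consequence|MAX_AF|CADD_PHRED">')
    fields = csq_fields(header)
    print(keep_record("CSQ=A|missense_variant|0.002|25.1,A|intron_variant||", fields))
    print(keep_record("CSQ=A|missense_variant|0.2|25.1", fields))
```

The point of the sketch is that the "or not MAX_AF" and "or not CADD_PHRED" branches keep consequences where the annotation is absent, so any split-vep expression meant to reproduce the filter has to handle missing values explicitly as well.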