filter_vep not filtering according to my --filter criteria
11 months ago
jan ▴ 170

Hi,

I'm trying to filter my VCFs using filter_vep (https://asia.ensembl.org/info/docs/tools/vep/script/vep_filter.html) following certain criteria. Variants in my output need to pass all filters.

filter_vep \
        --input_file input.vcf.gz \
        --output_file out.vcf \
        --format vcf \
        --force_overwrite \
        --only_matched \
        --filter "CANONICAL is YES" \
        --filter "BIOTYPE is protein_coding" \
        --filter "gnomAD_AF < 0.01 or not gnomAD_AF" \
        --filter "(IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant)) or (REVEL > 0.5) or (VEST4_rankscore > 0.5) or (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) or (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8))"

However, I keep getting non-canonical transcripts and biotypes other than protein_coding, such as lncRNA, in my output. As I understand it, multiple --filter flags may be used and are treated as logical ANDs, i.e. all filters must pass for a line to be printed. I'm not sure what I'm doing wrong here. Could anyone help point out any errors in my command?

Here's an example of a variant in the output file after running filter_vep:

chr1    2556714 .       A       G       672.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=0.284;DP=41;ExcessHet=3.0103;FS=6.967;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=16.41;ReadPosRankSum=1.19;SOR=0.454;CSQ=G|intron_variant&non_coding_transcript_variant|MODIFIER|TNFRSF14-AS1|ENSG00000238164|Transcript|ENST00000416860|lncRNA||1/5|ENST00000416860.2:n.36-18T>C|||||||rs4870||-1||SNV|HGNC|HGNC:26966|||2|||||||||||||0.6148|0.7837|0.5303|0.5397|0.4682|0.6748|0.7263|0.472|0.5136|0.7267|0.5108|0.4422|0.4915|0.4894|0.4669|0.4949|0.6332|0.7837|AFR|not_provided||1|24728327&19825846|ClinVar::VCV000135349&RCV000122164--Uniprot::VAR_013007||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||2||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3.056|0.318|3.375|||,G|intron_variant&non_coding_transcript_variant|MODIFIER|TNFRSF14-AS1|ENSG00000238164|Transcript|ENST00000452793|lncRNA||1/3|ENST00000452793.1:n.56-18T>C|||||||rs4870||-1||SNV|HGNC|HGNC:26966|||3|||||||||||||0.6148|0.7837|0.5303|0.5397|0.4682|0.6748|0.7263|0.472|0.5136|0.7267|0.5108|0.4422|0.4915|0.4894|0.4669|0.4949|0.6332|0.7837|AFR|not_provided||1|24728327&19825846|ClinVar::VCV000135349&RCV000122164--Uniprot::VAR_013007||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||3.056|0.318|3.375|||     GT:AD:DP:GQ:PL  0/1:17,24:41:99:701,0,458

Here's the CSQ header line:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|VAR_SYNONYMS|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|REVEL|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AltaiNeandertal|Ancestral_allele|CADD_phred|CADD_raw|CADD_raw_rankscore|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-pred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V7_gene|GTEx_V7_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_controls_AC|gnomAD_exomes_controls_AF|gnomAD_exomes_controls_AFR_AC|gnomAD_exomes_controls_AFR_AF|gnomAD_exomes_controls_AFR_AN|gnomAD_exomes_controls_AFR_nhomalt|gnomAD_exomes_controls_AMR_AC|gnomAD_exomes_controls_AMR_AF|gnomAD_exomes_controls_AMR_AN|gnomAD_exomes_controls_AMR_nhomalt|gnomAD_exomes_controls_AN|gnomAD_exomes_controls_ASJ_AC|gnomAD_exomes_controls_ASJ_AF|gnomAD_exomes_controls_ASJ_AN|gnomAD_exomes_controls_ASJ_nhomalt|gnomAD_exomes_controls_EAS_AC|gnomAD_exomes_controls_EAS_AF|gnomAD_exomes_controls_EAS_AN|gnomAD_exomes_controls_EAS_nhomalt|gnomAD_exomes_controls_FIN_AC|gnomAD_exomes_controls_FIN_AF|gnomAD_exomes_controls_FIN_AN|gnomAD_exomes_controls_FIN_nhomalt|gnomAD_exomes_controls_NFE_AC|gnomAD_exomes_controls_NFE_AF|gnomAD_exomes_controls_NFE_AN|gnomAD_exomes_controls_NFE_nhomalt|gnomAD_exomes_controls_POPMAX_AC|gnomAD_exomes_controls_POPMAX_AF|gnomAD_exomes_controls_POPMAX_AN|gnomAD_exomes_controls_POPMAX_nhomalt|gnomAD_exomes_controls_SAS_AC|gnomAD_exomes_controls_SAS_AF|gnomAD_exomes_controls_SAS_AN|gnomAD_exomes_controls_SAS_nhomalt|gnomAD_exomes_controls_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_controls_AC|gnomAD_genomes_controls_AF|gnomAD_genomes_controls_AFR_AC|gnomAD_genomes_controls_AFR_AF|gnomAD_genomes_controls_AFR_AN|gnomAD_genomes_controls_AFR_nhomalt|gnomAD_genomes_controls_AMR_AC|gnomAD_genomes_controls_AMR_AF|gnomAD_genomes_controls_AMR_AN|gnomAD_genomes_controls_AMR_nhomalt|gnomAD_genomes_controls_AN|gnomAD_genomes_controls_ASJ_AC|gnomAD_genomes_controls_ASJ_AF|gnomAD_genomes_controls_ASJ_AN|gnomAD_genomes_controls_ASJ_nhomalt|gnomAD_genomes_controls_EAS_AC|gnomAD_genomes_controls_EAS_AF|gnomAD_genomes_controls_EAS_AN|gnomAD_genomes_controls_EAS_nhomalt|gnomAD_genomes_controls_FIN_AC|gnomAD_genomes_controls_FIN_AF|gnomAD_genomes_controls_FIN_AN|gnomAD_genomes_controls_FIN_nhomalt|gnomAD_genomes_controls_NFE_AC|gnomAD_genomes_controls_NFE_AF|gnomAD_genomes_controls_NFE_AN|gnomAD_genomes_controls_NFE_nhomalt|gnomAD_genomes_controls_POPMAX_AC|gnomAD_genomes_controls_POPMAX_AF|gnomAD_genomes_controls_POPMAX_AN|gnomAD_genomes_controls_POPMAX_nhomalt|gnomAD_genomes_controls_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons30way_mammalian|phastCons30way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP30way_mammalian|phyloP30way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP151|TSSDistance|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|GO|miRNA|FunMotifs">
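
With several hundred CSQ fields, it is hard to see by eye which transcript block fails which criterion. A minimal sketch of one way to check: look up a field's position in the header's Format list and print only the fields of interest for each block. The shortened header, the second (protein_coding) block, and the `csq_index` helper are all hypothetical and for illustration only; on a real file the header and CSQ value would be read from the VCF itself.

```shell
#!/usr/bin/env bash
# Hypothetical shortened CSQ header (the real one has many more fields).
header='##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|CANONICAL">'

# Two CSQ blocks: the first is cut down from the record in the post,
# the second is invented to show a block that would satisfy the criteria.
csq='G|intron_variant&non_coding_transcript_variant|MODIFIER|TNFRSF14-AS1|ENSG00000238164|Transcript|ENST00000416860|lncRNA|,G|missense_variant|MODERATE|GENE1|ENSG_EXAMPLE|Transcript|ENST_EXAMPLE|protein_coding|YES'

# csq_index NAME: print the 1-based position of NAME in the Format list.
# head -n 1 keeps the first hit, since real headers can repeat a name.
csq_index() {
  printf '%s\n' "$header" \
    | sed -e 's/.*Format: //' -e 's/">$//' \
    | tr '|' '\n' \
    | grep -nxF -- "$1" \
    | head -n 1 \
    | cut -d: -f1
}

f=$(csq_index Feature)
b=$(csq_index BIOTYPE)
c=$(csq_index CANONICAL)

# One line per transcript block: Feature, BIOTYPE, CANONICAL ("-" if empty).
summary=$(printf '%s\n' "$csq" | tr ',' '\n' \
  | awk -F'|' -v f="$f" -v b="$b" -v c="$c" \
      '{ print $f, $b, ($c == "" ? "-" : $c) }')

printf '%s\n' "$summary"
```

Blocks that show a biotype other than protein_coding, or "-" for CANONICAL, are the ones the first two filters were meant to exclude.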
11 months ago
Ben_Ensembl ★ 1.8k

Hi Jan,

The filters look fine, but the input is a gzipped file, so what is probably missing is the --gz option needed to read it.

Best wishes

Ben


Thanks Ben. I don't think the --gz flag was the issue: I still get output without it, but the variants in my output don't pass all of the filters.

When I combine all my filters into a single --filter expression, as below (still without the --gz flag), I get my desired output, i.e. only variants passing all my filters. But when I use multiple --filter flags, it seems to treat each filter separately and behaves more like an "OR" operator.

filter_vep \
        --input_file input.vcf.gz \
        --output_file output.vcf \
        --format vcf \
        --force_overwrite \
        --only_matched \
        --filter "(CANONICAL is YES) and (BIOTYPE is protein_coding) and (SYMBOL) and (gnomAD_AF < 0.01 or not gnomAD_AF) and ((IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant or not Aloft_pred)) or (REVEL > 0.5) or (VEST4_rankscore > 0.5) or (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) or (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8)))"
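
A minimal sketch of how a combined expression like the one above could be assembled from separate criteria, wrapping each one in its own parentheses before joining with "and". The `criteria` array and `combined` variable are hypothetical names, and the last criterion is shortened from the thread for brevity.

```shell
#!/usr/bin/env bash
# Individual criteria, kept separate for readability.
criteria=(
  "CANONICAL is YES"
  "BIOTYPE is protein_coding"
  "gnomAD_AF < 0.01 or not gnomAD_AF"
  "(REVEL > 0.5) or (VEST4_rankscore > 0.5)"
)

# Wrap every criterion in parentheses, then join with " and ",
# so an "or" inside one criterion cannot leak into its neighbours.
combined=""
for c in "${criteria[@]}"; do
  if [ -n "$combined" ]; then
    combined+=" and "
  fi
  combined+="($c)"
done

printf '%s\n' "$combined"
# The result would then be passed as a single expression, e.g.:
#   filter_vep --input_file input.vcf.gz --output_file output.vcf \
#       --format vcf --only_matched --filter "$combined"
```

Because each criterion is parenthesized before joining, the result does not depend on how "and" and "or" are ranked relative to each other.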

Hi Jan,

Yes, when multiple filters are used they behave like 'AND' operators. Looking in more detail, it seems the problem is that the last filter is missing its outer parentheses:

--filter "((IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant)) or (REVEL > 0.5) or (VEST4_rankscore > 0.5) or (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) or (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8)))"

Without the outer parentheses, when the filters are merged the final filter is:

... gnomAD_AF < 0.01 or not gnomAD_AF AND (IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant)) OR (REVEL > 0.5) OR (VEST4_rankscore > 0.5) OR (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) OR (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8))

Instead of being:

... gnomAD_AF < 0.01 or not gnomAD_AF AND ((IMPACT is HIGH and (Aloft_pred match Recessive or Aloft_pred match Dominant)) or (REVEL > 0.5) or (VEST4_rankscore > 0.5) or (MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5) or (CADD_phred > 30 and (phastCons30way_mammalian_rankscore > 0.8 or phyloP30way_mammalian_rankscore > 0.8 or GERP++_RS_rankscore > 0.8)))
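
The difference between the two merged forms can be illustrated with plain boolean logic. The sketch below uses bash's `[[ ]]`, where `&&` binds more tightly than `||`. That filter_vep ranks "and" and "or" the same way is an assumption here, suggested by the behaviour described in this thread but not checked against its source. The truth values model the lncRNA record from the question: CANONICAL is empty, BIOTYPE is lncRNA, and, reading the last populated fields against the header, MaxEntScan_alt is 3.056 and MaxEntScan_diff is 0.318, so the MaxEntScan clause is true. The variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Truth values of individual clauses for the lncRNA record.
canonical=no        # CANONICAL is YES             -> false
protein_coding=no   # BIOTYPE is protein_coding    -> false
impact_clause=no    # IMPACT is HIGH and ...       -> false (MODIFIER)
splice_clause=yes   # MaxEntScan_diff > 0 and MaxEntScan_alt <= 8.5 -> true

# Merged without outer parentheses: A and B and C or D,
# which is grouped as (A and B and C) or D.
if [[ $canonical == yes && $protein_coding == yes && $impact_clause == yes || $splice_clause == yes ]]; then
  without_parens=pass
else
  without_parens=fail
fi

# Merged with outer parentheses: A and B and (C or D).
if [[ $canonical == yes && $protein_coding == yes && ( $impact_clause == yes || $splice_clause == yes ) ]]; then
  with_parens=pass
else
  with_parens=fail
fi

echo "without outer parentheses: $without_parens"
echo "with outer parentheses:    $with_parens"
```

Under that assumption, a single true "or" clause is enough to let a non-canonical lncRNA block through when the outer parentheses are missing, which matches the output seen in the question.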

Thanks for the correction Ben. It is now working as intended.
