VEP ENSEMBL unexpected output when considering vcf insertions.
6.4 years ago

I've been analysing some next-generation DNA sequencing datasets over the past months, thoroughly examining both the datasets themselves and the tools used to analyse them.

In one of the last steps, I came across a VCF file (simplified below to give a reproducible case):

[cedric@laptop]:/data_error$ tail insertions.vcf |cut -f 1,2,4,5,6
#CHROM  POS REF ALT QUAL
chrX    48991024    T   TG  .
chrX    48996531    T   C   .
chrX    49068386    C   CT  .
chrX    49068452    C   T   .
chrX    49068845    CTG C   .
chrX    123697721   C   CT  .
chrX    135052062   A   AG  .

Running vep on this vcf file:

../../vep/vep -i insertions.vcf -o outputvep_inserts --cache --force_overwrite --symbol

Generates the following vep output file :

cat outputvep_inserts|cut -f 2,3,11,12|uniq
Location    Allele  Amino_acids Codons
chrX:48991024-48991025  G   -/X -/C
chrX:48991024-48991025  G   -   -
chrX:48991024-48991025  G   -/X -/C
chrX:48991024-48991025  G   -   -
chrX:48991024-48991025  G   P/PX    cca/ccCa
chrX:48991024-48991025  G   -/X -/C
chrX:48996531   C   -   -
chrX:49068386-49068387  T   -   -
chrX:49068452   T   -   -
chrX:49068846-49068847  -   -   -
chrX:123697721-123697722    T   S/KX    agc/aAgc
chrX:123697721-123697722    T   -   -
chrX:123697721-123697722    T   S/KX    agc/aAgc
chrX:123697721-123697722    T   -   -
chrX:135052062-135052063    G   L/PX    ctc/cCtc
chrX:135052062-135052063    G   -   -

Could anybody verify that the Codons column is in fact the result we should expect? To me it seems that, although the "Allele" field reports the correct nucleotide most of the time, the Codons field reports completely different insertions/deletions than the VCF file.

Please note that the VCF file is human and was generated with Mutect2 according to the GATK Best Practices. It contains only the entries that look erroneous, taken from the 20000 entries in the full VCF file; on the rest, VEP behaves as expected. So I don't suspect an installation error to be the cause of this behaviour.

Thanks in advance, Cedric

full vep output:

 [cedric@laptop]:/Mupexi/Mupexi/data_error$ cat outputvep_inserts

> ## ENSEMBL VARIANT EFFECT PREDICTOR v90.9
> ## Output produced at 2017-12-05 15:40:58
> ## Using cache in /media/cedric/Extra_space_linu/.vep/homo_sapiens/90_GRCh38
> ## Using API version 90, DB version ?
> ## ensembl-io version 90.9a148ea
> ## ensembl-variation version 90.00c29b7
> ## ensembl-funcgen version 90.743f32b
> ## ensembl version 90.4a44397
> ## dbSNP version 150
> ## ESP version V2-SSA137
> ## gencode version GENCODE 27
> ## 1000genomes version phase3
> ## ClinVar version 201706
> ## sift version sift5.2.2
> ## regbuild version 16
> ## genebuild version 2014-07
> ## assembly version GRCh38.p10
> ## COSMIC version 81
> ## gnomAD version 170228
> ## polyphen version 2.2.2
> ## HGMD-PUBLIC version 20164
> ## Column descriptions:
> ## Uploaded_variation : Identifier of uploaded variant
> ## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
> ## Allele : The variant allele used to calculate the consequence
> ## Gene : Stable ID of affected gene
> ## Feature : Stable ID of feature
> ## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
> ## Consequence : Consequence type
> ## cDNA_position : Relative position of base pair in cDNA sequence
> ## CDS_position : Relative position of base pair in coding sequence
> ## Protein_position : Relative position of amino acid in protein
> ## Amino_acids : Reference and variant amino acids
> ## Codons : Reference and variant codon sequence
> ## Existing_variation : Identifier(s) of co-located known variants
> ## Extra column keys:
> ## IMPACT : Subjective impact classification of consequence type
> ## DISTANCE : Shortest distance from variant to transcript
> ## STRAND : Strand of the feature (1/-1)
> ## FLAGS : Transcript quality flags
> ## SYMBOL : Gene symbol (e.g. HGNC)
> ## SYMBOL_SOURCE : Source of gene symbol
> ## HGNC_ID : Stable identifer of HGNC gene symbol
> #Uploaded_variation   Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  Extra
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000376423 Transcript  frameshift_variant  578-579 543-544 181-182 -/X -/C -   IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000473581 Transcript  non_coding_transcript_exon_variant  362-363 -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000474512 Transcript  upstream_gene_variant   -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=2373;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000593475 Transcript  frameshift_variant  548-549 543-544 181-182 -/X -/C -   IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000611757 Transcript  non_coding_transcript_exon_variant  417-418 -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000617369 Transcript  downstream_gene_variant -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=2423;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000619149 Transcript  upstream_gene_variant   -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=3207;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000621664 Transcript  frameshift_variant,NMD_transcript_variant   144-145 146-147 49  P/PX    cca/ccCa    -   IMPACT=HIGH;STRAND=-1;FLAGS=cds_start_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000622231 Transcript  frameshift_variant  352-353 348-349 116-117 -/X -/C -   IMPACT=HIGH;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48991024-48991025  G   ENSG00000068400 ENST00000622599 Transcript  frameshift_variant  428-429 408-409 136-137 -/X -/C -   IMPACT=HIGH;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000376423 Transcript  intron_variant  -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000480041 Transcript  downstream_gene_variant -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=1417;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000495258 Transcript  downstream_gene_variant -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=2443;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000593475 Transcript  intron_variant  -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000611705 Transcript  downstream_gene_variant -   -   -   -   -   -   IMPACT=MODIFIER;DISTANCE=728;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000611757 Transcript  intron_variant,non_coding_transcript_variant    -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000617369 Transcript  intron_variant  -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000621664 Transcript  intron_variant,NMD_transcript_variant   -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_start_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000622231 Transcript  intron_variant  -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;FLAGS=cds_end_NF;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706
> . chrX:48996531   C   ENSG00000068400 ENST00000622599 Transcript  intron_variant  -   -   -   -   -   -   IMPACT=MODIFIER;STRAND=-1;SYMBOL=GRIPAP1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:18706

..


Tagging: Emily_Ensembl


I am not close to the computer but I bet those genes are on the minus strand.

6.4 years ago
Emily 23k

@WouterDeCoster is right. The allele column will report the forward strand, the codons column will report the strand of the transcript that is hit, which may be the forward or reverse strand. This means that around half the time, the base in the codons column will be the complement of the base in the allele column.
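
To make this concrete, here is a minimal Python sketch (not part of VEP; the helper names `reverse_complement`, `allele_on_transcript_strand` and `inserted_bases` are made up for illustration). It takes the Allele, STRAND and Codons values from two of the GRIPAP1 rows in the question and shows that the inserted base in the Codons column is simply the forward-strand allele expressed on the transcript's strand. It assumes plain A/C/G/T alleles and that the changed bases in Codons are the upper-case ones, as in the output above.

```python
# Complement table for the four DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def allele_on_transcript_strand(allele: str, strand: int) -> str:
    """Express a forward-strand allele on the transcript's strand (1 or -1)."""
    if allele == "-":  # "-" is the placeholder used for a deleted allele
        return allele
    return reverse_complement(allele) if strand == -1 else allele


def inserted_bases(codons: str) -> str:
    """Collect the upper-case (changed) bases from the variant side of a
    Codons value such as 'cca/ccCa'."""
    _, _, variant = codons.partition("/")
    return "".join(base for base in variant if base.isupper())


# (Allele, STRAND, Codons) copied from the VEP output in the question.
rows = [
    ("G", -1, "-/C"),       # ENST00000376423
    ("G", -1, "cca/ccCa"),  # ENST00000621664
]

for allele, strand, codons in rows:
    on_transcript = allele_on_transcript_strand(allele, strand)
    # The two values printed after the arrow should agree for each row.
    print(allele, strand, "->", on_transcript, inserted_bases(codons))
```

For both rows the forward-strand `G` becomes `C` on the minus strand, which matches the upper-case base in the Codons column.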
