Why multiple SYMBOLS, Consequences... for Variant Effect Predictor (VEP)
8 weeks ago
gernophil ▴ 10

Hey everyone,

I have a question about VEP results. Why do some variants get multiple values for fields like Consequence, gene symbol, Ensembl ID...? And why does the number of values grow when I have more samples?

Shouldn't the gene be specified by the position on the genome?

Here is an example (around 20 samples, after bcftools +split-vep):

CHROM   POS REF ALT ID  Consequence SYMBOL  Existing_variation  VARIANT_CLASS   Gene
17  81645307    G   A   .   intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.    rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.


The same variant with around 500 samples (including the above 17):

CHROM   POS REF ALT ID  Consequence SYMBOL  Existing_variation  VARIANT_CLASS   Gene
17  81645307    G   A   .   intron_variant&non_coding_transcript_variant,intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant,regulatory_region_variant NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,.   rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.,.


One explanation I could think of for the multiple entries is that a variant can sit in a coding region of one gene and in a regulatory region of another. However, this does not explain why the number of entries differs with the number of samples. Can someone explain this to me? My VCFs are called per sample with HaplotypeCaller and then merged, and the merged VCF is then annotated.

8 weeks ago
barslmn ★ 1.2k

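Ensembl VEP reports one annotation for every combination of allele, gene and transcript (or other feature, such as a regulatory region) that a variant overlaps, so a single variant usually gets several entries. You can flag or select a single annotation per variant with the pick options (for example --pick or --flag_pick).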
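
To make the "one entry per feature" point concrete, here is a minimal Python sketch that zips the comma-separated columns from the question back into one record per annotation and compares the two runs. The function name `split_annotations` is made up for illustration. With the values posted above, every (Consequence, SYMBOL, Gene) combination in the ~500-sample output appears exactly twice as often as in the ~20-sample output. That would be consistent with the same features being reported for an additional allele at this site in the larger merge, but the Allele field is not shown in the question, so treat that reading as an assumption.

```python
from collections import Counter


def split_annotations(consequence, symbol, gene):
    """Zip comma-separated VEP columns into one tuple per annotated feature.

    Hypothetical helper: entry i of every column is assumed to describe
    the same allele/feature pair, as in the split-vep output above.
    """
    columns = [col.split(",") for col in (consequence, symbol, gene)]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("columns have different numbers of entries")
    return list(zip(*columns))


# Values copied from the ~20-sample output in the question.
small = split_annotations(
    "intron_variant&non_coding_transcript_variant,"
    "non_coding_transcript_exon_variant,"
    "missense_variant,missense_variant,"
    "missense_variant&NMD_transcript_variant,"
    "regulatory_region_variant",
    "NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.",
    "ENSG00000182446,ENSG00000182612,ENSG00000182612,"
    "ENSG00000182612,ENSG00000182612,.",
)

# Values copied from the ~500-sample output in the question.
large = split_annotations(
    "intron_variant&non_coding_transcript_variant,"
    "intron_variant&non_coding_transcript_variant,"
    "non_coding_transcript_exon_variant,"
    "non_coding_transcript_exon_variant,"
    "missense_variant,missense_variant,missense_variant,missense_variant,"
    "missense_variant&NMD_transcript_variant,"
    "missense_variant&NMD_transcript_variant,"
    "regulatory_region_variant,regulatory_region_variant",
    "NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,"
    "TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,.",
    "ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,"
    "ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,"
    "ENSG00000182612,ENSG00000182612,.,.",
)

small_counts = Counter(small)
large_counts = Counter(large)

# Print how often each distinct annotation occurs in each run.
for annotation in small_counts:
    print(annotation, small_counts[annotation], large_counts[annotation])
```

The duplicated TSPAN10 missense entries within a single run are most likely different transcripts of the same gene; adding the Feature and Allele fields to the split-vep output would show which transcript and which allele each entry belongs to.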

https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick

If you add a pick flag (for example --flag_pick), you can explode this line into one row per annotation with the -d option of bcftools +split-vep, and then select the annotations you're interested in with bcftools filtering expressions such as -i 'PICK~"1"'.
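
As a rough illustration of what exploding and then filtering on PICK does, here is a small Python sketch of the logic only; it does not call VEP or bcftools. The helper `explode` is hypothetical, and the PICK values are invented for the example: which annotation VEP actually flags depends on its own ranking criteria, so the row chosen here is an assumption.

```python
def explode(row, csq_fields):
    """Turn one row with comma-joined annotation columns into one row per annotation.

    Hypothetical helper mirroring the idea behind `-d`; not bcftools code.
    """
    lists = {field: row[field].split(",") for field in csq_fields}
    n = len(next(iter(lists.values())))
    if any(len(values) != n for values in lists.values()):
        raise ValueError("annotation columns have different lengths")
    shared = {k: v for k, v in row.items() if k not in csq_fields}
    return [
        {**shared, **{field: lists[field][i] for field in csq_fields}}
        for i in range(n)
    ]


# Site and annotation values are from the question; PICK is INVENTED here.
row = {
    "CHROM": "17",
    "POS": "81645307",
    "REF": "G",
    "ALT": "A",
    "SYMBOL": "NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.",
    "Consequence": (
        "intron_variant&non_coding_transcript_variant,"
        "non_coding_transcript_exon_variant,"
        "missense_variant,missense_variant,"
        "missense_variant&NMD_transcript_variant,"
        "regulatory_region_variant"
    ),
    "PICK": ".,.,1,.,.,.",  # hypothetical: pretend the third annotation was flagged
}

rows = explode(row, ["SYMBOL", "Consequence", "PICK"])

# Keep only the flagged annotation, analogous to filtering on PICK being 1.
picked = [r for r in rows if r["PICK"] == "1"]

for r in picked:
    print(r["CHROM"], r["POS"], r["SYMBOL"], r["Consequence"])
```

Flagging rather than discarding keeps all annotations in the file, so the full list stays available if a different selection is needed later.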