Hey everyone,
I have a question about VEP results. Why do some variants have multiple values for fields such as consequence, gene symbol, Ensembl ID, ...? And why does the number of values increase when I have more samples?
Shouldn't the gene be specified by the position on the genome?
Here is an example (around 20 samples, after bcftools +split-vep):
CHROM POS REF ALT ID Consequence SYMBOL Existing_variation VARIANT_CLASS Gene
17 81645307 G A . intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,. rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.
The same variant with around 500 samples (including the above 17):
CHROM POS REF ALT ID Consequence SYMBOL Existing_variation VARIANT_CLASS Gene
17 81645307 G A . intron_variant&non_coding_transcript_variant,intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant,regulatory_region_variant NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,. rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617,rs6565617 SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV,SNV ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.,.
One explanation I could think of for the multiple entries is that a variant can sit in a coding region of one gene and in a regulatory region of another. However, this does not explain why the number of entries differs between sample sizes. Can someone explain this to me?
My VCFs are called per sample with HaplotypeCaller and then merged; the merged VCF is then annotated.
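To make the comparison between the two rows concrete, here is a minimal Python sketch that zips the comma-separated Consequence, SYMBOL and Gene columns back into one record per annotated feature and counts them. The strings are copied from the two rows above; the helper name `entries` is just something made up for this illustration, not part of VEP or bcftools.

```python
from collections import Counter

def entries(consequence, symbol, gene):
    """Zip comma-separated columns into (consequence, symbol, gene) records."""
    cols = [c.split(",") for c in (consequence, symbol, gene)]
    # Every column must describe the same number of features.
    assert len({len(c) for c in cols}) == 1
    return list(zip(*cols))

# Row from the run with around 20 samples.
small = entries(
    "intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant",
    "NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.",
    "ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.",
)

# Row from the run with around 500 samples.
large = entries(
    "intron_variant&non_coding_transcript_variant,intron_variant&non_coding_transcript_variant,non_coding_transcript_exon_variant,non_coding_transcript_exon_variant,missense_variant,missense_variant,missense_variant,missense_variant,missense_variant&NMD_transcript_variant,missense_variant&NMD_transcript_variant,regulatory_region_variant,regulatory_region_variant",
    "NPLOC4,NPLOC4,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,TSPAN10,.,.",
    "ENSG00000182446,ENSG00000182446,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,ENSG00000182612,.,.",
)

small_counts = Counter(small)
large_counts = Counter(large)

# True if both rows list the same features and each appears twice as often in the large run.
doubled = set(small_counts) == set(large_counts) and all(
    large_counts[rec] == 2 * n for rec, n in small_counts.items()
)

print(len(small), len(large), doubled)  # prints: 6 12 True
```

So for this variant the set of annotated features is identical in both runs; each record just appears exactly twice as often in the ~500-sample output as in the ~20-sample output.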