Question

Parse VEP VCF with VariantAnnotation

3

4.2 years ago

jeni ▴ 90

Hi!

I am trying to obtain a data frame from a VCF read with the VariantAnnotation package. This VCF is the output of VEP (Variant Effect Predictor), so the columns corresponding to its annotations are not properly separated and I cannot parse them into separate columns of a data frame.

This is the header line for the VEP annotation INFO field:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON">

The headers for the rest of the INFO fields follow this pattern:

##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">

As you can see, CSQ is the field that holds the VEP annotation, and it packs many different parameters separated by "|" (Allele|Consequence|IMPACT, etc.). This means that every piece of data given by VEP is written into the same field as a single string with "|" separating the values, instead of being written as separate INFO fields.

Is there any way to turn these field names into column names, splitting each value inside CSQ on "|", while keeping the parsing of the rest of the INFO fields (CONTQ, DP, ECNT as columns with their corresponding values)?
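For illustration, the split described above can be sketched in base R. This is only a minimal sketch under stated assumptions: the header is shortened to its first five fields, the `info` data frame is made-up toy data standing in for the INFO columns of a real VCF, and `split_annotation` is a hypothetical helper, not part of VariantAnnotation. With a real file the strings would come from the CSQ INFO column of the VCF object instead.

```r
# Shortened CSQ header (first five fields only), for illustration.
csq_header <- paste0(
  '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations ',
  'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene">'
)

# The field names are whatever follows "Format: " up to the closing quote.
csq_format <- sub('.*Format: ([^"]+)".*', "\\1", csq_header)
csq_fields <- strsplit(csq_format, "|", fixed = TRUE)[[1]]

# Toy INFO values for two variants (assumed data, not taken from a real VCF).
# One CSQ value can hold several annotations separated by commas.
info <- data.frame(
  DP   = c(35L, 12L),
  ECNT = c(1L, 2L),
  CSQ  = c(
    paste0("C|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961,",
           "C|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583"),
    "A|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583"
  ),
  stringsAsFactors = FALSE
)

# One list element per variant, each a character vector of annotations.
per_variant <- strsplit(info$CSQ, ",", fixed = TRUE)

# Hypothetical helper: split one annotation on "|" and pad with NA,
# because strsplit() drops trailing empty fields.
split_annotation <- function(x) {
  parts <- strsplit(x, "|", fixed = TRUE)[[1]]
  length(parts) <- length(csq_fields)
  parts
}

# Build a character matrix with one row per annotation.
annotations <- unlist(per_variant)
csq_matrix <- t(vapply(annotations, split_annotation,
                       character(length(csq_fields)), USE.NAMES = FALSE))
csq_df <- as.data.frame(csq_matrix, stringsAsFactors = FALSE)
names(csq_df) <- csq_fields

# Repeat each variant's other INFO values once per annotation.
variant_index <- rep(seq_along(per_variant), lengths(per_variant))
result <- cbind(info[variant_index, c("DP", "ECNT")], csq_df)
rownames(result) <- NULL
print(result)
```

The result has one row per annotation rather than one row per variant, so DP and ECNT are repeated for variants that carry several annotations. Whether that layout is the right one depends on what you want to do downstream.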

snp R • 3.4k views

2

Shameless self-promotion: I recently wrote an R extension that parses VEP annotations: https://github.com/lindenb/rbcf

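# load rbcf
library(rbcf)
# open a VCF; 'filename' is the path to the VEP-annotated file
fp <- bcf.open(filename, FALSE)
vc <- NULL
predictions <- NULL
# read variants until the first one that carries a CSQ attribute
while (!is.null(vc <- bcf.next(fp))) {
    if (variant.has.attribute(vc, "CSQ")) break
}
# parse the VEP predictions of that single variant
if (!is.null(vc)) {
    predictions <- variant.vep(vc)
}
bcf.close(fp)
predictions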



   Allele             Consequence   IMPACT   SYMBOL            Gene
1       C downstream_gene_variant MODIFIER   KLHL17 ENSG00000187961
2       A downstream_gene_variant MODIFIER   KLHL17 ENSG00000187961
3       C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
4       A downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
5       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
6       A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
7       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
8       A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
9       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
10      A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
11      C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
12      A downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
13      C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
(...)

4.2 years ago by Pierre Lindenbaum 163k

0

For which R versions is 'rbcf' available? I cannot install it in R 3.6.1. I have read the GitHub page, but I cannot install it as described there because I am working on my local PC, which runs Windows. Is it possible to install it directly from R?

4.2 years ago by jeni ▴ 90

0

Hey, nice tool, this is exactly what I am looking for :)

Testing went fine, except that I only ever get one single variant in "predictions". Any idea why? And could you also add how best to combine this table with the genotypes?

3.4 years ago by jan.haas • 0

0

Yes, I am also getting the same issue: the tool only seems to pull the first variant. How do you get it to work for all variants?

3.4 years ago by alexandriapinto1 • 0

0

as you can see, each element is separated by "|" instead of semicolons ";"

I don't see it. I see the CSQ tag, which is a String. The three other tags each define a number.