Parse VEP VCF with VariantAnnotation
4.0 years ago
jeni ▴ 90

Hi!

I am trying to obtain a data frame from a VCF read with the VariantAnnotation package. This VCF is the output of VEP (Variant Effect Predictor), so the columns corresponding to its annotations are not properly separated and I cannot parse them into different columns of a data frame.

This is the header line for the VEP annotation INFO field:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON">

while the headers for the rest of the INFO fields follow this pattern:

##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">

As you can see, CSQ is the field that holds the VEP annotation, and it packs many different values separated by "|" (Allele|Consequence|IMPACT, etc.). This means that every piece of data produced by VEP is written into the same INFO field as one string with '|' as the separator, instead of being written as separate INFO fields.

Is there any way to turn these field names into column names, splitting each value inside CSQ on "|", while keeping the parsing of the rest of the INFO fields (CONTQ, DP and ECNT as columns with their corresponding values)?
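
One way to approach the splitting described above, sketched in base R only. The header description and the CSQ strings below are toy inputs (an assumption: shortened to five fields and not taken from the real file), and the names `csq_description`, `csq_values`, `csq_fields` and `csq_df` are made up for illustration.

```r
# Toy inputs (assumptions): a shortened CSQ description and two CSQ strings.
csq_description <- "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene"
# One string per variant; several annotations for a variant are comma-separated.
csq_values <- c(
  "C|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961,C|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583",
  "A|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642"
)

# Field names are whatever follows "Format: " in the header description.
csq_fields <- strsplit(sub(".*Format: ", "", csq_description), "|", fixed = TRUE)[[1]]

# Split each variant's string into annotations and remember the variant index.
per_variant <- strsplit(csq_values, ",", fixed = TRUE)
variant_idx <- rep(seq_along(per_variant), lengths(per_variant))

# Split every annotation on "|" and pad with NA up to the expected field count,
# because strsplit() drops a trailing empty field.
parts <- lapply(strsplit(unlist(per_variant), "|", fixed = TRUE), function(x) {
  length(x) <- length(csq_fields)
  x
})

# One row per annotation, one column per CSQ field, plus the variant index.
csq_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(csq_df) <- csq_fields
csq_df <- cbind(variant = variant_idx, csq_df)
csq_df
```

In a VariantAnnotation workflow the two inputs would presumably come from `info(header(vcf))["CSQ", "Description"]` and `info(vcf)$CSQ`. Because CSQ is declared with Number=., it is typically returned as a list with one element per annotation, in which case the comma-splitting step is already done. The `variant` column can then be used to repeat the rows of the other INFO fields (CONTQ, DP, ECNT) next to each annotation. This has not been tested against the file in the question.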

snp R • 3.3k views

Shameless self-promotion: I recently wrote an R extension that parses VEP annotations. https://github.com/lindenb/rbcf

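# load rbcf
library(rbcf)
# open a VCF; `filename` is the path to the VEP-annotated file
fp <- bcf.open(filename, FALSE)
vc <- NULL
# advance to the first variant that carries a CSQ attribute, then stop
while (!is.null(vc <- bcf.next(fp))) {
    if (variant.has.attribute(vc, "CSQ")) break
}
# parse the VEP predictions of that single variant into a table
if (!is.null(vc)) {
    predictions <- variant.vep(vc)
}
bcf.close(fp)
predictions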



   Allele             Consequence   IMPACT   SYMBOL            Gene
1       C downstream_gene_variant MODIFIER   KLHL17 ENSG00000187961
2       A downstream_gene_variant MODIFIER   KLHL17 ENSG00000187961
3       C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
4       A downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
5       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
6       A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
7       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
8       A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
9       C          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
10      A          intron_variant MODIFIER  PLEKHN1 ENSG00000187583
11      C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
12      A downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
13      C downstream_gene_variant MODIFIER C1orf170 ENSG00000187642
(...)

For which R versions is 'rbcf' available? I cannot install it in R 3.6.1. I have read the GitHub page, but I cannot install it the way it is described there because I am working on my local PC, which runs Windows. Is it possible to install it directly from R?


Hey, nice tool, this is exactly what I am looking for :)

Testing went fine, except that I always get only one variant in "predictions". Any idea why? Could you also add how best to combine this table with the genotypes?


Yes, I am getting the same issue: the tool only seems to pull the first variant. How do you get it to work for all variants?
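
A likely explanation: the example in the answer breaks out of the loop at the first variant that has a CSQ attribute, so `predictions` only ever holds that one variant. A minimal sketch that collects every variant instead, using only the rbcf calls already shown in the answer. The list `all_predictions` is a name made up here, `filename` is assumed to point to the VEP-annotated VCF, and this has not been run against real data.

```r
library(rbcf)

# `filename` is assumed to be the path to the VEP-annotated VCF.
fp <- bcf.open(filename, FALSE)

# Collect one table of predictions per variant (hypothetical helper list).
all_predictions <- list()
while (!is.null(vc <- bcf.next(fp))) {
    # Keep going through the whole file instead of breaking at the first hit.
    if (variant.has.attribute(vc, "CSQ")) {
        all_predictions[[length(all_predictions) + 1]] <- variant.vep(vc)
    }
}
bcf.close(fp)

# Stack the per-variant tables; assumes they all share the same columns.
predictions <- do.call(rbind, all_predictions)
```

The stacked table does not record which variant each row came from. Adding a position column inside the loop would fix that, but the accessor to use for it is not shown in this thread.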


as you can see, each element is separated by "|" instead of semicolons ";"

I don't see it. I see the CSQ tag, which is a String. The three other tags each define a number.


Okay, I've edited the question to explain it better.
