Parsing VEP tab-delimited output (the "Extra" column)
1
1
Entering edit mode
6 weeks ago
Ram 37k

Hello,

I'm looking for an existing script (widely used, well tested, etc.) that can parse TAB delimited output from VEP. From a bunch of online searches, there seems to exist tools such as bcftools plugin split-vep, Pierre's bcfr, etc that can parse VEP's VCF output to proper tab delimited format where columns are consistent across the dataset.

However, VEP's own tab delimited format is pretty irregular, with the "Extra" column containing a variable number of KEY-VALUE annotation pairs per variant. Is there a tool that can accept a VEP TSV file and create a structured TSV where all KEYs are in their own columns and the VALUEs in that column are either the appropriate VALUE from the KEY=VALUE entry or a "." if the KEY does not exist for that entry? For example, here are two sample input lines:

#Uploaded_variation  Location    Allele  Gene             Feature            Feature_type  Consequence                                   cDNA_position  CDS_position  Protein_position  Amino_acids  Codons  Existing_variation  Extra
rs62635297           chr1:14653  T       ENSG00000223972  ENST00000456328.2  Transcript    downstream_gene_variant                       -              -             -                 -            -       rs62635297          IMPACT=MODIFIER;DISTANCE=244;STRAND=1;PICK=1;SYMBOL=DDX11L1;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:37102;BIOTYPE=processed_transcript;CANONICAL=YES;TSL=1
rs62635297           chr1:14653  T       ENSG00000227232  ENST00000488147.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       rs62635297          IMPACT=MODIFIER;STRAND=-1;SYMBOL=WASH7P;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:38034;BIOTYPE=unprocessed_pseudogene;CANONICAL=YES;HGVSc=ENST00000488147.1:n.1254-152G>A

What I'd like as output:

#Uploaded_variation  Location    Allele  Gene             Feature            Feature_type  Consequence                                   cDNA_position  CDS_position  Protein_position  Amino_acids  Codons  Existing_variation  IMPACT    DISTANCE  STRAND  SYMBOL   SYMBOL_SOURCE  PICK  HGNC_ID     BIOTYPE                 CANONICAL  TSL  HGVSc
rs62635297           chr1:14653  T       ENSG00000223972  ENST00000456328.2  Transcript    downstream_gene_variant                       -              -             -                 -            -       rs62635297          MODIFIER  244       1       DDX11L1  HGNC           1     HGNC:37102  processed_transcript    YES        1    .
rs62635297           chr1:14653  T       ENSG00000227232  ENST00000488147.1  Transcript    intron_variant,non_coding_transcript_variant  -              -             -                 -            -       rs62635297          MODIFIER  .         -1      WASH7P   HGNC           .     HGNC:38034  unprocessed_pseudogene  YES        .    ENST00000488147.1:n.1254-152G>A

I could use some R to write a script myself, but I thought I'd check if there's something already out there that the community uses.

Thank you for your time!

tab vep • 319 views
ADD COMMENT
0
Entering edit mode

Not what you were asking but Pierre had this in a previous thread: Is there a better tool for visualizing your variants after annotation than Excel?

ADD REPLY
1
Entering edit mode
6 weeks ago

using awk and datamash

$ awk -F '\t' '(NR==1){split($0,header);next;} {K=sprintf("%s:%s:%s:%s",$2,$1,$3,$5);for(i=1;i<=13;i++) {printf("%s\t%s\t%s\n",K,header[i],$i);}N=split($14,A,/[;]/);for(i=1;i<=N;i++) {M=split(A[i],B,/[=]/);printf("%s\t%s\t%s\n",K,B[1],(M==2?B[2]:"."));}}' < input.txt  | sort -t $'\t' -k1,1 -k2,2 | datamash crosstab 1,2 unique 3  
#Uploaded_variation                        Allele      Amino_acids  BIOTYPE  CANONICAL               CDS_position  Codons  Consequence  DISTANCE                                      Existing_variation  Feature     Feature_type       Gene        HGNC_ID          HGVSc       IMPACT                           Location  PICK        Protein_position  STRAND  SYMBOL  SYMBOL_SOURCE  TSL   cDNA_position
chr1:14653:rs62635297:T:ENST00000456328.2  rs62635297  T            -        processed_transcript    YES           -       -            downstream_gene_variant                       244                 rs62635297  ENST00000456328.2  Transcript  ENSG00000223972  HGNC:37102  N/A                              MODIFIER  chr1:14653  1                 -       1       DDX11L1        HGNC  1              -
chr1:14653:rs62635297:T:ENST00000488147.1  rs62635297  T            -        unprocessed_pseudogene  YES           -       -            intron_variant,non_coding_transcript_variant  N/A                 rs62635297  ENST00000488147.1  Transcript  ENSG00000227232  HGNC:38034  ENST00000488147.1:n.1254-152G>A  MODIFIER  chr1:14653  N/A               -       -1      WASH7P         HGNC  N/A            -
ADD COMMENT
1
Entering edit mode

This is some impressive awk plus I've never ventured into datamash - might be a good time to start but it's still a custom one(?)-liner and I'd rather write a documented step-by-step R script than the custom-built awk.

I'm honestly considering just re-running vep with --vcf and using bcftools +split-vep just because of how standard bcftools is. Of course, I had to hack together the header there.

ADD REPLY

Login before adding your answer.

Traffic: 1131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6