Ensembl VEP filtering most_severe keeping Symbol, Feature,Consequence etc.
1
2
Entering edit mode
7.2 years ago
t2 ▴ 60

Hi all,

What I need to do is filter a file produced using non-stringent Variant Effect Predictor (VEP) settings with one that was produced with more stringent VEP settings.

I've been running VEP locally  using the cache option with a pre-built cache with this command on my vcfs:

perl $VEP --cache --dir$VEP_DIR --offline --input_file $input --output_file$output --sift b --polyphen b --regulatory --protein --symbol --ccds --uniprot --check_existing --gmaf --maf_1kg --maf_esp --pubmed

Everything works great and I'm super happy with the documentation. However, I realized after I had run my command on all my exomes that I would most likely get many entries for each particular variant depending on different Ensembl Feature IDs.

VEP has a fix for this, which is to use the --most_severe flag when running the command. That works perfectly, however, some extra flags are disabled when using the --most_severe flag. I would like to retain this extra information (like gene name/symbol Feature,Consequence, etc.) for the variants produced with the --most_severe flag.

perl $VEP --cache --dir$VEP_DIR --offline --input_file $input --output_file$output --regulatory --uniprot --check_existing --gmaf --maf_1kg --maf_esp --most_severe

So now I have two files for each vcf; 1) disabled --most_severe and 2) --most_severe. The 2nd file is basically a subset of the 1st file but with some important missing information.

In the 1st file when there are multiple entries for a variant, most of the fields are the same except the Feature_type field and often the Extra field.

Both produce a tab delimited text file with columns such as this:

#Uploaded_variation    Location    Allele    Gene    Feature    Feature_type    Consequence    cDNA_position    CDS_position    Protein_position    Amino_acids    Codons    Existing_variation    Extra

Is there a way to filter the 1st file with the 2nd file. I think I need to use fields, Uploaded_variation and Consequence for matching the 1st file because those are the fields that are unique in the line.

I think using awk to search for columns in both files won't work because there is some information lost in the Consequence field in the 2nd file

For example a variant Consequence may change from:

non_coding_transcript_exon_variant,non_coding_transcript_variant

to

non_coding_transcript_exon_variant

I appreciate any help in solving this issue. Alternatively there is a filter_vep script provided by VEP for post-VEP annotation filtering but I don't think there is an option here that will solve my problem.

Thanks,

Tesa

next-gen sequencing ensembl VEP • 5.0k views
0
Entering edit mode

7
Entering edit mode
7.2 years ago
EnsemblWill ▴ 560

You will encounter problems whichever way you try to combine these two files I'm afraid. For example, let's say your small (--most_severe on) file has a line with a consequence of missense_variant. Then in your large file, there are three corresponding lines of output for that variant with a consequence of missense_variant - how do you decide which to choose?

Also, the consequence type picked by --most_severe may be calculated relative to an Ensembl feature that does not have reliable biological evidence to support it - do you still want to choose this one over any others?

Is it practical you for you re-run your analyses? There are a couple of other options that you might find useful:

a) --pick : this chooses one line of consequence data (with all the fields retained) for each variant. It uses the following criteria to pick one:

1) is the transcript canonical
2) is the transcript biotype protein_coding
3) consequence rank
4) transcript length

In the forthcoming version of VEP you will be able to customise this order.

http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_pick

--pick_allele chooses one line per variant allele (i.e. this will come into effect when the input variant has more than one alternate allele)

b) --flag_pick : like --pick but just adds a flag to the line chosen by the same rules

c) --per_gene : like --pick but chooses one line per variant/gene combination

As a footnote, we always try to discourage people from using these summary flags if we can - there will always be cases where valuable data gets lost, and you are relying on an arbitrary and subjective algorithm to perform that summarising. The logic of this algorithm will always be wrong for some use case no matter how we code it. Thus by keeping all the data you are ensuring you don't miss anything.

3
Entering edit mode

Hi EnsemblWill,

Thanks so much for your detailed comments. Indeed it is feasible for me to re-run this portion of the analysis. I saw pick but I wasn't sure how I should implement it. You give me some good pointers to try to figure it out for my situation.

The new(ish) VEP is awesome and I am grateful that the good folks at Ensembl made it possible to use a command line version that is so well documented.

Cheers,

Tesa

0
Entering edit mode

Hi EnsemblWill,

Do you know how the magic works behind the pick option? Because I have some conserns about it.

As use case, a well documented mutation 11_66293652_T/G (ENSG00000174483/BBS1:M390R).

When using :

no filters : there are 21 results mapping on 3 ensembel gene  entries (ENSG00000174483: BBS1, ENSG00000174165: ZDHHC24 and ENSG00000256349: CTD-3074O7 which is a Vega definition of BBS1 but not the canonical when refering to CCDS)

--per_gene: only 3 results as expected.

--pick: I have only 1 result left but the chosen transcript is ENSG00000256349: CTD-3074O7!

--pick_order  canonical, tsl, biotype, rank, length: I get back the original 21 lines as when no filters were applied!

Am I missing something? Thanks for any help!

Best,

Kirsley