What I need to do is filter a file produced using non-stringent Variant Effect Predictor (VEP) settings with one that was produced with more stringent VEP settings.
I've been running VEP locally with the cache option and a pre-built cache, using this command on my VCFs:
perl $VEP \
    --cache \
    --dir $VEP_DIR \
    --offline \
    --input_file $input \
    --output_file $output \
    --sift b \
    --polyphen b \
    --regulatory \
    --protein \
    --symbol \
    --ccds \
    --uniprot \
    --check_existing \
    --gmaf \
    --maf_1kg \
    --maf_esp \
    --pubmed
Everything works great and I'm super happy with the documentation. However, I realized after I had run my command on all my exomes that I would most likely get many entries for each particular variant depending on different Ensembl Feature IDs.
VEP has a fix for this, which is to use the --most_severe flag when running the command. That works perfectly; however, some extra flags are disabled when using the --most_severe flag. I would like to retain this extra information (like gene name/symbol, Feature, Consequence, etc.) for the variants produced with the --most_severe command:
perl $VEP \
    --cache \
    --dir $VEP_DIR \
    --offline \
    --input_file $input \
    --output_file $output \
    --regulatory \
    --uniprot \
    --check_existing \
    --gmaf \
    --maf_1kg \
    --maf_esp \
    --most_severe
So now I have two files for each VCF: 1) one produced without --most_severe and 2) one produced with --most_severe. The 2nd file is basically a subset of the 1st file, but with some important information missing.
In the 1st file, when there are multiple entries for a variant, most of the fields are the same except the Feature_type field and often the
Both runs produce a tab-delimited text file with columns such as this:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
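To make the file layout concrete, here is a rough Python sketch of how such a file could be read. It assumes the columns are separated by tabs, that any meta lines start with ## and come before the header, and that the header is the line starting with a single #. The function name read_vep_rows and the toy rows are made up for illustration; they are not part of VEP.

```python
import io


def read_vep_rows(handle):
    """Yield one dict per data row of a tab-delimited VEP-style output file."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta lines
        if line.startswith("#"):
            # "#Uploaded_variation" -> "Uploaded_variation"
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the header line")
        yield dict(zip(header, line.split("\t")))


# Toy input, invented for illustration (not real VEP output).
example = io.StringIO(
    "## meta line\n"
    "#Uploaded_variation\tLocation\tAllele\tGene\n"
    "toy_var_1\t1:1000\tA\ttoy_gene_1\n"
)
for row in read_vep_rows(example):
    print(row["Uploaded_variation"], row["Gene"])
```

Each row comes back as a dict keyed by column name, so fields can be compared by name rather than by column position.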
Is there a way to filter the 1st file with the 2nd file? I think I need to use fields
Consequence for matching the 1st file because those are the fields that are unique in the line.
I think using awk to search for columns in both files won't work, because some information is lost in the Consequence field in the 2nd file.
For example, a variant's Consequence may change from:
I appreciate any help in solving this issue. Alternatively, there is a filter_vep script provided by VEP for post-VEP annotation filtering, but I don't think it has an option that will solve my problem.
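For what it's worth, here is a minimal Python sketch of the kind of matching I have in mind, in case it helps frame an answer. It rests on two assumptions that may not hold: that Uploaded_variation, Location and Allele together identify a variant in both files, and that the Consequence in the 1st file is a comma-separated list that includes the term reported by --most_severe. The function name filter_by_most_severe and the toy rows are hypothetical.

```python
def filter_by_most_severe(full_rows, severe_rows):
    """Keep rows of the full output whose Consequence terms include the term
    reported for the same variant in the --most_severe output.

    Both arguments are iterables of dicts keyed by column name.
    """
    key_fields = ("Uploaded_variation", "Location", "Allele")  # assumed variant key

    # Map each variant to the consequence term(s) seen in the --most_severe file.
    severe = {}
    for row in severe_rows:
        key = tuple(row[f] for f in key_fields)
        severe.setdefault(key, set()).update(row["Consequence"].split(","))

    kept = []
    for row in full_rows:
        key = tuple(row[f] for f in key_fields)
        terms = set(row["Consequence"].split(","))
        # Match on set overlap, since the full file may list several terms.
        if terms & severe.get(key, set()):
            kept.append(row)
    return kept


# Toy rows, invented for illustration (not real VEP output).
full = [
    {"Uploaded_variation": "toy_var_1", "Location": "1:1000", "Allele": "A",
     "Feature": "toy_tx_1", "Consequence": "missense_variant,splice_region_variant"},
    {"Uploaded_variation": "toy_var_1", "Location": "1:1000", "Allele": "A",
     "Feature": "toy_tx_2", "Consequence": "intron_variant"},
]
most_severe = [
    {"Uploaded_variation": "toy_var_1", "Location": "1:1000", "Allele": "A",
     "Consequence": "missense_variant"},
]
for row in filter_by_most_severe(full, most_severe):
    print(row["Feature"], row["Consequence"])
```

Note that this keeps every row that matches, so a variant can still have several entries if more than one Feature carries the most severe consequence.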