What I need to do is filter a file produced with non-stringent Variant Effect Predictor (VEP) settings using a second file produced with more stringent VEP settings.
I've been running VEP locally with a pre-built cache, using this command on my VCFs:
perl $VEP --cache --dir $VEP_DIR --offline --input_file $input --output_file $output --sift b --polyphen b --regulatory --protein --symbol --ccds --uniprot --check_existing --gmaf --maf_1kg --maf_esp --pubmed
Everything works great and I'm super happy with the documentation. However, after running my command on all my exomes, I realized that I would most likely get many entries for each variant, depending on the different Ensembl Feature IDs.
VEP has a fix for this, which is to use the --most_severe flag when running the command. That works perfectly; however, some extra flags are disabled when using --most_severe. I would like to retain this extra information (gene name/symbol, Feature, Consequence, etc.) for the variants produced with the --most_severe flag.
perl $VEP --cache --dir $VEP_DIR --offline --input_file $input --output_file $output --regulatory --uniprot --check_existing --gmaf --maf_1kg --maf_esp --most_severe
So now I have two files for each VCF: 1) one produced without --most_severe and 2) one produced with --most_severe. The 2nd file is basically a subset of the 1st file, but with some important information missing.
In the 1st file, when there are multiple entries for a variant, most of the fields are the same except the Feature_type field and often the Extra field.
Both runs produce a tab-delimited text file with columns such as these:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
Is there a way to filter the 1st file with the 2nd file? I think I need to match on the Uploaded_variation and Consequence fields of the 1st file, because those are the fields that are unique in the line.
I think using awk to compare columns across both files won't work, because some information in the Consequence field is lost in the 2nd file.
For example, a variant's Consequence may change from:
I appreciate any help in solving this issue. Alternatively, VEP provides a filter_vep script for post-VEP annotation filtering, but I don't think it has an option that will solve my problem.
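To make the matching I have in mind concrete, here is a rough Python sketch rather than a tested solution. It keeps a line from the 1st file when its Uploaded_variation appears in the 2nd file and the two Consequence fields share at least one term. It assumes (and this is only an assumption) that Consequence holds comma-separated terms and that the --most_severe file keeps just one of them. The column positions come from the header shown above, and the function name is made up for illustration.

```python
UPLOADED = 0     # position of Uploaded_variation in the header shown above
CONSEQUENCE = 6  # position of Consequence in the header shown above


def parse(lines):
    """Yield (raw_line, fields) for every non-comment line of tab-delimited output."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        yield line, line.split("\t")


def filter_full_by_most_severe(full_lines, severe_lines):
    """Return lines of the full output that agree with the --most_severe output.

    Hypothetical helper. Assumption: a line matches when the variant ID is
    identical and the Consequence fields share at least one comma-separated term.
    """
    # Collect, per variant, the consequence term(s) reported by --most_severe.
    severe = {}
    for _, fields in parse(severe_lines):
        terms = fields[CONSEQUENCE].split(",")
        severe.setdefault(fields[UPLOADED], set()).update(terms)

    kept = []
    for line, fields in parse(full_lines):
        wanted = severe.get(fields[UPLOADED], set())
        if wanted & set(fields[CONSEQUENCE].split(",")):
            kept.append(line)
    return kept
```

Both arguments can be open file handles, since the function only iterates over lines. If several transcripts of the same variant share the most severe term, all of them are kept, so this narrows the 1st file but does not guarantee a single line per variant.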