Question: Need suggestions on subsetting an ANNOVAR-annotated VCF file
12 months ago, Riri wrote:

Hello Everyone,

I am a fairly novice bioinformatician. I need some help and suggestions on tools that I can use to subset my annotated VCF file using specific criteria. The criteria are: (i) coding and splice-site variants; (ii) CADD > 10 for nonsynonymous SNPs; (iii) AA change: nonsense; (iv) absent from the ExAC database; (v) frequency in KAVIAR: 6.4E-06. I am working on my own Python code because I couldn't find any tool that serves my need. So far I have tried GATK's VariantsToTable and VariantFiltration, bcftools, and vcftools. I would like to know whether there are any tools that can parse the INFO column of a VCF file and filter/subset the file based on selected criteria. Thank you in advance for your help!

12 months ago, manuel.belmadani wrote:

Using ANNOVAR, you should get both a VCF and a tabular .txt version of the results. While the .vcf output has an INFO field that requires parsing, the tabular .txt file should already have the information you want in columns. After that, it should be easy to filter by column using Python, any other standard programming language, or shell tools like awk.
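As a rough illustration of the column-based filtering, here is a minimal Python sketch. The column names (Func.refGene, ExonicFunc.refGene, ExAC_ALL, CADD_phred, Kaviar_AF) are assumptions that depend on which databases were used for annotation, so check them against the header of your own file. The way the five criteria are combined is also just one possible reading: ExAC must be missing, the KAVIAR frequency is treated as an upper bound, nonsynonymous SNVs need CADD > 10, and stop-gain and splice-site variants are kept.

```python
# Assumed column names; adjust them to the header of your own table.
FUNC_COL = "Func.refGene"
EXONIC_COL = "ExonicFunc.refGene"
EXAC_COL = "ExAC_ALL"
CADD_COL = "CADD_phred"
KAVIAR_COL = "Kaviar_AF"
NEEDED = (FUNC_COL, EXONIC_COL, EXAC_COL, CADD_COL, KAVIAR_COL)
KAVIAR_MAX = 6.4e-06  # assumed to be an upper bound on the KAVIAR frequency


def to_float(value):
    """Return float(value), or None when the value is missing (e.g. '.')."""
    try:
        return float(value)
    except ValueError:
        return None


def keep_row(row):
    """Apply one reading of the five criteria to a dict of column values."""
    regions = set(row[FUNC_COL].split(";"))
    # (i) coding or splice-site variants only
    if not regions & {"exonic", "splicing"}:
        return False
    # (iv) absent from ExAC: no numeric frequency reported
    if to_float(row[EXAC_COL]) is not None:
        return False
    # (v) KAVIAR frequency missing or at most the threshold
    kaviar = to_float(row[KAVIAR_COL])
    if kaviar is not None and kaviar > KAVIAR_MAX:
        return False
    exonic_func = row[EXONIC_COL]
    # (ii) nonsynonymous SNVs must have CADD > 10
    if exonic_func == "nonsynonymous SNV":
        cadd = to_float(row[CADD_COL])
        return cadd is not None and cadd > 10
    # (iii) nonsense (stop-gain) variants are kept
    if exonic_func.startswith("stopgain"):
        return True
    # anything else is kept only if it is a splice-site variant
    return "splicing" in regions


def filter_table(in_handle, out_handle):
    """Copy the header and every passing row; return the number of rows kept."""
    header = in_handle.readline().rstrip("\n").split("\t")
    col = {name: i for i, name in enumerate(header)}
    out_handle.write("\t".join(header) + "\n")
    kept = 0
    for line in in_handle:
        fields = line.rstrip("\n").split("\t")
        row = {name: fields[col[name]] for name in NEEDED}
        if keep_row(row):
            out_handle.write(line)  # write the original line unchanged
            kept += 1
    return kept
```

To run it on a file, something like `with open("annotated.txt") as src, open("filtered.txt", "w") as dst: filter_table(src, dst)` should work, where both file names are placeholders. The file is read line by line, so memory use stays small even for large tables.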


Hi Manuel, Thank you for your response. I tried using tabular.txt to filter, but it is missing my Sample IDs that are present in the corresponding VCF file, so it is not very helpful. The VCF file I have is around 95 GB and it has 1048 samples. Is it normal for tabular.txt to not have Sample IDs?

Reply written 12 months ago by Riri

I've typically only used ANNOVAR with single-sample VCFs, but it looks possible if your VCF file is version 4.0, using -format vcf4 and -allsample:

By default "vcf4" will only process the first sample, and will only print out mutations that exist in the first sample. So if you have a multi-sample VCF file, then usually only a subset of lines will exist in the output file. The -format vcf4 can be combined with -allsample argument, which will print out a separate output file for each sample in the VCF4 file (again by default, only the first sample in the VCF4 file will be processed). More importantly, if you use -format vcf4 -allsample -withfreq, then all input lines from VCF will be kept in output lines, yet an allele frequency measure is included in each line calculating the frequency of each variant among all the samples in the VCF file.
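If the tabular output is not usable because it drops the sample columns, another option is to filter the annotated VCF itself by parsing the INFO column line by line. The following is only a minimal Python sketch under some assumptions: the INFO keys ExonicFunc.refGene and ExAC_ALL, and the `\x20` / `\x3b` escapes for spaces and semicolons, are what I would expect in an ANNOVAR-annotated VCF, but they should be checked against your own file. The predicate `rare_nonsense` is a hypothetical example covering only two of the criteria; the remaining ones would be added in the same way.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def parse_info(info):
    """Turn 'A=1;B=2;FLAG' into {'A': '1', 'B': '2', 'FLAG': True}."""
    parsed = {}
    if info == ".":
        return parsed
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if sep:
            # Decode escaped semicolons and spaces (assumed escape convention).
            parsed[key] = value.replace("\\x3b", ";").replace("\\x20", " ")
        else:
            parsed[key] = True  # flag without a value
    return parsed


def filter_vcf(lines, keep):
    """Yield header lines unchanged, plus data lines whose INFO passes keep()."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        # Split only up to INFO (column 8) so the sample columns are never split.
        fields = line.split("\t", 8)
        if keep(parse_info(fields[7].rstrip("\n"))):
            yield line  # the full original line, sample columns included


def rare_nonsense(info):
    """Hypothetical predicate: stop-gain variants with no ExAC frequency."""
    exonic_func = info.get("ExonicFunc.refGene", ".")
    return exonic_func == "stopgain" and info.get("ExAC_ALL", ".") == "."


def run(path, out_handle):
    """Stream the VCF at `path` through the filter into `out_handle`."""
    with open_vcf(path) as handle:
        out_handle.writelines(filter_vcf(handle, rare_nonsense))
```

Calling `run("annotated.vcf.gz", sys.stdout)` (the path is a placeholder) would print the header and the passing records. Because each line is processed as it is read and written back untouched, the sample IDs and genotypes are preserved, and a very large multi-sample file never has to fit in memory.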

Reply modified 12 months ago, written 12 months ago by manuel.belmadani