Question: Extracting annotation values from vcf files for plotting distribution
4.8 years ago by
United States
kirannbishwa01 wrote:

I want to extract the annotation values (QUAL, BaseQRankSum, ClippingRankSum, DP, FS, MQRankSum, etc.) of the variants (SNPs and indels) called in my genome resequencing data so that I can:

1) plot the distribution of these values before proceeding to stringent filtering;

2) plot the correlation between several annotation values for the called variants.

A part of the variants_MA605.vcf file looks like this:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    MA605
scaffold_1111    62    .    T    A    61.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=0.358;ClippingRankSum=-1.231;DP=5;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.19;MQRankSum=-1.231;QD=12.35;ReadPosRankSum=0.358;SOR=1.022    GT:AD:DP:GQ:PL    0/1:2,3:5:73:90,0,73
scaffold_1111    301    .    G    A    119.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=2.227;ClippingRankSum=-1.598;DP=73;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.33;MQRankSum=1.356;QD=1.64;ReadPosRankSum=1.404;SOR=0.596    GT:AD:DP:GQ:PL    0/1:59,11:70:99:148,0,1738
scaffold_1111    340    .    C    T    105.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=1.547;ClippingRankSum=-0.490;DP=33;FS=9.645;MLEAC=1;MLEAF=0.500;MQ=22.79;MQRankSum=1.351;QD=3.21;ReadPosRankSum=1.116;SOR=2.799    GT:AD:DP:GQ:PL    0/1:23,10:33:99:134,0,601

I used SnpSift (part of SnpEff) with this command:

java -jar SnpSift.jar extractFields variants_MA605.vcf CHROM POS ID AF QUAL > raw01VarMA605qual.txt

The output text file is like:

#CHROM    POS    ID    AF    QUAL
scaffold_1111    62        0.500    61.77
scaffold_1111    301        0.500    119.77
scaffold_1111    340        0.500    105.77

While extracting the QUAL values (and other string values: CHROM, REF, ALT) has been clear and straightforward, I am not able to pull the annotation values for AC, BaseQRankSum, ClippingRankSum, etc., because they are multiple annotation values stored together under the INFO field. I have checked the documentation, but it was not clear to me and I have not succeeded. How can I extract these INFO fields separately so I can test for correlation between the annotation values?

I have been using SnpSift to get the QUAL values into a text file and R to plot the distribution. Are there tools other than SnpSift that may do a better job of extracting the annotations and producing the appropriate plots?

Thanks in advance !!!

4.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

See my tool bioalcidae:

4.8 years ago by
Andreas wrote:

Another tool to extract values from a VCF file (requires PyVCF):
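
To illustrate the general idea, here is a minimal sketch that uses only the Python standard library (so it is not PyVCF and not the tool referred to above). It splits the INFO column on ";" and "=" and writes the requested tags as separate columns. The function names and the tag list are hypothetical, chosen only for this example, and the sketch assumes a tab-delimited, uncompressed VCF.

```python
import csv


def parse_info(info_field):
    """Split a VCF INFO string such as 'AC=1;DP=5' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flag-type tags have no '=value' part; record them as "1".
        info[key] = value if sep else "1"
    return info


def extract_annotations(vcf_lines, tags):
    """Yield one row per variant: CHROM, POS, QUAL, then the requested INFO tags."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])  # INFO is the 8th column
        # Tags absent from a record are written as "NA" so R reads them as missing.
        yield [cols[0], cols[1], cols[5]] + [info.get(tag, "NA") for tag in tags]


def write_table(vcf_path, out_path, tags):
    """Write the extracted values to a tab-separated file with a header row."""
    with open(vcf_path) as vcf, open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["CHROM", "POS", "QUAL"] + tags)
        writer.writerows(extract_annotations(vcf, tags))
```

Calling something like write_table("variants_MA605.vcf", "annotations.tsv", ["DP", "FS", "MQ", "QD"]) should give a table that can be loaded in R with read.delim(); the output file name is just a placeholder.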


4.8 years ago by
Göttingen, Germany
Manuel Landesfeind wrote:

Use the bcftools query subcommand like this (untested):

bcftools query -f "%CHROM\t%POS\t%ID\t%INFO/AF\t%QUAL\t%INFO/BaseQRankSum\n" "$vcf_file"

However, SnpSift also seems to be able to do this. In fact, in your example you extracted the AF field, which is an INFO tag! Or am I wrong?

Thanks for spotting that. I was totally unaware that it had pulled the AF field values. Thanks for pointing that out to me.
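
Once the values are in a tab-separated table (from either tool), the correlation part of the question can be checked with a few lines of code. The sketch below is in Python rather than R and is only an illustration: the helper names are made up for this example, and it assumes one variant per line with no header row, with "." or any other non-numeric entry treated as missing.

```python
import math


def to_float(text):
    """Convert a field to float; non-numeric entries such as '.' become None."""
    try:
        return float(text)
    except ValueError:
        return None


def load_table(lines, names):
    """Read headerless tab-separated lines into {column name: list of values}."""
    table = {name: [] for name in names}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for name, field in zip(names, fields):
            table[name].append(to_float(field))
    return table


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)
```

For the six columns in the bcftools format string, one would pass an open file handle and the names ["CHROM", "POS", "ID", "AF", "QUAL", "BaseQRankSum"] to load_table, then call pearson(table["QUAL"], table["BaseQRankSum"]). Text columns such as CHROM simply end up as missing values.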

kirannbishwa01 wrote:
