Question: Extracting annotation values from VCF files for plotting distributions
kirannbishwa01 (1.3k, United States) wrote 5.0 years ago:

I want to extract the annotation values (QUAL, BaseQRankSum, ClippingRankSum, DP, FS, MQRankSum, etc.) of the variants (SNPs and indels) called in my genome resequencing data so that I can:

1) plot the distribution of these values before proceeding to stringent filtering;

2) plot the correlation between several annotation values for the called variants.

A part of the variants_MA605.vcf file looks like this:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    MA605
scaffold_1111    62    .    T    A    61.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=0.358;ClippingRankSum=-1.231;DP=5;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.19;MQRankSum=-1.231;QD=12.35;ReadPosRankSum=0.358;SOR=1.022    GT:AD:DP:GQ:PL    0/1:2,3:5:73:90,0,73
scaffold_1111    301    .    G    A    119.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=2.227;ClippingRankSum=-1.598;DP=73;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.33;MQRankSum=1.356;QD=1.64;ReadPosRankSum=1.404;SOR=0.596    GT:AD:DP:GQ:PL    0/1:59,11:70:99:148,0,1738
scaffold_1111    340    .    C    T    105.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=1.547;ClippingRankSum=-0.490;DP=33;FS=9.645;MLEAC=1;MLEAF=0.500;MQ=22.79;MQRankSum=1.351;QD=3.21;ReadPosRankSum=1.116;SOR=2.799    GT:AD:DP:GQ:PL    0/1:23,10:33:99:134,0,601

I used SnpSift (part of SnpEff) with this command:

java -jar SnpSift.jar extractFields variants_MA605.vcf CHROM POS ID AF QUAL > raw01VarMA605qual.txt

The output text file looks like this:

#CHROM    POS    ID    AF    QUAL
scaffold_1111    62        0.500    61.77
scaffold_1111    301        0.500    119.77
scaffold_1111    340        0.500    105.77

While extracting the QUAL values (and the other fixed columns: CHROM, REF, ALT) has been clear and straightforward, I am not able to pull the annotation values for AC, BaseQRankSum, ClippingRankSum, etc., because they are stored together as multiple annotations in the INFO field. I have checked the documentation, but it was not clear to me and I have not succeeded. How can I extract these INFO fields separately so that I can test for correlation between the annotation values?

I have been using SnpSift to get the QUAL values into a text file and R to plot the distribution. Are there any tools other than SnpSift that may do a better job of extracting the annotations and producing the appropriate plots?

Thanks in advance!

Pierre Lindenbaum (133k, France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 5.0 years ago:

See my tool bioalcidae:

Andreas (2.5k) wrote 5.0 years ago:

Another tool to extract values from a VCF file (requires PyVCF):


Manuel Landesfeind (1.3k, Göttingen, Germany) wrote 5.0 years ago:

Use the bcftools query subcommand like this (untested):

bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/AF\t%QUAL\t%INFO/BaseQRankSum\n' "$vcf_file"
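The tab-separated output of such a query can then be loaded for plotting. Here is a minimal Python sketch, assuming one record per line, the six columns in the same order as the format string, biallelic sites (a single AF value per record), and `.` for missing values. The function name `load_query_table` and the sample text (built from the three records in the question) are illustrative only.

```python
import csv
import io

# Assumption: column order mirrors the tags in the bcftools format string.
COLUMNS = ["CHROM", "POS", "ID", "AF", "QUAL", "BaseQRankSum"]
NUMERIC = {"AF", "QUAL", "BaseQRankSum"}


def load_query_table(handle):
    """Read tab-separated rows into a dict of column name -> list of values."""
    table = {name: [] for name in COLUMNS}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header lines
        for name, value in zip(COLUMNS, row):
            if name in NUMERIC:
                # "." marks a missing value; store it as None
                table[name].append(None if value == "." else float(value))
            elif name == "POS":
                table[name].append(int(value))
            else:
                table[name].append(value)
    return table


# Sample text standing in for the query output (values taken from the question).
sample = (
    "scaffold_1111\t62\t.\t0.500\t61.77\t0.358\n"
    "scaffold_1111\t301\t.\t0.500\t119.77\t2.227\n"
    "scaffold_1111\t340\t.\t0.500\t105.77\t1.547\n"
)
table = load_query_table(io.StringIO(sample))
print(table["QUAL"])
print(table["BaseQRankSum"])
```

Each list in `table` can be passed directly to a histogram function in whatever plotting library you prefer, after dropping the `None` entries.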

However, SnpSift also seems to be able to do this. In fact, in your example you extracted the AF field, which is an INFO tag! Or am I wrong?
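To illustrate that AF, BaseQRankSum, DP and the rest all sit in the same INFO column as `key=value` pairs, here is a minimal Python sketch that splits INFO strings into dictionaries and computes a Pearson correlation between two annotations. The helper names `parse_info` and `pearson` are hypothetical, and the INFO strings are shortened copies of the three records in the question. It is only a sketch, not a replacement for a proper VCF parser.

```python
import math


def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Shortened INFO strings from the three example records in the question.
info_strings = [
    "AF=0.500;BaseQRankSum=0.358;DP=5;FS=0.000;MQ=37.19;QD=12.35",
    "AF=0.500;BaseQRankSum=2.227;DP=73;FS=0.000;MQ=27.33;QD=1.64",
    "AF=0.500;BaseQRankSum=1.547;DP=33;FS=9.645;MQ=22.79;QD=3.21",
]
records = [parse_info(s) for s in info_strings]

# Values are kept as strings by parse_info, so convert before doing arithmetic.
dp = [float(r["DP"]) for r in records]
qd = [float(r["QD"]) for r in records]
print(pearson(dp, qd))
```

With only three records the coefficient means little; the same two functions can be applied to every data line of the full VCF.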


Thanks for spotting that. I was totally unaware that it had pulled the AF field values. Thanks for pointing that out to me.

(reply written 5.0 years ago by kirannbishwa01, 1.3k)