Question: Calculate mean_GQ value from individual GQ values in a multi-sample VCF file
3.6 years ago by
aham40 wrote:
  1. I have a multi-sample VCF file (say, comprising 5 samples) created by GATK HaplotypeCaller. The FORMAT field for each sample contains GT:AD:DP:GQ:PL values. I want to calculate the mean GQ value across all five samples so that I can filter the VCF file on the average/cumulative GQ value.
    FORMAT field of the VCF file, followed by two sample columns:
    GT:AD:DP:GQ:PL    1/1:0,21:21:63:736,63,0    0/0:3,0:3:9:0,9,84

  2. Following on from the first question, which is more suitable for filtering a VCF: average GQ or cumulative GQ?

modified 15 months ago by jalinir0 • written 3.6 years ago by aham40
3.6 years ago by
United States / Los Angeles /
Petr Ponomarenko2.6k wrote:
  1. I am confused here: the mean is just the average, so you simply calculate it. You can filter by it with, say, vcftools and awk:

    vcftools --vcf input.vcf --extract-FORMAT-info GQ --out input.vcf

This creates the file input.vcf.GQ.FORMAT, containing chromosome and position columns plus one GQ column per sample. Assuming that chromosome and position together are unique within the file (the VCF format does not require this, so double-check and add special IDs if needed), you can then use awk to filter by mean GQ. Alternatively, you can add a custom annotation to your VCF file from that table, say with vcftools' vcf-annotate, and filter later with more options.
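As a rough illustration of the per-site averaging step, here is a minimal sketch in Python rather than awk. It assumes the GQ table is tab-separated with a header of CHROM, POS and then one column per sample, and that missing genotype qualities appear as "."; both are assumptions to check against your own file. The function names and the choice to average only over non-missing samples are mine, not part of vcftools.

```python
import io


def mean_gq_per_site(handle):
    """Yield (chrom, pos, mean_gq) for each row of a tab-separated GQ table.

    Assumed layout: header CHROM, POS, then one GQ column per sample.
    mean_gq is None when every sample is missing at that site.
    """
    header = next(handle).rstrip("\n").split("\t")
    if header[:2] != ["CHROM", "POS"]:
        raise ValueError("unexpected header: %r" % header)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        chrom, pos = fields[0], int(fields[1])
        # Assumption: "." marks a missing GQ; such samples are left out of the mean.
        values = [float(v) for v in fields[2:] if v not in (".", "")]
        mean = sum(values) / len(values) if values else None
        yield chrom, pos, mean


def filter_sites(handle, min_mean_gq):
    """Return the sites whose mean GQ is at least min_mean_gq."""
    return [
        (chrom, pos, mean)
        for chrom, pos, mean in mean_gq_per_site(handle)
        if mean is not None and mean >= min_mean_gq
    ]


# Small in-memory example; the first row uses the GQ values from the question (63 and 9).
EXAMPLE = (
    "CHROM\tPOS\tsampleA\tsampleB\n"
    "chr1\t100\t63\t9\n"
    "chr1\t200\t99\t.\n"
)

if __name__ == "__main__":
    for chrom, pos, mean in mean_gq_per_site(io.StringIO(EXAMPLE)):
        print(chrom, pos, mean, sep="\t")
```

For a real file, pass an open handle instead of the in-memory example, e.g. `with open("input.vcf.GQ.FORMAT") as fh: rows = list(mean_gq_per_site(fh))`. The resulting chrom, pos and mean columns can be written out as a table for plotting elsewhere.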

(Unfortunately, I do not know whether vcftools/bcftools or other common tools support aggregate functions for filtering on the genotype info of multi-sample VCF files; this is why we use a NoSQL database developed by us at for storage and access.)

  2. It depends on what these 5 samples are and why you are looking for a certain mutation. Also, do they come from the same library prep and sequencing run? For example, the case of 5 technical replicates is very different from, say, a trio analysis with tumor/normal tissues for the proband...
modified 3.6 years ago • written 3.6 years ago by Petr Ponomarenko2.6k

This is what I have been looking for. I applied it to my work and it went well. I now have chrom, pos, and the samples with their respective GQ values. How do I get the average across all the samples for each site (i.e. each chrom and pos)? I need the average so that I can plot it in R.

written 21 months ago by mab65870