I have a multi-sample VCF file (say, comprising 5 samples) created by GATK HaplotypeCaller. The FORMAT field of each sample contains GT:AD:DP:GQ:PL values. I want to calculate the mean GQ value across all five samples so that I can filter the VCF file on the average/cumulative GQ value.
FORMAT field of the vcf file: GT:AD:DP:GQ:PL 1/1:0,21:21:63:736,63,0 0/0:3,0:3:9:0,9,84
Following on from the first question: which is more suitable for filtering a VCF, the average GQ or the cumulative GQ?
I am a bit confused by the question: the mean is just the average, so you calculate it and that is it. You can filter on it with, say, vcftools and awk:
vcftools --vcf input.vcf --extract-FORMAT-info GQ --out input.vcf
This creates the file input.vcf.GQ.FORMAT, with chromosome and position columns followed by one GQ column per sample. Assuming that chromosome plus position is unique within the file (the VCF format does not require this, so double-check and add special IDs if needed), you can use awk to filter by mean GQ. Alternatively, you can add a custom annotation to your VCF file from that table, say with vcftools' vcf-annotate, and filter later with more options.
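If awk is not your thing, the same per-site averaging can be sketched in Python. This is only a minimal illustration. It assumes the GQ.FORMAT table is tab-separated with a single header line, CHROM and POS in the first two columns, and missing genotype qualities written as "."; the function names and the cutoff of 30 are made up for the example.

```python
import csv
import io


def mean_gq_rows(handle):
    """Yield (chrom, pos, mean_gq) for each data row of a GQ.FORMAT table.

    Assumed layout: tab-separated, one header line, then
    CHROM, POS, and one GQ column per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line, if any
    for row in reader:
        if not row:
            continue  # ignore blank lines
        chrom, pos, *gqs = row
        # Assumption: missing GQ values appear as "." and are left out of the mean.
        values = [float(v) for v in gqs if v not in (".", "")]
        mean_gq = sum(values) / len(values) if values else None
        yield chrom, int(pos), mean_gq


def sites_passing(handle, min_mean_gq):
    """Return (chrom, pos) pairs whose mean GQ is at least min_mean_gq."""
    return [
        (chrom, pos)
        for chrom, pos, mean_gq in mean_gq_rows(handle)
        if mean_gq is not None and mean_gq >= min_mean_gq
    ]


# Toy table: the first row uses the two GQ values from the question (63 and 9).
example = (
    "CHROM\tPOS\tsample1\tsample2\n"
    "chr1\t1000\t63\t9\n"
    "chr1\t2000\t.\t20\n"
)

rows = list(mean_gq_rows(io.StringIO(example)))
passing = sites_passing(io.StringIO(example), 30)  # 30 is an arbitrary cutoff
print(rows)
print(passing)
```

For a real file you would pass open("input.vcf.GQ.FORMAT") instead of the StringIO object. Note that when no genotype is missing, the cumulative GQ is simply the mean multiplied by the number of samples, so a cutoff on one is equivalent to a scaled cutoff on the other; the two only diverge when some samples lack a GQ value.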
(Unfortunately, I do not know whether vcftools/bcftools or other common tools support aggregate functions for filtering on genotype info in multi-sample VCF files; this is why we use a NoSQL database that we developed at ALAPY.com for storage and access.)
- It depends on what these 5 samples are and why you are looking for a particular mutation. Were they all part of the same library prep and sequencing run? For example, 5 technical replicates are a very different case from, say, a trio analysis with tumor/normal tissues for the proband...