Wondering how I can get the number of informative sites in a .vcf file using VCFtools. By informative I mean at least two samples share a variant. Any suggestions? Thanks.
Using vcffilterjs: https://github.com/lindenb/jvarkit/wiki/VCFFilterJS. For each variant, the script counts the samples whose genotype is het or hom-var, and INFORMATIVE is added to the FILTER column when at least two samples carry the variant (the expression returns true only when fewer than two do). Then extract the FILTER column and count the number of lines containing INFORMATIVE:
cat input.vcf |\
java -jar dist/vcffilterjs.jar -F INFORMATIVE -e 'function accept(v) { var f=0,i;for(i=0;i<v.getNSamples();++i) {var g=v.getGenotype(i); f+=(g.isHomVar() || g.isHet()?1:0);} return f<2;}accept(variant);' |\
grep -v "^#" | cut -f 7 | grep -c INFORMATIVE
vcftools --gzvcf vcf_file --mac 2 --recode --stdout | grep -vc '^#'
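If neither tool is at hand, the same count can be sketched with plain awk. This is only an illustration, not part of VCFtools or jvarkit: the function name count_informative is made up here, and it assumes an uncompressed, tab-delimited VCF in which GT is the first FORMAT field and "informative" means at least two samples carry a non-reference allele.

```shell
# Hypothetical helper: count sites where at least two samples carry a non-reference allele.
# Assumes an uncompressed VCF with GT as the first colon-separated FORMAT field.
count_informative() {
  awk -F'\t' '
    /^#/ { next }                       # skip meta and header lines
    {
      carriers = 0
      for (i = 10; i <= NF; i++) {      # sample columns start at field 10
        split($i, fields, ":")
        if (fields[1] ~ /[1-9]/) {      # any non-zero allele index in GT
          carriers++
        }
      }
      if (carriers >= 2) n++
    }
    END { print n + 0 }                 # print 0 when no site qualifies
  ' "$1"
}
```

Usage would be something like count_informative input.vcf. Note that this counts samples, whereas --mac 2 counts alleles, so a site where a single sample is hom-var passes --mac 2 but is not counted here.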
MOD-EDIT: OP has opened a new question for this here: Identifying private and shared SNPs using VCFtools
Not sure about VCFtools, but if you are up for trying something new, the Variant Effect Predictor (VEP) gives you minor allele frequency (MAF) information for data in .vcf files. Look at the filtering options available for the VEP, including filtering by frequency, i.e. MAF.
GATK indicates they do this for their best practices and then makes the reader scavenge around the internet to find out how to do it, instead of pointing to a resource. Shame.