3.1 years ago by
Republic of Ireland
NB - this does NOT support multi-allelic records. If you have these, split them via bcftools norm -m-any
----------------------
Assuming that you're confident that your VCF is normalised and that you genuinely just want to count occurrences of heterozygous calls per line, then the following will work for either phased (0|1 or 1|0) or unphased (0/1 or 1/0) genotypes, or a mixture of these.
Here, I'm actually accessing a BCF and it's the imputed 1000 Genomes Phase III data. So, the genotypes are phased.
The gsub
function in AWK conveniently returns the number of matched patterns.
paste <(bcftools view 1000Genomes.Norm.bcf |\
awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
!/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' 1000Genomes.Norm.bcf |\
awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}')
CHR POS ID REF ALT nHet
1 10177 rs367896724 A AC 1490
1 10235 rs540431307 T TA 6
1 10352 rs555500075 T TA 2025
1 10505 rs548419688 A T 1
1 10506 rs568405545 C G 1
1 10511 rs534229142 G A 1
1 10539 rs537182016 C A 3
1 10542 rs572818783 C T 1
1 10579 rs538322974 C A 1
1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C 35
1 10642 rs558604819 G A 21
1 11008 rs575272151 C G 403
1 11012 rs544419019 C G 403
1 11063 rs561109771 T G 15
1 13011 rs574746232 T G 3
1 13110 rs540538026 G A 134
1 13116 rs62635286 T G 414
1 13118 rs200579949 A G 414