I would like to ask how to calculate heterozygosity from 1000 Genomes data.
In the ENCODE study, one of the diversity indices shown in Figure 1 was calculated from the YRI population. The authors wrote that "Heterozygosity was calculated basewise as 2pq, where p and q are allele frequencies estimated from the pilot sample of the 1000 Genomes YRI population". However, the 1000 Genomes data contain many multiallelic SNVs, such as
22 16051453 rs62224611 A C,G 100 PASS AC=478,17;AF=0.0954473,0.00339457;AN=5008;NS=2504;DP=22548;EAS_AF=0.0744,0;AMR_AF=0.1239,0;AFR_AF=0.003,0;EUR_AF=0.0746,0.003;SAS_AF=0.2434,0.0143;AA=.|||;VT=SNP;MULTI_ALLELIC
How can I calculate 2pq from this?
I assume that the possible heterozygous genotypes in this case are "AC", "AG" and "CG".
To calculate the heterozygosity "H", I would define the allele frequencies as
Allele A: p
Allele C: q
Allele G: r
and then compute H = 2pq + 2pr + 2qr.
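
To make the question concrete, here is a minimal Python sketch of the calculation I have in mind. It assumes that the AF entry in the INFO column lists the alternate allele frequencies only, so that the reference allele frequency is 1 minus their sum. The function names are my own, and the example uses the overall AF value from the record above rather than a YRI-specific frequency.

```python
from itertools import combinations


def parse_alt_freqs(info):
    """Return the alternate-allele frequencies from a VCF INFO string.

    Only the plain "AF=" entry is read; population-specific entries
    such as "EUR_AF=" are skipped because they do not start with "AF=".
    """
    for field in info.split(";"):
        if field.startswith("AF="):
            return [float(x) for x in field[3:].split(",")]
    raise ValueError("no AF field found")


def expected_heterozygosity(alt_freqs):
    """Sum 2 * p_i * p_j over every pair of distinct alleles.

    Assumption: the reference allele frequency is 1 - sum(alt_freqs).
    """
    ref = 1.0 - sum(alt_freqs)
    freqs = [ref] + list(alt_freqs)
    return sum(2 * a * b for a, b in combinations(freqs, 2))


# INFO column of the multiallelic record above, shortened to the relevant fields
info = "AC=478,17;AF=0.0954473,0.00339457;AN=5008;NS=2504"
alt = parse_alt_freqs(info)
print(expected_heterozygosity(alt))  # 2pq + 2pr + 2qr for alleles A, C, G
```

Because the three frequencies sum to 1, this pairwise sum equals 1 - (p^2 + q^2 + r^2), and for a biallelic site it reduces to the usual 2pq.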
Is this the correct way to handle multiallelic sites?
Thanks a lot!