Does the heterozygosity flag (--het) in vcftools calculate observed and expected heterozygosity?
5.0 years ago

Hi All,

I am still a beginner in this field and want to make sure I am doing this correctly. I used vcftools to calculate the heterozygosity of my VCF file, which contains one population, with the command below.

vcftools --vcf input.vcf --het --out output.het


The output file I got is below:

INDV  O(HOM)   E(HOM)     N_SITES   F
1     8292008  8649694.8  12260583  -0.09906
10    8675596  8649619.3  12260443  0.00719
11    8874072  8648987.5  12259527  0.06234
13a   8874825  8648830.1  12259283  0.06259
14    8893536  8649143.0  12259772  0.06769
2     8439004  8648868.3  12259414  -0.05813
3     8687517  8649371.6  12260068  0.01056
4     8483412  8649541.0  12260342  -0.04601
5     9064207  8648823.1  12259333  0.11505
6     8512310  8649564.9  12260422  -0.03801
7     9065816  8648660.4  12259099  0.11554
8b    8990089  8649252.6  12259906  0.09440
9     8710054  8649450.6  12260170  0.01678


Note that the column headers in the file say HOM, not HET.

My question: is this homozygosity or heterozygosity? If it is homozygosity, how can I calculate expected and observed heterozygosity?

Thank you all,

Ahmed


Dear All,

Is there any explanation for the above, please?

Thanks,

Ahmed

3.7 years ago

From one of the VCFtools authors:

The output of the --het function is a summary for each individual of the observed number of homozygous sites ( O(hom) ) and the expected number of homozygous sites ( E(hom) ). It also includes the total number of sites that the individual has data for and the inbreeding coefficient F.

[source: https://sourceforge.net/p/vcftools/mailman/message/33590266/]
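
As a quick sanity check on how the columns relate, the F values in the output posted in the question are consistent with F = (O(HOM) - E(HOM)) / (N_SITES - E(HOM)). The sketch below is not taken from the VCFtools source; the function name `inbreeding_f` is hypothetical, and the numbers are the first row of the table in the question.

```python
def inbreeding_f(o_hom, e_hom, n_sites):
    """Inbreeding coefficient from observed/expected homozygous site counts.

    Assumes F = (O(HOM) - E(HOM)) / (N_SITES - E(HOM)), which matches the
    F column of the .het output shown in the question.
    """
    return (o_hom - e_hom) / (n_sites - e_hom)


# Individual "1" from the question's output file.
f1 = inbreeding_f(8292008, 8649694.8, 12260583)
print(round(f1, 5))  # -0.09906, the value reported in the F column
```

A negative F, as here, means the individual has fewer homozygous sites than expected, i.e. more heterozygous sites than expected.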

Kevin


Thank you for this, Kevin. We are also trying to wrap our minds around the best diversity descriptive statistics to provide for high-throughput sequencing data. Observed and expected heterozygosity have been staple descriptive statistics for microsatellite data sets for a long time, which may be why some are keen to figure this one out!

With that said, I'm having trouble understanding the formula on observed heterozygosity above. Although I'm still pretty new to this, wouldn't the following formula be more indicative of proportion of observed heterozygosity?

N_Sites - O(HOM) = O(HET)

N_Sites - E(HOM) = E(HET)

After that, you can apply the following formulas to get each individual's proportion of observed and expected heterozygous sites:

O(HET) / N_Sites = Proportion Observed Heterozygous Sites

E(HET) / N_Sites = Proportion Expected Heterozygous Sites
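
A minimal Python sketch of these formulas, assuming the whitespace-delimited `.het` layout shown in the question (columns INDV, O(HOM), E(HOM), N_SITES, F). The function name `het_proportions` and the inline two-row sample are only for illustration; in practice you would read the text from the file vcftools wrote. Note that the proportions are relative to the sites present in the VCF, not to the whole genome.

```python
def het_proportions(het_text):
    """Map each INDV to (observed, expected) proportion of heterozygous sites.

    Uses O(HET) = N_SITES - O(HOM) and E(HET) = N_SITES - E(HOM),
    then divides each by N_SITES.
    """
    lines = het_text.strip().splitlines()
    header = lines[0].split()
    results = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        n_sites = float(row["N_SITES"])
        o_het = n_sites - float(row["O(HOM)"])
        e_het = n_sites - float(row["E(HOM)"])
        results[row["INDV"]] = (o_het / n_sites, e_het / n_sites)
    return results


# Two rows copied from the output in the question.
sample = """INDV O(HOM) E(HOM) N_SITES F
1 8292008 8649694.8 12260583 -0.09906
10 8675596 8649619.3 12260443 0.00719
"""

props = het_proportions(sample)
for indv, (obs_het, exp_het) in props.items():
    print(indv, round(obs_het, 4), round(exp_het, 4))
```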

All thoughts welcome if I'm far off base here.


E(HOM) - O(HOM) = O(HET)

This looks wrong. Shouldn't it be the following?

#Sites - O(HOM) = O(HET)


I have edited out my formula.


Hi, I'm interested in calculating genome-wide heterozygosity. How did you solve it?