VCF heterozygosity
Entering edit mode
3 months ago

Hello, I want some opinions. I am new to this and need to calculate the heterozygosity by contig. I have my VCF file, and I used six samples. I got the GT of the samples, and I ended in a file that looks like this:

SRHA02000001    316 1/1 1/1 ./. ./. ./. ./. 
SRHA02000001    413 1/1 1/1 ./. 1/1 ./. 1/1 
SRHA02000001    990 1/1 1/1 ./. ./. ./. 1/1 
SRHA02000001    3897    1/1 1/1 1/1 1/1 1/1 1/1 
SRHA02000003    4217    1/1 1/1 ./. ./. ./. 1/1 
SRHA02000003    7871    1/1 1/2 1/1 1/1 1/1 1/1 
SRHA02000003    11031   0/1 1/1 1/1 1/1 1/1 1/1 
SRHA02000002    11164   0/1 1/3 1/1 1/1 0/1 1/1 
SRHA02000002    11364   0/0 0/0 0/1 0/1 0/0 2/2 
SRHA02000002    11434   0/0 1/1 0/0 0/0 0/1 1/1

The first column is the contig id, the second is the position, and the others are GT from the samples. So, to calculate the heterozygosity by contig, I count the number of heterozygotes and divide it by the total (homozygotes + heterozygotes) in each contig without considering the ./. Do you think this is a good approach, or shall I consider, for example, the length of the contig.

contig heterozygosity • 656 views
Entering edit mode

what is the goal of performing this calculation? dependent upon the goal, the approach may differ slightly.


Entering edit mode

I am trying to create a database to use in ML later. The DB shall have the heterozygosity by contig. The main idea is to classify the contigs using this and other features.

Entering edit mode
3 months ago
LauferVA 4.2k

Hello erikatatianacs, thank you for your question. First, a caveat. I think I understand the principles here reasonably well, but I have not done this myself as the primary analyst. So, I will try to advise you the best I can, but my experience is limited.

I would describe the approach you mention as "a good start". However, it is incomplete as it is. At minimum, you need to study the characteristics of the variants and contigs before limiting yourself to such a simple approach. Ok, but why? Well consider these examples:

  1. Contig length. You're right, there are reasons to favor longer contigs. Most importantly, contig length - in certain cases - is correlated with the quality of the sequencing data as well as assembly accuracy. Consider that for a moment - it's problematic for you. Low quality sequencing often produces excess heterozygosity because some of the calls are spurious. But, since these spurious calls are unlikely to be found twice, those variants will almost always be heterozygous. For your use-case, that is bad ... it would be expected to impact that accuracy of your training data directly. To me, the appropriate first step is just to look. Just see - is contig length, overall, associated with sequence quality and % heterozygosity? If yes - you have more work to do. If no, it may be ok not to consider length. A second reason why longer contigs are favored is that they thought to be on average more likely to contain a diversity of sequences, simply because they are longer, and thus some people consider them to be a more representative sample.

  2. Variant Quality Score. I was surprised not to see quality score on your list. As you know, training datasets for Machine Learning should be precisely calibrated, otherwise the algorithm will learn the wrong things. So, you definitely want to try to retain as many true variants and remove as may spurious variants, and quality score is probably the single best way to do that. Other heuristics can be important ... for instance read depth and bidirectional read support for a variant, but these also tend to impact Q-score so further discussion is omitted.

  3. Linkage Disequilibrium, Hardy-Weinberg Equilibrium, and Allele Frequency Again due to the importance of rigorous preparation of training data, consider performing external validation steps to make sure your data are consistent with reports from population databases. For instance, if you find a variant that is WAY out of HWE, has Allele Frequency markedly different than what is reported for ancestry-matched individuals, or displays LD patterns with other nearby variants that differ from LD patterns in reference databases, you should further scrutinize those variants and potentially remove them.

  4. Structural Variation and Copy Number Variation The example you provided does not address the possiblity that a contig could contain information from 1, or 3 or 4 or some other number DNA strands due to CNV. This should be addressed.

  5. Statistical Modeling, Normalization, and ML based variant prediction algorithms Many of the issues above can be modeled. For instance, suppose that Contig Length does end up being confounded with sequence quality. Depending, it could possibly be preferable to model differences between contigs of different lengths rather than simply removing short contigs from analysis. Additionally, please note that variant prediction is itself an important field that is frequently studied using ML. It's likely to be worth your time to familiarize yourself with such approaches.

  6. Assessing for Reference Bias In this context, I would recommend assessing your dataset for reference bias, if applicable.

In summary, I would use the approach you describe as an initial step to generate more sophisticated heuristics based on 1. through 6.


Login before adding your answer.

Traffic: 1274 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6