Question: How are features extracted and encoded from a genotype matrix / VCF file?
Asked 4.2 years ago by William (Europe)

How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?

Are genotypes encoded as follows?

HOM_REF   = 0
HET       = 1
HOM_ALT   = 2

This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT). But what is done with missing genotypes? Are they set to -1 or -99 or something similar?
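To make the question concrete, here is a minimal sketch of the 0,1,2 encoding in Python. It assumes diploid, biallelic calls written as VCF GT strings such as `0/1` or `1|1`. The function name `encode_additive` is made up for illustration, and returning NaN for a missing call is only one possible choice (a sentinel such as -1 can be passed instead).

```python
import math

def encode_additive(gt, missing=math.nan):
    """Map a VCF GT string such as '0/1' or '1|1' to an ALT allele count.

    Assumes diploid, biallelic sites: any allele other than '0' counts as ALT.
    """
    # Treat phased ('|') and unphased ('/') separators the same way
    alleles = gt.replace("|", "/").split("/")
    # A '.' allele means the call is missing; return the chosen placeholder
    if "." in alleles:
        return missing
    # HOM_REF -> 0, HET -> 1, HOM_ALT -> 2
    return sum(a != "0" for a in alleles)

calls = ["0/0", "0/1", "1|1", "./."]
encoded = [encode_additive(gt) for gt in calls]
```

With NaN as the placeholder, downstream code has to decide explicitly how to handle missing values (drop, impute, or mask) instead of silently treating -1 or -99 as a genotype at some distance from HOM_REF.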

Or is one-hot encoding used per variant to encode the 4 possible genotypes?

MISSING   = [1000]
HOM_REF   = [0100]
HET       = [0010]
HOM_ALT   = [0001]

This loses the measure of distance between the genotypes but includes the missing genotype.
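As an illustration of what I mean, a small sketch of this one-hot scheme in Python, again assuming diploid, biallelic GT strings. The names `CATEGORIES`, `classify` and `one_hot` are hypothetical helpers, not taken from any library.

```python
# Column order matches the table above: MISSING, HOM_REF, HET, HOM_ALT
CATEGORIES = ("MISSING", "HOM_REF", "HET", "HOM_ALT")

def classify(gt):
    """Return the genotype class of a diploid, biallelic VCF GT string."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "MISSING"
    # Number of non-reference alleles: 0, 1 or 2 for a diploid call
    alt_count = sum(a != "0" for a in alleles)
    return ("HOM_REF", "HET", "HOM_ALT")[alt_count]

def one_hot(gt):
    """Encode a GT string as a 4-element indicator vector."""
    label = classify(gt)
    return [1 if category == label else 0 for category in CATEGORIES]

vectors = [one_hot(gt) for gt in ["./.", "0/0", "0|1", "1/1"]]
```

Each variant then contributes four columns per sample instead of one, which is where the extra storage cost comes from.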

Is the 0,1,2 matrix or the one-hot encoded matrix then converted to a sparse matrix to save disk/memory storage and computation cost?

I.e., only storing the HET and HOM_ALT genotypes as (index, value) tuples and assuming the rest is HOM_REF. When most genotypes are HOM_REF, this can save 90% of the disk and memory storage.

With the 0,1,2 encoded matrix, wouldn't a sparse matrix be problematic, because you can't differentiate between MISSING and HOM_REF?
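One way this might be handled, sketched here in plain Python as an assumption and not as established practice: give missing calls a non-zero sentinel (-1 below, an arbitrary choice) so that they are stored explicitly, and let only the implicit entries stand for HOM_REF. The helpers `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(row):
    """Keep only entries that differ from HOM_REF (0) as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}

def to_dense(sparse, length):
    """Rebuild the full row; any index not stored is taken to be HOM_REF (0)."""
    return [sparse.get(i, 0) for i in range(length)]

# One sample across ten variants: mostly HOM_REF, one HET, one HOM_ALT,
# and one missing call encoded with the assumed sentinel -1
row = [0, 0, 1, 0, -1, 0, 2, 0, 0, 0]
sparse = to_sparse(row)
restored = to_dense(sparse, len(row))
```

Here `sparse` holds three entries instead of ten, and the missing call at index 4 stays distinguishable from HOM_REF because it is stored. The cost is that the sentinel must be masked or imputed before any arithmetic on the values.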
