
How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?

Are genotypes encoded as

```
HOM_REF = 0
HET = 1
HOM_ALT = 2
```

This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT, which are at distance 2 from each other). But what is done with missing genotypes? Are they set to -1 or -99 or something similar?
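To make the question concrete, here is a minimal sketch of the 0/1/2 encoding applied to VCF-style `GT` strings. The helper name `gt_to_dosage` is hypothetical, and treating any non-`0` allele as ALT (so multi-allelic calls are collapsed) is an assumption for illustration. Missing calls are kept as `None` rather than a numeric sentinel, and mean imputation is shown as one possible way to fill them, not as the standard answer.

```python
import re
from typing import List, Optional


def gt_to_dosage(gt: str) -> Optional[int]:
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return None  # missing call, kept distinct from HOM_REF
    # Assumption: every non-'0' allele counts as ALT.
    return sum(1 for a in alleles if a != "0")


calls: List[str] = ["0/0", "0/1", "1|1", "./."]
dosages = [gt_to_dosage(gt) for gt in calls]

# One possible treatment of missing values: replace them with the
# mean dosage of the observed calls for this variant.
observed = [d for d in dosages if d is not None]
mean_dosage = sum(observed) / len(observed)
imputed = [mean_dosage if d is None else d for d in dosages]

print(dosages)
print(imputed)
```

A sentinel such as -1 or -99 would silently distort any distance or regression computed on the matrix, which is why the sketch keeps missing values out of band until they are handled explicitly.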

Or is one-hot encoding used per variant to encode the 4 possible genotypes?

```
MISSING = [1000]
HOM_REF = [0100]
HET = [0010]
HOM_ALT = [0001]
```

This loses the measure of distance between the genotypes but includes the missing genotype.
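As a sketch of the one-hot alternative, the snippet below maps each genotype to a 4-element indicator vector in the order MISSING, HOM_REF, HET, HOM_ALT used above. The `CATEGORIES` list and the `one_hot` helper are hypothetical names, and `None` is assumed to stand for a missing call.

```python
from typing import List, Optional

# Column order: MISSING, HOM_REF, HET, HOM_ALT
CATEGORIES: List[Optional[int]] = [None, 0, 1, 2]


def one_hot(dosage: Optional[int]) -> List[int]:
    """Return a 4-element indicator vector for a single genotype."""
    return [1 if dosage == c else 0 for c in CATEGORIES]


genotypes: List[Optional[int]] = [None, 0, 1, 2]
encoded = [one_hot(g) for g in genotypes]

for g, vec in zip(genotypes, encoded):
    print(g, vec)
```

Every pair of distinct genotypes ends up at the same distance from each other here, so the ordering HOM_REF < HET < HOM_ALT is no longer visible to a downstream model.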

Is the 0,1,2 matrix or the one-hot encoded matrix then converted to a sparse matrix to save disk/memory storage and computation cost?

I.e., storing only the HET and HOM_ALT genotypes as (index, value) tuples and assuming the rest is HOM_REF. This can save 90% of the disk and memory storage.

In the case of the 0,1,2 encoded matrix, wouldn't a sparse matrix be problematic, because you can't differentiate between MISSING and HOM_REF?
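One way the ambiguity could be avoided, sketched here purely as an assumption, is to store missing calls as explicit entries alongside HET and HOM_ALT, so that only HOM_REF is implicit. The `MISSING` sentinel and the `to_sparse` / `to_dense` helpers are hypothetical, and a plain dict stands in for a real sparse matrix format.

```python
from typing import Dict, List, Optional

MISSING = -1  # sentinel used only inside the sparse store


def to_sparse(dosages: List[Optional[int]]) -> Dict[int, int]:
    """Keep every entry that is not HOM_REF (0), including missing calls."""
    return {
        i: (MISSING if d is None else d)
        for i, d in enumerate(dosages)
        if d != 0
    }


def to_dense(sparse: Dict[int, int], n: int) -> List[Optional[int]]:
    """Rebuild the full row; absent indices are assumed to be HOM_REF."""
    row: List[Optional[int]] = []
    for i in range(n):
        value = sparse.get(i, 0)
        row.append(None if value == MISSING else value)
    return row


row = [0, 0, 1, None, 0, 2, 0, 0]
sparse = to_sparse(row)
restored = to_dense(sparse, len(row))

print(sparse)
print(restored)
```

The storage saving then depends on how rare non-reference and missing calls are together, and the sentinel has to be masked out before any arithmetic on the stored values.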