Question

How are features extracted and encoded from a genotype matrix / VCF file?

7.7 years ago

William ★ 5.3k

How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?

Are genotypes encoded as

HOM_REF   = 0
HET       = 1
HOM_ALT   = 2

This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT). But what is done with missing genotypes? Are they set to -1 or -99 or something similar?
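For concreteness, here is a minimal sketch of the 0/1/2 encoding with a sentinel for missing calls. It assumes diploid, biallelic genotypes given as VCF GT strings; the function name `encode_gt` and the `-1` sentinel are hypothetical choices for illustration, not an established convention.

```python
MISSING = -1  # assumed sentinel; -99 or NaN would work the same way


def encode_gt(gt):
    """Map a diploid, biallelic VCF GT string to an alt-allele count (0, 1 or 2)."""
    alleles = gt.replace("|", "/").split("/")  # treat phased and unphased alike
    if len(alleles) != 2 or "." in alleles:
        return MISSING  # "./.", ".", or a partially missing call
    return sum(allele != "0" for allele in alleles)


# Toy input: rows are samples, columns are variants (made-up data).
gts = [
    ["0/0", "0/1", "1/1"],
    ["./.", "0|1", "0/0"],
]
matrix = [[encode_gt(gt) for gt in row] for row in gts]
print(matrix)
```

Whatever sentinel is chosen, it has to be masked or imputed before any distance or regression calculation, since -1 is not a real allele count.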

Or is one-hot encoding used per variant to encode the 4 possible genotype states?

MISSING   = [1000]
HOM_REF   = [0100]
HET       = [0010]
HOM_ALT   = [0001]

This loses the measure of distance between the genotypes but includes the missing genotype.
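The one-hot scheme above could be sketched as follows. It assumes the genotypes have already been reduced to integer codes, with -1 for MISSING (an assumed sentinel); `ONE_HOT` and `one_hot_row` are hypothetical names for illustration.

```python
# Indicator vectors in the same order as listed above:
# MISSING, HOM_REF, HET, HOM_ALT.
ONE_HOT = {
    -1: (1, 0, 0, 0),  # MISSING (assumed integer code)
    0: (0, 1, 0, 0),   # HOM_REF
    1: (0, 0, 1, 0),   # HET
    2: (0, 0, 0, 1),   # HOM_ALT
}


def one_hot_row(codes):
    """Concatenate the 4-element indicator of every variant for one sample."""
    features = []
    for code in codes:
        features.extend(ONE_HOT[code])
    return features


row = [0, 1, 2, -1]  # one sample, four variants (made-up data)
features = one_hot_row(row)
print(features)
```

Each variant then contributes four columns to the feature matrix instead of one.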

Is the 0/1/2 or the one-hot encoded matrix then converted to a sparse matrix to reduce disk/memory usage and computation cost?

I.e., only storing the HET and HOM_ALT genotypes as (index, value) tuples and assuming the rest is HOM_REF. This can save around 90% of disk and memory storage when most genotypes are HOM_REF.

In the case of the 0/1/2 encoded matrix, would a sparse matrix be problematic because you can't differentiate between MISSING and HOM_REF?
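One possible way around that ambiguity, sketched under assumptions: store missing calls explicitly with their own sentinel value, so that only HOM_REF is implied by absence. The `-1` sentinel and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
MISSING = -1  # assumed sentinel, stored explicitly like HET and HOM_ALT


def to_sparse(codes):
    """Keep only non-HOM_REF entries as (index, value) tuples."""
    return [(i, code) for i, code in enumerate(codes) if code != 0]


def to_dense(entries, length):
    """Rebuild the full row; every index not stored is HOM_REF (0)."""
    dense = [0] * length
    for i, code in entries:
        dense[i] = code
    return dense


row = [0, 0, 1, 0, MISSING, 0, 2, 0, 0, 0]  # one sample, ten variants (made-up data)
sparse = to_sparse(row)
print(sparse)
print(to_dense(sparse, len(row)) == row)
```

The storage saving then depends on the combined fraction of HET, HOM_ALT and MISSING calls rather than on the variant calls alone.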

vcf statistics feature extraction • 2.2k views

updated 2.2 years ago by P ▴ 10 • written 7.7 years ago by William ★ 5.3k

score 0 · Answer 1 · 2022-02-08

2.2 years ago

P ▴ 10

Hi, did you ever get an answer to this question? I am trying to do something similar. Thanks!
