How are features extracted and encoded from a genotype matrix / VCF file?
5.8 years ago
William ★ 5.0k

How are features extracted and encoded from a genotype matrix / VCF file for downstream statistical purposes?

Are genotypes encoded as

HOM_REF   = 0
HET       = 1
HOM_ALT   = 2


This preserves a measure of distance between the genotypes (HET is at distance 1 from both HOM_REF and HOM_ALT, which are at distance 2 from each other). But what is done with missing genotypes? Are they set to -1 or -99 or something?
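To make the question concrete, here is a minimal sketch of the 0/1/2 encoding applied to VCF GT strings. The choice of NaN as the missing sentinel and the per-variant mean imputation step are assumptions for illustration (one option among several), and the names `gt_to_dosage` and `impute_mean` are hypothetical, not taken from any particular tool.

```python
import math

# Assumption: NaN marks a missing genotype, so it cannot be mistaken
# for a real allele count the way -1 or -99 could be.
MISSING = float("nan")


def gt_to_dosage(gt):
    """Convert a diploid GT string such as '0/1' or '1|1' to a non-REF allele count."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        # Treat fully or partially missing calls ('./.', './1') as missing.
        return MISSING
    # Count alleles that are not the reference allele '0'.
    return float(sum(a != "0" for a in alleles))


def impute_mean(dosages):
    """Replace missing values with the mean of the observed dosages for one variant."""
    observed = [d for d in dosages if not math.isnan(d)]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if math.isnan(d) else d for d in dosages]


# One variant across four samples.
calls = ["0/0", "0|1", "1/1", "./."]
dosages = [gt_to_dosage(gt) for gt in calls]
imputed = impute_mean(dosages)
```

Whether imputing, dropping the variant, or keeping an explicit missing code is appropriate would depend on the downstream method.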

Or is one hot encoding used per variant to encode the 4 possible genotypes?

MISSING   = [1000]
HOM_REF   = [0100]
HET       = [0010]
HOM_ALT   = [0001]


This loses the measure of distance between the genotypes but includes the missing genotype.
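A small sketch of what I mean by the one hot encoding, and of how the distance information is lost. It assumes genotypes are already coded as 0, 1, 2 with `None` for missing; the names `one_hot` and `squared_distance` are hypothetical helpers for illustration.

```python
# Column order follows the table above: MISSING, HOM_REF, HET, HOM_ALT.
CATEGORIES = ("MISSING", "HOM_REF", "HET", "HOM_ALT")


def one_hot(code):
    """Map a genotype code (None, 0, 1 or 2) to a 4-element indicator vector."""
    index = 0 if code is None else code + 1
    vector = [0] * len(CATEGORIES)
    vector[index] = 1
    return vector


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


hom_ref, het, hom_alt = one_hot(0), one_hot(1), one_hot(2)

# Every pair of distinct genotypes ends up equally far apart,
# so HOM_REF is no closer to HET than it is to HOM_ALT.
d_ref_het = squared_distance(hom_ref, het)
d_ref_alt = squared_distance(hom_ref, hom_alt)
```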

Is the 0,1,2 or the one hot encoded matrix then converted to a sparse matrix to save disk/memory storage and computation cost?

I.e. only storing the HET and HOM_ALT genotypes as (index, value) tuples and assuming the rest is HOM_REF. Since most genotypes are typically HOM_REF, this can save 90% of the disk and memory storage.

In the case of the 0,1,2 encoded matrix, would a sparse matrix be problematic because you can't differentiate between MISSING and HOM_REF when both are left implicit?
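One possible workaround, sketched below, would be to store missing calls as explicit entries with their own code, so that only HOM_REF is implicit. This is an assumption on my part rather than an established convention; the `-1` sentinel and the names `to_sparse` and `to_dense` are hypothetical.

```python
# Assumption: -1 is reserved for missing calls and is stored explicitly,
# so an absent index can only ever mean HOM_REF (0).
MISSING_CODE = -1


def to_sparse(row):
    """Keep only the non-HOM_REF entries of one sample as {variant_index: code}."""
    return {i: g for i, g in enumerate(row) if g != 0}


def to_dense(sparse, n_variants):
    """Rebuild the dense row, filling every index that was not stored with HOM_REF."""
    return [sparse.get(i, 0) for i in range(n_variants)]


row = [0, 0, 1, 0, MISSING_CODE, 2, 0]
sparse_row = to_sparse(row)
restored = to_dense(sparse_row, len(row))
```

The saving then depends on how rare non-reference and missing calls are in the data.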

vcf statistics feature extraction • 1.7k views
3 months ago
P ▴ 10

Hi, did you ever get an answer to this question? I am trying to do something similar. Thanks!