I would like to know why it is correct to transform SNP data to a 0, 1, 2 format using a reference allele. For example, for SNP1 with alleles C/T, the transformation rules are CC = 2, CT = 1, TT = 0. The encoded data is later fed to machine learning algorithms to predict a specific trait.
I ask because assigning these ordinal values to SNP data may greatly affect the result of a classification model: in a way, we are giving "more importance" to the diploid genotype CC, with the larger value of 2, than to the diploid genotype TT, with a value of 0.
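For concreteness, the 0/1/2 coding described above amounts to counting the copies of one chosen allele in each genotype. Here is a minimal Python sketch of that rule, assuming genotypes are two-character strings; the function name additive_encode and its counted_allele parameter are made up for illustration:

```python
def additive_encode(genotype, counted_allele):
    """Return how many copies of counted_allele the diploid genotype carries."""
    if len(genotype) != 2:
        raise ValueError("expected a two-character diploid genotype such as 'CT'")
    return genotype.count(counted_allele)


# SNP1 has alleles C/T; counting copies of C reproduces CC = 2, CT = 1, TT = 0.
codes = {g: additive_encode(g, "C") for g in ("CC", "CT", "TC", "TT")}
print(codes)  # {'CC': 2, 'CT': 1, 'TC': 1, 'TT': 0}
```

Note that under this coding CT and TC receive the same value, and the distance from TT to CT equals the distance from CT to CC.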
Wouldn't it be better, and more correct, to transform the data into a binary format, where each SNP feature is expanded into 4 binary features: SNP1_CC, SNP1_CT, SNP1_TC, SNP1_TT? With this scheme, the sample:
ID  SNP1  SNP2
1   CC    AG
would be transformed to:
ID  SNP1_CC  SNP1_CT  SNP1_TC  SNP1_TT  SNP2_GG  SNP2_GA  SNP2_AG  SNP2_AA
1   1        0        0        0        0        0        1        0
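To make the proposed binary format explicit, here is a minimal Python sketch that produces the row above. It assumes each sample is a dict of two-character genotype strings and that the two alleles of every SNP are known in advance; the names one_hot_encode, sample and snp_alleles are illustrative only:

```python
from itertools import product


def one_hot_encode(sample, snp_alleles):
    """Expand each SNP into one binary feature per ordered allele pair."""
    features = {}
    for snp, alleles in snp_alleles.items():
        # product(..., repeat=2) yields the 4 ordered pairs, e.g. CC, CT, TC, TT.
        for first, second in product(alleles, repeat=2):
            genotype = first + second
            features[f"{snp}_{genotype}"] = int(sample[snp] == genotype)
    return features


sample = {"SNP1": "CC", "SNP2": "AG"}
# Allele order only controls the column order (SNP2 columns start with GG).
snp_alleles = {"SNP1": ("C", "T"), "SNP2": ("G", "A")}

encoded = one_hot_encode(sample, snp_alleles)
print(list(encoded.keys()))
print(list(encoded.values()))  # [1, 0, 0, 0, 0, 0, 1, 0]
```

In this scheme CT and TC are separate features, so the order in which the two alleles are written matters.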