Question: Is it 100% correct to transform SNP data (categorical) to (0, 1, 2) format to apply ML algorithms later? Why not binary (0, 1) data?
2.6 years ago by
mgvaldesgraterol (10) wrote:

I would like to know why it is considered correct to transform SNP data into a 0, 1, 2 format using a reference allele (for example, SNP1 with C/T alleles and the transformation rules CC = 2, CT = 1, TT = 0) in order to later apply machine learning algorithms to predict a specific trait.

I ask because assigning these ordinal values to SNP data may greatly affect the result of a classification model: in a way, we are giving "more importance" to the diploid genotype CC, with the larger value 2, than to the diploid genotype TT, with the value 0.
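For reference, the 0, 1, 2 rule described above amounts to counting copies of one chosen allele per genotype. A minimal sketch, assuming genotypes arrive as two-letter strings; the function name `additive_encode` is hypothetical and not taken from any library:

```python
def additive_encode(genotype, counted_allele):
    """Return how many copies of counted_allele a diploid genotype carries."""
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return sum(1 for allele in genotype if allele == counted_allele)


# Counting C reproduces the rules from the question: CC = 2, CT = 1, TT = 0.
codes = {g: additive_encode(g, "C") for g in ("CC", "CT", "TC", "TT")}
print(codes)
```

Counting T instead of C simply reverses the scale, so the choice of counted allele fixes the direction of the numbers but not the spacing between them.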

Wouldn't it be better, and more correct, to transform the data into a binary format, where each SNP feature becomes 4 binary features: SNP1_CC, SNP1_CT, SNP1_TC, SNP1_TT? Following this, the sample:


Will be transformed to:

ID  SNP1_CC  SNP1_CT  SNP1_TC  SNP1_TT  SNP2_GG  SNP2_GA  SNP2_AG  SNP2_AA
1   1        0        0        0        0        0        1        0
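The binary transformation proposed above could be sketched as follows. This is only an illustration: the helper `one_hot_genotypes` is hypothetical, and the input sample (SNP1 = CC, SNP2 = AG) is an assumption read back from the transformed row, since the original sample is not shown in the post.

```python
def one_hot_genotypes(sample, levels):
    """Expand each SNP into one 0/1 column per possible genotype level."""
    row = {}
    for snp, genotype in sample.items():
        for level in levels[snp]:
            row[f"{snp}_{level}"] = int(genotype == level)
    return row


# Genotype levels per SNP, in the column order used in the question.
levels = {
    "SNP1": ["CC", "CT", "TC", "TT"],
    "SNP2": ["GG", "GA", "AG", "AA"],
}
sample = {"SNP1": "CC", "SNP2": "AG"}  # assumed input for ID 1

row = one_hot_genotypes(sample, levels)
print(list(row.keys()))
print(list(row.values()))
```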

snp • 1.8k views
modified 2.6 years ago by Giovanni M Dall'Olio (26k) • written 2.6 years ago by mgvaldesgraterol (10)

I don't think that transforming it to categorical 0, 1, 2 necessarily means it is ranked 0 < 1 < 2. You could just as well transform it to categorical "donkey" (homozygous reference), "pig" (homozygous variant) and "chicken" (heterozygous). It's a label.

written 2.6 years ago by WouterDeCoster (36k)

I understand what you are saying, that it is just a relabeling, but what I mean is transforming categorical data into numerical data so that I can apply ML methods that use numeric data and do not support categorical data. Is it still correct?

written 2.6 years ago by mgvaldesgraterol (10)

Absolutely not. A 2 is not double the effect of a 1 for a simple dominant trait. And only zero is affected in a simple recessive trait. A numeric-only ML algo will absolutely screw this up.
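To make the dominant/recessive point concrete, here is a rough sketch of how the same allele count maps to different numeric codes under three common genetic models. The function name `genetic_model_codes` is hypothetical, and which allele gets counted is an assumption that has to be made per SNP.

```python
def genetic_model_codes(allele_count):
    """Map a 0/1/2 count of the counted allele to three alternative codings."""
    if allele_count not in (0, 1, 2):
        raise ValueError(f"allele count must be 0, 1 or 2, got {allele_count!r}")
    return {
        "additive": allele_count,             # each copy adds the same effect
        "dominant": int(allele_count >= 1),   # one copy is already enough
        "recessive": int(allele_count == 2),  # both copies are required
    }


table = {count: genetic_model_codes(count) for count in (0, 1, 2)}
for count, codes in table.items():
    print(count, codes)
```

Only the additive column treats a 2 as twice a 1; the other two collapse the three genotypes into two groups in different ways.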

written 2.6 years ago by karl.stamm (3.4k)

I'm confused, sorry... So you are saying that (0, 1, 2) as numeric data is an incorrect input for an ML algorithm? And a (0, 1) encoding would be a more appropriate one?

written 2.6 years ago by mgvaldesgraterol (10)

No. Numeric input (0, 1, 2) is incorrect. Categorical input (0, 1, 2) is fine.

written 2.6 years ago by WouterDeCoster (36k)

I already told you that I understand that categorical input (0, 1, 2) is OK, because this would just be relabeling the data. But this is not what I'm asking. I'm asking which numerical transformation of categorical SNP data is better for later applying ML algorithms that accept only numeric data as input.

written 2.6 years ago by mgvaldesgraterol (10)

There is no appropriate transformation. You could argue that the genotype homozygous for the most prevalent allele is the least likely to be harmful and could be encoded as 0/neutral. But as John wrote, there are examples of heterozygous advantage compared to both homozygous types. So there are no good general rules. I like the idea of applying ML to variant data, but you should know that most variants are most likely harmless or have minimal effect, and just add noise.

written 2.6 years ago by WouterDeCoster (36k)

Every genetic haplotype has the potential to result in a totally different phenotype. It might be that 0/0 is bad, 0/1 makes you healthier, and 1/1 gives you sickle-cell anaemia.

Moreover, haplotypes in isolation might not make much sense either. 1/1 of allele A might cause cancer, but 1/1 of A and 1/1 of allele B might cancel each other out, and in the process protect you from other cancers. Bottom line: some assumptions and simplifications of the real problem will have to be made in your model, and it's more important that you respect the assumptions made than that you pick the "best" assumption and then pretend your model is the best possible model without any limitations. What I'm saying is, choosing to turn categorical data into continuous data will sensitise your model to diseases that work that way, and that might be a good thing. Choosing a model where every genotype is its own independent observation might sensitise your model to more complex-trait diseases, and miss more obvious ones.
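As a toy illustration of that trade-off, consider a trait where only the heterozygote differs. The phenotype values below are made-up toy numbers chosen for the illustration, not real data, and `fit_line` is a hypothetical helper implementing ordinary least squares for a single predictor.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


counts = [0, 1, 2]           # genotypes coded as allele counts
phenotype = [0.0, 1.0, 0.0]  # toy values: only the heterozygote differs

# Treating the code as a continuous number: the best line is flat.
intercept, slope = fit_line(counts, phenotype)
linear_pred = [intercept + slope * x for x in counts]

# Treating each genotype as its own category: one fitted mean per genotype.
by_genotype = {}
for x, y in zip(counts, phenotype):
    by_genotype.setdefault(x, []).append(y)
categorical_pred = [sum(by_genotype[x]) / len(by_genotype[x]) for x in counts]

print(linear_pred)
print(categorical_pred)
```

In this toy case the numeric coding predicts the same value for every genotype, while the per-genotype coding recovers the pattern, at the cost of estimating more parameters per SNP.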

written 2.6 years ago by John (12k)
2.6 years ago by
London, UK
Giovanni M Dall'Olio (26k) wrote:

If I understand correctly, in this case each variant would be encoded by 4 binary numbers.

However, the problem is that these 4 numbers would be related to each other (if one is 1, the others must necessarily be 0), meaning that you can't use them as independent observations. This is usually bad for any machine learning or regression method.
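The dependency between the 4 numbers can be checked directly. A small sketch, assuming the four genotype levels from the question; dropping one column as a reference level is a common workaround and is shown here only as one option, with hypothetical helper names.

```python
LEVELS = ["CC", "CT", "TC", "TT"]


def one_hot(genotype):
    """Full encoding: one 0/1 indicator per genotype level."""
    return [int(genotype == level) for level in LEVELS]


def drop_first(genotype):
    """Reference-level encoding: omit the indicator of the first level (CC)."""
    return one_hot(genotype)[1:]


# Every fully encoded row sums to 1, so any column is implied by the others.
row_sums = [sum(one_hot(g)) for g in LEVELS]

# With the first column dropped, the four genotypes are still distinguishable.
reduced = {g: tuple(drop_first(g)) for g in LEVELS}

print(row_sums)
print(reduced)
```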

One alternative may be to use only haplotypes instead of genotypes, e.g. have one string for every copy of a chromosome. This will only work if the data are phased, and there are no triallelic SNPs.

written 2.6 years ago by Giovanni M Dall'Olio (26k)

Independent observations? These are variables/features...

My dataset looks similar to something like this... with only biallelic SNPs.


So actually I can have 16 possible combinations of ACGT alleles. So my resulting dataset would be transformed so that each SNP will be represented by 16 columns with 0/1 values.

Is this correct? I'm sorry if I'm not understanding; I'm a computer scientist, I'm new to the bioinformatics world, and I'm very new to all this biology terminology.
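For what it's worth, the count of 16 comes from ordered pairs over the four letters A, C, G, T. For any single biallelic SNP only two of those letters occur, and if the data are unphased the order within a pair carries no information. A quick sketch, using C/T as an assumed example SNP and a hypothetical `canonical` helper:

```python
from itertools import product

# All ordered pairs over the four nucleotides: 4 * 4 = 16.
ordered_pairs = ["".join(p) for p in product("ACGT", repeat=2)]
print(len(ordered_pairs))


def canonical(genotype):
    """Sort the two alleles so that CT and TC become the same label."""
    return "".join(sorted(genotype))


snp_alleles = "CT"  # assumed alleles of one biallelic SNP
ordered = ["".join(p) for p in product(snp_alleles, repeat=2)]
unordered = sorted({canonical(g) for g in ordered})

print(ordered)
print(unordered)
```

Under those assumptions each SNP needs at most 3 genotype columns rather than 16.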

modified 2.6 years ago • written 2.6 years ago by mgvaldesgraterol (10)

In GWAS, for example, it does not matter whether you use GG - AG - AA or any other combination, since you are just checking the difference in minor allele frequency or genotypes between cases and controls. SNPs can indeed have those 16 combinations, but for GWAS etc. that is not really relevant; this information is more relevant for follow-up and for identifying the actual causal effect. Biallelic SNPs have one reference allele and one alternative allele, and for initial analyses it does not matter which combination (GA CC GG TT AA AC TG AT) or which format (0, 1 or 2) is used.

written 2.6 years ago by Floris Brenk (870)

I understand what you are saying, but the fact is that I'm not performing GWAS, I want to apply machine learning algorithms to SNP data of this kind to predict/classify different traits.

written 2.6 years ago by mgvaldesgraterol (10)