Integer encoding versus one-hot encoding for chromosome as a predictor in a neural network
4 months ago
Jimmy ▴ 30

I am creating a neural network and I want to use chromosome as a predictor for the feature I am interested in. There are two ways I can encode chromosome number: integer encoding and one-hot encoding. In the former, chr01 becomes 1, chr02 becomes 2, etc., and that single column is fed in. In the latter, my one column representing chromosome number turns into a number of columns equal to the number of chromosomes in my organism of interest. So if it has 20 chromosomes, I get 20 columns, all of which are 0/False except the one for the chromosome the feature is on, which is 1/True.
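To make the two options concrete, here is a minimal sketch in plain Python. The helper names (`chrom_to_int`, `integer_encode`, `one_hot_encode`) are made up for illustration, and it assumes chromosome names look like `chr01`, `chr02`, and so on, with no sex or unplaced chromosomes.

```python
def chrom_to_int(name):
    """Turn a name like 'chr02' into 2 (assumes a 'chr' prefix plus a number)."""
    return int(name[3:])

def integer_encode(name):
    """Integer encoding: a single column holding the chromosome number."""
    return [chrom_to_int(name)]

def one_hot_encode(name, n_chrom=20):
    """One-hot encoding: n_chrom columns, all 0 except the matching chromosome."""
    vec = [0] * n_chrom
    vec[chrom_to_int(name) - 1] = 1
    return vec

print(integer_encode("chr02"))
print(one_hot_encode("chr02", n_chrom=5))
```

With integer encoding the network sees one number per feature; with one-hot encoding it sees one indicator per chromosome, so each chromosome can get its own weight.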

The drawback of integer encoding is that it imposes an ordinal relationship between chromosome numbers. One-hot encoding, on the other hand, greatly increases the dimensionality of my data. Any suggestions as to which one I should go with, or which is more typical to use?

neural-network predictor chromosome • 277 views
4 months ago
dsull ★ 5.5k

Definitely one-hot encode, imho. You don't want the model to learn a spurious ordering like chr20 > chr19 > ... > chr1.

20 chromosomes (19 features if you drop one column as the reference level) is nothing for a model -- if you're using something like sequence features, those will add far more dimensions to your data than a few chromosome columns will.

Anyway, that being said, it can be something you tune and test yourself on some held-out validation set. I've always got better results with one-hot encoding, but you can always try both and see :)
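A rough sketch of what "try both and see" could look like, under loud assumptions: the data is synthetic, the per-chromosome effects are invented and deliberately unordered, and a plain least-squares fit stands in for the neural network to keep the example self-contained. All names here (`make_data`, `fit_integer`, `fit_one_hot`, `mse`) are hypothetical, and the outcome on real data may well differ.

```python
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible
N_CHROM = 20

# Hypothetical per-chromosome effects, deliberately unrelated to chromosome order.
effects = {c: random.uniform(-1.0, 1.0) for c in range(1, N_CHROM + 1)}

def make_data(n):
    """Return n synthetic (chromosome, target) pairs."""
    rows = []
    for _ in range(n):
        c = random.randint(1, N_CHROM)
        rows.append((c, effects[c] + random.gauss(0.0, 0.1)))
    return rows

# Training set and held-out validation set.
train, valid = make_data(2000), make_data(500)

def fit_integer(rows):
    """Integer encoding: least-squares line y = a + b * chromosome_number."""
    n = len(rows)
    mean_c = sum(c for c, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((c - mean_c) ** 2 for c, _ in rows)
    sxy = sum((c - mean_c) * (y - mean_y) for c, y in rows)
    b = sxy / sxx
    a = mean_y - b * mean_c
    return lambda c: a + b * c

def fit_one_hot(rows):
    """One-hot encoding: least squares without intercept is the per-chromosome mean."""
    totals, counts = {}, {}
    for c, y in rows:
        totals[c] = totals.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    means = {c: totals[c] / counts[c] for c in totals}
    fallback = sum(totals.values()) / sum(counts.values())  # unseen chromosome
    return lambda c: means.get(c, fallback)

def mse(predict, rows):
    """Mean squared error of a predictor on (chromosome, target) pairs."""
    return sum((predict(c) - y) ** 2 for c, y in rows) / len(rows)

mse_integer = mse(fit_integer(train), valid)
mse_one_hot = mse(fit_one_hot(train), valid)
print(f"validation MSE, integer: {mse_integer:.3f}")
print(f"validation MSE, one-hot: {mse_one_hot:.3f}")
```

In this toy setup the integer-encoded model can only fit a trend across chromosome numbers, while the one-hot model gets one free parameter per chromosome. With your own network, the same pattern applies: train once per encoding and compare the metric on the same held-out set.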

