Question

Integer encoding versus one-hot encoding for chromosome as a predictor in a neural network

0

6 months ago

Jimmy ▴ 30

I am building a neural network and want to use chromosome as a predictor for the feature I am interested in. There are two ways I can encode the chromosome: integer encoding and one-hot encoding. In the former, chr01 becomes 1, chr02 becomes 2, etc., and that single numeric column is fed to the network. In the latter, the one column representing chromosome number becomes as many columns as there are chromosomes in my organism of interest. So if it has 20 chromosomes, I get 20 columns: the column for the chromosome the feature is on is 1/True, and the other 19 are 0/False.

The drawback of the first method is that it imposes an ordinal relationship on the chromosomes. The drawback of the second is that it greatly increases the dimensionality of my data. Any suggestions as to which one I should go with, or which is more typical?

neural-network predictor chromosome • 320 views

updated 6 months ago by Ram 43k • written 6 months ago by Jimmy ▴ 30

score 1 · Answer 1 · 2023-10-25

Definitely one-hot encode, IMHO. With integer encoding the model can pick up a spurious ordering like chr20 > chr19 > ... > chr1, which you don't want.
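To make the two options concrete, here is a minimal sketch in plain Python. The `chr01`..`chr20` labels and the helper names `integer_encode` and `one_hot_encode` are assumptions for illustration only; in practice you would probably use whatever encoder your ML library provides.

```python
# Hypothetical chromosome labels for an organism with 20 chromosomes.
CHROMOSOMES = [f"chr{i:02d}" for i in range(1, 21)]
INDEX = {name: i for i, name in enumerate(CHROMOSOMES)}


def integer_encode(chrom):
    """chr01 -> 1, chr02 -> 2, ...: a single column that implies an order."""
    return INDEX[chrom] + 1


def one_hot_encode(chrom, drop_first=False):
    """One column per chromosome, with a single 1 marking the chromosome.

    With drop_first=True the first chromosome becomes the all-zeros
    reference level, leaving len(CHROMOSOMES) - 1 columns.
    """
    vec = [0] * len(CHROMOSOMES)
    vec[INDEX[chrom]] = 1
    return vec[1:] if drop_first else vec


print(integer_encode("chr03"))                        # 3
print(one_hot_encode("chr03"))                        # 20 values, 1 in third position
print(len(one_hot_encode("chr03", drop_first=True)))  # 19
```

The one-hot vectors are all equally far apart from one another, so no chromosome is treated as "greater" than another.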

Twenty chromosomes is nothing for a model: that's 20 one-hot columns, or 19 if you drop one as the reference level. If you're also using something like sequence features, those will add far more dimensions than the chromosome columns do.

That said, you can treat the encoding as something to tune and compare yourself on a held-out validation set. I've always got better results with one-hot encoding, but you can always try both and see :)
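A rough sketch of what "try both and see" could look like. To keep it short, this uses NumPy and a linear least-squares fit as a stand-in for your network, and the data is synthetic: I'm assuming each chromosome has its own effect that is unrelated to its number. Swap in your real features, target, and model; the names here (`design_integer`, `design_one_hot`, `validation_mse`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chrom = 20

# Synthetic data (an assumption for illustration): each chromosome has its
# own effect on the target, unrelated to the chromosome's number.
effects = rng.normal(size=n_chrom)
chrom = rng.integers(0, n_chrom, size=2000)  # 0-based chromosome index
y = effects[chrom] + rng.normal(scale=0.1, size=chrom.size)

# Hold out the last 25% of rows as a validation set.
split = int(0.75 * chrom.size)
train, valid = slice(0, split), slice(split, None)


def design_integer(c):
    # Intercept plus a single integer-coded column (1..20).
    return np.column_stack([np.ones(c.size), c + 1])


def design_one_hot(c):
    # One indicator column per chromosome.
    return np.eye(n_chrom)[c]


def validation_mse(design):
    # Fit on the training rows only, then score on the held-out rows.
    X_train, X_valid = design(chrom[train]), design(chrom[valid])
    coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
    return float(np.mean((X_valid @ coef - y[valid]) ** 2))


mse_integer = validation_mse(design_integer)
mse_one_hot = validation_mse(design_one_hot)
print(f"integer: {mse_integer:.3f}  one-hot: {mse_one_hot:.3f}")
```

On data like this the integer column can only fit a straight-line trend across chromosome numbers, so it does worse on the validation rows. A neural network is more flexible than this linear stand-in, so on real data the gap may be smaller, which is why it's worth checking on your own validation set.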