Question

Data Transformations for machine learning

0

5.6 years ago

druggable ▴ 60

Hi,

I have multiple features as input for machine learning. They are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression and a neural network. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are (because they are already normally distributed) while log-transforming the others.

Thanks, teabonng

machine learning data transformation • 1.7k views

ADD COMMENT • link 5.6 years ago by druggable ▴ 60

0

Hi teabonng,

It is unclear how this question is related to bioinformatics, which is the scope of Biostars. Please elaborate or this question might get closed for being off topic.

Cheers,
Wouter

ADD REPLY • link 5.6 years ago by WouterDeCoster 47k

0

Hi Wouter,

My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are while log-transforming the others.

Thanks.

ADD REPLY • link 5.6 years ago by druggable ▴ 60

0

Then why don't you mention that?

ADD REPLY • link 5.6 years ago by WouterDeCoster 47k

0

Please plot the distributions. It is not really good practice to use features with different distributions, as the models will [by default] assume that they are the same. You will have to standardise the two distributions.

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k
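
Alongside plotting, a numeric summary such as sample skewness can help show which features are far from symmetric. The following is a minimal sketch, not anything posted in the thread: the feature names and the simulated values are hypothetical stand-ins for real normalized counts, and `log1p` (log of 1 + x) is an assumed choice of transform that requires non-negative values.

```python
import numpy as np


def skewness(values):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return 0.0  # a constant feature has no skew
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / std ** 3)


# Hypothetical stand-ins for two normalized count features.
rng = np.random.default_rng(0)
features = {
    "roughly_normal": rng.normal(loc=50.0, scale=5.0, size=2000),
    "right_skewed": rng.lognormal(mean=3.0, sigma=1.0, size=2000),
}

for name, values in features.items():
    raw_skew = skewness(values)
    log_skew = skewness(np.log1p(values))  # assumes values >= 0
    print(f"{name}: skew raw = {raw_skew:.2f}, skew after log1p = {log_skew:.2f}")
```

A strongly positive skew that shrinks towards zero after the log transform suggests the feature benefits from it. A histogram of each feature remains the more informative check.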

0

Hi Kevin,

Thanks for your reply. I have decided to log-transform all the features, then standardize them so that they have more or less the same range, and then use them as input for the neural network. Would this make sense?

Thanks, teabonng

ADD REPLY • link 5.6 years ago by druggable ▴ 60

1

Sounds good. I have done this before for metabolomics datasets. Just be aware that there is still likely bias in the data somewhere when you do this.

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k
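
The workflow agreed on above (log-transform every feature, then standardize) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code anyone in the thread used: the function names are hypothetical, the data are simulated, `log1p` is an assumed transform for non-negative counts, and fitting the mean and standard deviation on a training split only is an added assumption to avoid leaking information from held-out samples.

```python
import numpy as np


def fit_log_standardizer(X_train):
    """Learn per-feature mean and std of log1p-transformed training data."""
    logged = np.log1p(X_train)  # assumes non-negative counts
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return mean, std


def apply_log_standardizer(X, mean, std):
    """Log-transform, then z-score using the training statistics."""
    return (np.log1p(X) - mean) / std


# Simulated samples-by-features matrix with three differently scaled features.
rng = np.random.default_rng(42)
X = rng.lognormal(mean=[2.0, 5.0, 8.0], sigma=[0.5, 1.0, 1.5], size=(500, 3))

X_train, X_test = X[:400], X[400:]
mean, std = fit_log_standardizer(X_train)
Z_train = apply_log_standardizer(X_train, mean, std)
Z_test = apply_log_standardizer(X_test, mean, std)

print(Z_train.mean(axis=0).round(3), Z_train.std(axis=0).round(3))
```

After this step each training feature has mean 0 and standard deviation 1, so the features enter logistic regression or a neural network on a comparable scale. scikit-learn's `StandardScaler` does the same standardization if that library is already in use.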

0

It would still help to clarify what you mean by "the different features have different distributions". Are these ChIP-seq normalised counts and metadata? Presumably, at least all of the ChIP-seq data have been processed in the same way.

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k