Question: Data Transformations for machine learning
13 months ago, teabonng wrote:

Hi,

I have multiple input features for machine learning: normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression and a neural network. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are (because they are already normally distributed) while log-transforming the others.

Thanks, teabonng

modified 13 months ago • written 13 months ago by teabonng

Hi teabonng,

It is unclear how this question is related to bioinformatics, which is the scope of Biostars. Please elaborate or this question might get closed for being off topic.

Cheers,
Wouter

written 13 months ago by WouterDeCoster

Hi Wouter,

My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are while log-transforming the others.

Thanks.

written 13 months ago by teabonng

Then why don't you mention that?

written 13 months ago by WouterDeCoster

Please plot the distributions. It is not really good practice to mix features with different distributions, as the models will [by default] assume that they are the same. You will have to standardise the two distributions.

modified 13 months ago • written 13 months ago by Kevin Blighe
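
As a numeric complement to plotting, one rough way to compare feature distributions is to look at the skewness of each feature before and after a log transform. The sketch below is only an illustration: the feature names and values are made up, and `log1p` (log of 1 + x) is an assumed choice that requires non-negative counts.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores.

    Close to 0 for a symmetric sample, clearly positive for a long right tail.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # a constant feature has no skew
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


# Hypothetical feature columns, for illustration only.
features = {
    "right_skewed": [1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 9.0, 40.0, 250.0],
    "roughly_symmetric": [8.0, 9.0, 10.0, 10.0, 11.0, 12.0],
}

for name, values in features.items():
    raw = skewness(values)
    logged = skewness([math.log1p(v) for v in values])
    print(f"{name}: skewness raw={raw:.2f}, after log1p={logged:.2f}")
```

A feature whose skewness drops markedly after the transform is a candidate for log-transformation; one that is already near 0 gains little from it. This is a heuristic, not a replacement for looking at the histograms.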

Hi Kevin,

Thanks for your reply. I have decided to log-transform all the features, then standardize them so that they have more or less the same range, and then use them as input for the neural network. Would this make sense?

Thanks, teabonng

written 13 months ago by teabonng
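
A minimal sketch of the pipeline described above (log-transform every feature, then standardize), using only the Python standard library. The feature names and values are hypothetical, `log1p` is an assumed choice of transform that requires non-negative counts, and the function names `fit_scaler` and `transform` are made up for this example.

```python
import math
import statistics


def fit_scaler(train_columns):
    """Learn the mean and standard deviation of log1p(x) for each feature."""
    params = {}
    for name, values in train_columns.items():
        logged = [math.log1p(v) for v in values]
        params[name] = (statistics.fmean(logged), statistics.pstdev(logged))
    return params


def transform(columns, params):
    """Apply log1p, then z-score each feature with the stored parameters."""
    out = {}
    for name, values in columns.items():
        mean, sd = params[name]
        # A constant feature has sd == 0; map it to zeros instead of dividing.
        out[name] = [
            (math.log1p(v) - mean) / sd if sd > 0 else 0.0 for v in values
        ]
    return out


# Hypothetical normalised ChIP-seq counts, for illustration only.
train = {
    "mark_A": [0.5, 2.0, 8.0, 150.0, 3.2],
    "mark_B": [10.0, 12.0, 9.5, 11.0, 10.5],
}

params = fit_scaler(train)
scaled = transform(train, params)
```

The parameters are fitted once and then reused, so that held-out data can be scaled with the training mean and standard deviation rather than its own. In practice the same steps are usually done with NumPy and scikit-learn's `StandardScaler`.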

Sounds good. I have done this before for metabolomics datasets. Just be aware that there is still likely bias in the data somewhere when you do this.

written 13 months ago by Kevin Blighe

It would still help to clarify what you mean by "the different features have different distributions". Are these ChIP-seq normalised counts and metadata? Presumably, at least all of the ChIP-seq data has been processed in the same way.

written 9 months ago by Kevin Blighe