Question: Data Transformations for machine learning
23 months ago, druggable wrote:

Hi,

I have multiple features as input for machine learning. My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression and a neural network. However, the different features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are (because they are already normally distributed) while log-transforming the others.

Thanks, teabonng


Hi teabonng,

It is unclear how this question is related to bioinformatics, which is the scope of Biostars. Please elaborate or this question might get closed for being off topic.

Cheers,
Wouter

written 23 months ago by WouterDeCoster

Hi Wouter,

My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression. However, the different features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are while log-transforming the others.

Thanks.

written 23 months ago by druggable

Then why don't you mention that?

written 23 months ago by WouterDeCoster

Please plot the distributions. It is not really good practice to use different distributions, as the models will [by default] assume that they are the same. You will have to standardise the 2 distributions.
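
Alongside plotting, a quick numeric check can show which features are roughly symmetric and which are skewed. The following is only a minimal sketch using the Python standard library; the feature names and counts are made up for illustration and are not from the original data.

```python
import statistics


def skewness(values):
    """Population skewness (third standardized moment); near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # a constant feature has no spread, so report no skew
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical normalized counts for two features (illustrative numbers only).
features = {
    "feature_a": [48.0, 50.0, 52.0, 49.0, 51.0, 50.0],  # roughly symmetric
    "feature_b": [1.0, 2.0, 2.0, 3.0, 5.0, 120.0],      # strongly right-skewed
}

for name, values in features.items():
    print(
        f"{name}: mean={statistics.fmean(values):.2f} "
        f"sd={statistics.pstdev(values):.2f} "
        f"skew={skewness(values):.2f}"
    )
```

A feature with a large positive skew is a typical candidate for a log transform, whereas a feature with skew near zero may already be close to symmetric. Either way, the means and standard deviations printed here show how far apart the scales are before standardisation.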

modified 23 months ago • written 23 months ago by Kevin Blighe

Hi Kevin,

Thanks for your reply. I have decided to log-transform all the features, then standardize them so that they have more or less the same range, and then use them as input for the neural network. Would this make sense?

Thanks, teabonng

written 23 months ago by druggable

Sounds good. I have done this before for metabolomics datasets. Just be aware that there is still likely bias in the data somewhere when you do this.
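
The plan discussed above (log-transform, then standardise) could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: the function names `fit_log_scaler` and `apply_log_scaler` and the counts are hypothetical, and `log1p` is used here so that zero counts do not break the logarithm.

```python
import math
import statistics


def fit_log_scaler(train_values):
    """Return mean and standard deviation of log1p-transformed training values."""
    logged = [math.log1p(v) for v in train_values]
    return statistics.fmean(logged), statistics.pstdev(logged)


def apply_log_scaler(values, mean, sd):
    """Log-transform, then z-score using parameters learned from training data."""
    if sd == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(math.log1p(v) - mean) / sd for v in values]


# Hypothetical normalized counts for one feature (illustrative numbers only).
train_counts = [0.0, 3.0, 7.0, 15.0, 40.0, 250.0]
test_counts = [1.0, 20.0, 90.0]

mean, sd = fit_log_scaler(train_counts)
train_scaled = apply_log_scaler(train_counts, mean, sd)
test_scaled = apply_log_scaler(test_counts, mean, sd)

print([round(x, 2) for x in train_scaled])
print([round(x, 2) for x in test_scaled])
```

The scaling parameters are estimated from the training values only and then reused for the held-out values, so that no information from the test set leaks into the preprocessing. Each feature would be handled separately in the same way.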

written 22 months ago by Kevin Blighe

It would still help to clarify what you mean by "the different features have different distributions". Are these ChIP-seq normalised counts and metadata? Presumably, at least all of the ChIP-seq data has been processed in the same way.

written 19 months ago by Kevin Blighe