Question: DESeq2 - Between sample normalization of train/test subsets
18 months ago by
ml_researcher20 wrote:


I would like to apply machine learning methods to RNA-seq data from the TCGA dataset for survival time analysis. The samples have to be comparable, so I understand I should use a between-sample normalization method such as the one in DESeq2.

I would like to split my dataset into train/test subsets.

1) Is it possible to normalize the training set only using DESeq2 and later use it for normalizing the test set, so the test samples will not affect the normalization of the training set?

2) Will normalizing the training and test subsets separately result in non-comparable samples between train and test?

3) Are there other between-sample normalization methods, better than upper-quartile normalization, for this purpose?


18 months ago by
Asaf8.4k wrote:

I'll answer in reverse order:

3) DESeq2 doesn't use upper-quartile normalization; it uses a different, better method (median-of-ratios size factors).

2) It all depends on whether your ML method uses the actual gene expression values or only their ranks or the relations between them. In the latter case you might as well not normalize the data.

1) I would say either normalize everything together, or select a set of genes to use for normalization. With a fixed gene set you might be able to normalize in separate runs, if you assume that these genes have the same overall expression level across samples.
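
To illustrate point 2, here is a minimal sketch (not DESeq2 code) of why a model that only looks at within-sample ranks is unaffected by sequencing depth. The counts and the helper name `within_sample_ranks` are made up for the example, and depth differences are assumed to act as a pure per-sample scaling.

```python
def within_sample_ranks(counts):
    """Return the rank (0 = lowest) of each gene within one sample."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    ranks = [0] * len(counts)
    for rank, gene_index in enumerate(order):
        ranks[gene_index] = rank
    return ranks


# Hypothetical counts for five genes in one sample.
sample = [120, 15, 980, 43, 310]
# The same sample sequenced three times deeper (assumed pure scaling).
deeper = [3 * c for c in sample]

ranks_sample = within_sample_ranks(sample)
ranks_deeper = within_sample_ranks(deeper)
print(ranks_sample == ranks_deeper)
```

Any per-sample size factor is a positive multiplier, so it cannot change the ordering of genes inside a sample. Real data will also differ by noise and composition, which this toy example ignores.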


Thank you for your answer.

3) Are there other between-sample normalization methods that would allow me to fit a normalizer to the training set and later use it to normalize the test set?

2) I would like the absolute values to be comparable, i.e. if a given gene has a value of X in two samples, it should mean the same thing in both.

1) I do not want a leakage of information from the test set to the training set, but I still would like them to be comparable. Is it acceptable to use only some genes for getting the scaling factors and use them for the normalization of the other genes? I assume that other genes should have different scaling factors.
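
One way to picture "fitting a normalizer on the training set" is a median-of-ratios scheme with a frozen reference: the per-gene geometric means are computed from training samples only, and every sample (train or test) gets its size factor relative to that fixed reference. The sketch below is plain Python, not DESeq2 itself; the function names and toy counts are assumptions for illustration. In R, the `geoMeans` argument of DESeq2's `estimateSizeFactors` may serve a similar purpose.

```python
import math
from statistics import median


def fit_reference(train_counts):
    """Per-gene mean of log counts over the training samples only.

    Genes with a zero count in any training sample get None and are
    skipped later, because their geometric mean would be zero.
    """
    n_genes = len(train_counts[0])
    reference = []
    for g in range(n_genes):
        column = [sample[g] for sample in train_counts]
        if min(column) > 0:
            reference.append(sum(math.log(c) for c in column) / len(column))
        else:
            reference.append(None)
    return reference


def size_factor(sample, reference):
    """Median ratio of one sample's counts to the frozen reference."""
    log_ratios = [
        math.log(c) - ref
        for c, ref in zip(sample, reference)
        if ref is not None and c > 0
    ]
    # median() raises StatisticsError if no usable gene remains.
    return math.exp(median(log_ratios))


def normalize(sample, reference):
    sf = size_factor(sample, reference)
    return [c / sf for c in sample]


# Toy data: rows are samples, columns are genes (hypothetical counts).
train = [[100, 200, 50, 0], [200, 400, 100, 10]]
test_sample = [300, 600, 150, 5]

reference = fit_reference(train)            # uses training data only
train_norm = [normalize(s, reference) for s in train]
test_norm = normalize(test_sample, reference)
print(train_norm[0][:3], test_norm[:3])
```

Because the reference is never recomputed, adding or removing test samples cannot change the normalized training values, and a new sample can be normalized on its own later.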


modified 18 months ago • written 18 months ago by ml_researcher20

You can have a look at this paper for normalization methods:

The raw counts can tell you a lot, depending on the machine learning method you're using. If you take the library depth as an input then, again depending on the algorithm, you might be okay with raw data.
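
As a rough sketch of what "library depth as an input" could look like, the raw counts of a sample can be kept as they are and the total count appended as one extra feature. The helper name `with_depth_feature` and the counts are hypothetical; whether a given model can actually make use of this feature depends on the algorithm.

```python
def with_depth_feature(raw_counts):
    """Return the raw counts followed by the sample's total count (library depth)."""
    return list(raw_counts) + [sum(raw_counts)]


# Hypothetical raw counts for three genes in one sample.
features = with_depth_feature([10, 0, 90])
print(features)
```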

I agree that separating the test and train sets will be best; it will also mean that you could use the tool on a new dataset. I would suggest using a set of predefined genes for the normalization. I don't know whether all the samples are from the same tissue (or organism?), which you would need in order to have such a set.
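
A minimal sketch of the predefined-gene-set idea, assuming the chosen reference genes really are expressed at the same level in every sample and have non-zero counts: each sample is divided by the geometric mean of its own reference-gene counts. The gene names (`REF1`, `REF2`, `GENE_A`) and the function names are placeholders, not real identifiers.

```python
import math


def scale_factor(sample, reference_genes):
    """Geometric mean of one sample's counts over the predefined genes.

    Assumes every reference gene has a count above zero in this sample.
    """
    logs = [math.log(sample[g]) for g in reference_genes]
    return math.exp(sum(logs) / len(logs))


def normalize_by_reference_genes(sample, reference_genes):
    """Divide all counts in one sample by its reference-gene scale factor."""
    sf = scale_factor(sample, reference_genes)
    return {gene: count / sf for gene, count in sample.items()}


reference_genes = ["REF1", "REF2"]  # placeholder gene set

# Two hypothetical samples; sample_b is sample_a at twice the depth.
sample_a = {"REF1": 100, "REF2": 400, "GENE_A": 50}
sample_b = {"REF1": 200, "REF2": 800, "GENE_A": 100}

norm_a = normalize_by_reference_genes(sample_a, reference_genes)
norm_b = normalize_by_reference_genes(sample_b, reference_genes)
print(norm_a["GENE_A"], norm_b["GENE_A"])
```

Each sample is normalized using only its own counts, so nothing is shared between train and test, and a sample from a new dataset can be handled the same way.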

written 18 months ago by Asaf8.4k