Question: DESeq2 - Between sample normalization of train/test subsets
ml_researcher wrote (9 months ago):

Hello,

I would like to apply machine learning methods to RNA-seq data from TCGA for survival time analysis. The samples have to be comparable, so I understand I should use a between-sample normalization method such as the one in DESeq2.

I would like to split my dataset into train/test subsets.

1) Is it possible to fit the DESeq2 normalization on the training set only and later apply it to the test set, so that the test samples do not affect the normalization of the training set?

2) Will normalizing the training and test subsets separately result in non-comparable samples between train and test?

3) Are there other between-sample normalization methods that are better than upper quartile normalization for this purpose?

Thanks

Asaf (Israel) wrote (9 months ago):

I'll answer in reverse order:

3) DESeq2 doesn't use upper quartile normalization; it uses the median-of-ratios method (size factors), which I consider better.

2) It all depends on whether your ML method uses the actual gene expression values or just the ranks or relations between them. In the latter case you might as well not normalize the data.

1) I would say either normalize everything together, or select a set of genes to be used for normalization. With such a set you might be able to normalize train and test in separate runs, if you assume that these genes have the same overall expression level across samples.
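One way to keep the test set out of the normalization is to compute the per-gene reference on the training samples only and reuse it unchanged for the test samples. The sketch below illustrates the median-of-ratios idea in plain Python; it is not DESeq2's code, and the function names (`fit_reference`, `size_factor`, `normalize`) and the toy counts are made up for illustration. In R, DESeq2's `estimateSizeFactors` accepts a `geoMeans` argument that looks suited to the same purpose, but check its documentation before relying on that.

```python
import math
from statistics import median

def fit_reference(train_counts):
    """Per-gene mean of log counts (log geometric mean), training samples only.

    train_counts: list of samples, each a list of raw counts in the same gene order.
    Genes with a zero count in any training sample get None and are skipped later.
    """
    n_genes = len(train_counts[0])
    ref = []
    for g in range(n_genes):
        column = [sample[g] for sample in train_counts]
        if min(column) > 0:
            ref.append(sum(math.log(c) for c in column) / len(column))
        else:
            ref.append(None)
    return ref

def size_factor(sample, ref):
    """Median ratio of one sample's counts to the frozen training reference."""
    log_ratios = [math.log(c) - r
                  for c, r in zip(sample, ref)
                  if r is not None and c > 0]
    return math.exp(median(log_ratios))

def normalize(samples, ref):
    """Divide each sample by its own size factor; the reference is never refit."""
    out = []
    for s in samples:
        sf = size_factor(s, ref)
        out.append([c / sf for c in s])
    return out

# Toy data (invented): 3 training samples, 1 test sample, 4 genes.
train = [[10, 20, 30, 0], [20, 40, 60, 5], [5, 10, 15, 2]]
test = [[40, 80, 120, 7]]

ref = fit_reference(train)          # uses training samples only
train_norm = normalize(train, ref)
test_norm = normalize(test, ref)    # test samples never influence ref
```

Because each size factor depends only on the sample itself and on the frozen reference, adding or removing test samples cannot change the normalized training values.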


Thank you for your answer.

3) Are there other between-sample normalization methods that would let me fit a normalizer on the training set and later use it to normalize the test set?

2) I would like the absolute values to be comparable, i.e. if two samples have a given gene with a value of X, it should mean the same thing in both.

1) I do not want information to leak from the test set into the training set, but I still would like them to be comparable. Is it acceptable to use only some genes to get the scaling factors and then use those factors to normalize the other genes? I assume that other genes should have different scaling factors.

Thanks

(reply written 9 months ago by ml_researcher)

You can have a look at this paper for normalization methods: https://academic.oup.com/bib/article/14/6/671/189645

The raw counts can tell a lot, depending on the machine learning method you're using. If you take the library depth as an input then, again depending on the algorithm, you might be okay with raw data.
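As a rough illustration of passing library depth to the model, the sketch below appends the log of each sample's total count to its raw count vector. The helper name `add_depth_feature` and the numbers are invented for the example, and whether a given algorithm can make good use of this extra feature is an open question, as noted above.

```python
import math

def add_depth_feature(samples):
    """Append log10(total count) to each sample's raw count vector."""
    out = []
    for counts in samples:
        depth = sum(counts)  # library depth of this sample; must be > 0
        out.append(list(counts) + [math.log10(depth)])
    return out

# Toy raw counts (invented): 2 samples, 3 genes.
raw = [[10, 20, 70], [100, 200, 700]]
features = add_depth_feature(raw)
```

The depth is computed per sample, so this step uses no information from other samples and can be applied to train and test sets alike.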

I agree that keeping the test and training sets separate is best; it also means you could use the tool on a new dataset. I would suggest using a set of predefined genes for the normalization. I don't know whether all the samples are from the same tissue (or organism?), which is what you would need for such a set to exist.
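A minimal sketch of normalizing with a predefined gene set, assuming the chosen genes really do have the same overall expression in every sample: each sample is rescaled so that its control genes sum to a fixed target. The function names, the gene indices and the `target` constant are all hypothetical; picking the actual control genes is the hard part and is not addressed here.

```python
def control_gene_factors(samples, control_idx):
    """One scaling factor per sample: summed counts over the control genes."""
    return [sum(s[i] for i in control_idx) for s in samples]

def scale_by_controls(samples, control_idx, target=1000.0):
    """Rescale every gene so each sample's control genes sum to `target`."""
    factors = control_gene_factors(samples, control_idx)
    return [[c * target / f for c in s] for s, f in zip(samples, factors)]

# Toy counts (invented): genes 0 and 1 play the role of control genes.
samples = [[50, 50, 10], [100, 100, 40]]
scaled = scale_by_controls(samples, control_idx=[0, 1])
```

Each sample is scaled using only its own counts, so the same function can be applied to the training set, the test set, or a new dataset without any of them affecting the others. Note that the factor derived from the control genes is applied to all genes of that sample.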

(reply written 9 months ago by Asaf)