Question

DESeq2 - Between sample normalization of train/test subsets

3

6.3 years ago

ml_researcher ▴ 30

Hello,

I would like to apply machine learning methods to RNA-seq data from TCGA for survival-time analysis. The samples have to be comparable, so I understand I should use a between-sample normalization method such as the one in DESeq2.

I would like to split my dataset into train/test subsets.

1) Is it possible to fit the DESeq2 normalization on the training set only and later apply it to the test set, so that the test samples do not affect the normalization of the training set?

2) Will normalizing the training and test subsets separately result in non-comparable samples between train and test?

3) Are there other between-sample normalization methods that are better than upper-quartile normalization for this purpose?

Thanks

RNA-Seq normalization DESeq2 TCGA survival • 3.2k views

updated 6.3 years ago by Asaf 10k • written 6.3 years ago by ml_researcher ▴ 30

score 0 · Answer 1 · 2019-03-11

0

6.3 years ago

Asaf 10k

I'll answer in reverse order:

3) DESeq2 doesn't use upper-quartile normalization. It uses the median-of-ratios method (per-sample size factors), which is generally a better choice.

2) It all depends on whether your ML method uses the actual gene expression values or just the ranks or relations between genes within a sample. In the latter case you might as well not normalize the data at all, since per-sample scaling doesn't change the ranks.

1) I would say either normalize everything together, or select a set of genes to be used for normalization. With a fixed gene set you might be able to normalize in separate runs, if you assume that these genes have the same overall expression level across samples.
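For the "fit on train, apply to test" idea in question 1, here is a minimal NumPy sketch of median-of-ratios size factors with a frozen reference. It is not DESeq2's code: the function names `fit_log_geo_means` and `size_factors` are made up for illustration, the matrices are assumed to be genes x samples raw counts, and the toy numbers are invented. The per-gene geometric means are computed from the training samples only and then reused for any later sample.

```python
import numpy as np

def fit_log_geo_means(train_counts):
    """Per-gene log geometric mean over the training samples (genes x samples).

    Genes with a zero count in any training sample get -inf and are
    ignored later, as in the usual median-of-ratios approach.
    """
    with np.errstate(divide="ignore"):
        log_counts = np.log(np.asarray(train_counts, dtype=float))
    return log_counts.mean(axis=1)

def size_factors(counts, log_geo_means):
    """Size factor per sample, relative to the frozen training reference."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    usable = np.isfinite(log_geo_means)
    factors = []
    for j in range(counts.shape[1]):
        ok = usable & (counts[:, j] > 0)  # skip zeros in this sample
        factors.append(np.exp(np.median(log_counts[ok, j] - log_geo_means[ok])))
    return np.array(factors)

# Toy data (hypothetical): 4 genes, 3 training samples, 1 test sample.
train = np.array([[10, 20, 40],
                  [100, 200, 400],
                  [5, 10, 20],
                  [0, 3, 7]])
test = np.array([[30], [300], [15], [9]])

ref = fit_log_geo_means(train)       # fitted on training data only
train_sf = size_factors(train, ref)
test_sf = size_factors(test, ref)    # test samples never touch the reference

train_norm = train / train_sf
test_norm = test / test_sf
```

Because the reference comes from the training set alone, adding or removing test samples cannot change the training normalization, and both subsets end up on the same scale.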

0

Thank you for your answer.

3) Are there other between-sample normalization methods that would let me fit a normalizer on the training set and later use it to normalize the test set?

2) I would like the absolute values to be comparable, i.e. if two samples have a given gene with a value of X, it should mean the same thing in both.

1) I do not want any leakage of information from the test set to the training set, but I would still like them to be comparable. Is it acceptable to use only some genes to get the scaling factors and then use those factors to normalize the other genes? I assume that other genes should have different scaling factors.

Thanks

Reply • 6.3 years ago by ml_researcher ▴ 30

1

You can have a look at this paper for normalization methods: https://academic.oup.com/bib/article/14/6/671/189645

The raw counts can tell a lot, depending on the machine learning method you're using. If you include the library depth as an input feature then, again depending on the algorithm, you might be okay with raw data.
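One way to read "library depth as an input" is sketched below, under assumptions: the matrix is samples x genes raw counts, every sample has at least one read, and the helper name `features_with_depth` and the log transforms are illustrative choices rather than anything prescribed here. Each row is built from that sample's own counts only, so nothing is shared between train and test.

```python
import numpy as np

def features_with_depth(counts):
    """Return log1p(raw counts) with one extra column: log library size.

    counts: samples x genes matrix of raw counts (assumed non-empty samples).
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)  # total reads per sample
    return np.hstack([np.log1p(counts), np.log(depth)])

# Toy data (hypothetical): 2 samples, 3 genes.
raw = np.array([[10, 0, 90],
                [40, 10, 150]])
X = features_with_depth(raw)  # shape: (2 samples, 3 genes + 1 depth column)
```

Whether a model can actually make use of the depth column depends on the algorithm, as noted above.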

I agree that keeping test and train separate is best; it also means you could use the tool on a new dataset. I would suggest using a set of predefined genes for the normalization. I don't know whether all the samples are from the same tissue (or organism?), which you would need in order to have such a set.
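A rough sketch of the predefined-gene idea, assuming a genes x samples count matrix and a list of control-gene row indices chosen in advance (the indices, the toy counts, and the function name `control_gene_size_factors` are all hypothetical). Each sample is scaled by the geometric mean of its own control-gene counts, so samples can be normalized one at a time, in separate runs, with no information passing between train and test.

```python
import numpy as np

def control_gene_size_factors(counts, control_idx):
    """Per-sample size factor: geometric mean of the control-gene counts.

    counts: genes x samples raw counts; control_idx: rows of the control genes.
    Assumes the control genes are expressed at the same level in every sample.
    """
    ctrl = np.asarray(counts, dtype=float)[control_idx, :]
    if np.any(ctrl <= 0):
        raise ValueError("control genes must have non-zero counts in every sample")
    return np.exp(np.log(ctrl).mean(axis=0))

# Toy data (hypothetical): 4 genes, 2 samples; rows 0 and 1 are the controls.
counts = np.array([[50, 150],
                   [200, 600],
                   [8, 40],
                   [30, 60]])
control_idx = [0, 1]

sf = control_gene_size_factors(counts, control_idx)
normalized = counts / sf  # the same factor is applied to every gene in a sample
```

The result is only as good as the assumption that the control genes are stable across the samples, which is why tissue and organism matter.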

Reply • 6.3 years ago by Asaf 10k