Question: Problem with two different normalization methods in training and test data sets
6 months ago by
modarzi80 wrote:

I have to use two data sets, one for training and one for testing. My training data set comes from TCGA and its workflow type is FPKM-UQ (HTSeq-FPKM and HTSeq-count RNA-seq data are also available in the GDC portal), but my test RNA-seq data were normalized with the transcripts per million (TPM) method. To get reasonable results, should I convert both data sets to a single normalization method, or can I train on the FPKM-UQ data and test my model on the TPM data?

I would appreciate it if anybody could share their comments with me.

Best Regards,


rna-seq tpm fpkm-uq • 208 views
modified 6 months ago by Charles Warden7.2k • written 6 months ago by modarzi80
6 months ago by
Charles Warden7.2k
Duarte, CA
Charles Warden7.2k wrote:

I can think of at least two possibilities:

1) Request access to the controlled-access TCGA data. I think higher-impact papers often re-process that raw data. You might still have batch effects between library types that can't be completely corrected, but you can run a program like RSeQC to get an idea of how large the library batch effects may be.

2) Define a signature that doesn't depend on the exact gene expression levels. To some extent, I would guess a lower-throughput method (such as qPCR) is what is actually going to be used in the clinic (so I would guess that a signature from a high-throughput paper identifying promising features will probably be modified at least one more time before going into a clinical trial). Nevertheless, I've seen some signatures robust enough to be reproduced between platforms using BD-Func (comparing the up-regulated gene distribution against the down-regulated gene distribution), or you might be able to find success between platforms using ssGSEA (when directional information is not in your gene set, or you don't have both up- and down-regulated genes).
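
As a rough illustration of the up- versus down-regulated comparison in option 2, here is a minimal sketch. It is not BD-Func's actual implementation: the function names `log_transform` and `signature_score`, the pseudocount, and the choice of a Welch t-statistic are all assumptions made for this example. The idea is that the score is computed within a single sample on log-scale values, so a per-sample scaling factor (such as the one separating FPKM-type values from TPM) shifts both gene sets by the same amount and largely cancels out.

```python
from math import log2, sqrt
from statistics import mean, stdev


def log_transform(expression, pseudocount=1.0):
    """Return log2(value + pseudocount) for every gene in one sample.

    `expression` maps gene name -> normalized value (TPM, FPKM, FPKM-UQ, ...).
    The pseudocount is an illustrative choice to avoid log2(0).
    """
    return {gene: log2(value + pseudocount) for gene, value in expression.items()}


def signature_score(log_expression, up_genes, down_genes):
    """Welch t-statistic comparing up- vs down-regulated signature genes.

    Computed within ONE sample, so it needs no reference to other samples.
    Genes missing from the sample are skipped.
    """
    up = [log_expression[g] for g in up_genes if g in log_expression]
    down = [log_expression[g] for g in down_genes if g in log_expression]
    if len(up) < 2 or len(down) < 2:
        raise ValueError("need at least two measured genes in each gene set")
    standard_error = sqrt(stdev(up) ** 2 / len(up) + stdev(down) ** 2 / len(down))
    if standard_error == 0:
        raise ValueError("no variation within the gene sets; score is undefined")
    return (mean(up) - mean(down)) / standard_error
```

A positive score means the "up" genes are, on average, more highly expressed than the "down" genes in that sample. With a non-zero pseudocount the cancellation of the scaling factor is only approximate, especially for lowly expressed genes.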

written 6 months ago by Charles Warden7.2k

Dear Dr. Warden, I do not have access to the controlled-access TCGA data. Basically, I would like to solve my problem with computational methods using the available data. So I would like to know: given that my test data set is normalized as TPM, which TCGA workflow type is the best choice for the training data set: FPKM, FPKM-UQ, or HTSeq-count?

I would appreciate it if you could share your comments with me.

Best Regards

modified 6 months ago • written 6 months ago by modarzi80


I can't really promise any method will be the best option ahead of time - I would expect some troubleshooting / benchmarking to be necessary for each project.
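
One strategy that may be worth including in such a benchmark is to rescale the FPKM-type values to TPM within each sample. FPKM and FPKM-UQ both divide counts by gene length and by a per-sample constant, so dividing each value by the sample total and multiplying by one million gives values on the TPM scale. The sketch below assumes that both data sets are restricted to the same gene list and use comparable gene-length annotations; the function name `fpkm_to_tpm` and its `genes` parameter are hypothetical. It does not remove differences caused by library type or by the alignment and quantification pipeline.

```python
def fpkm_to_tpm(fpkm_by_gene, genes=None):
    """Rescale one sample's FPKM (or FPKM-UQ) values so they sum to 1e6.

    fpkm_by_gene: dict mapping gene name -> FPKM-type value for ONE sample.
    genes: optional list of genes to keep (for example, the genes shared
           with the test data set); genes absent from the sample are skipped.
    """
    if genes is None:
        genes = list(fpkm_by_gene)
    else:
        genes = [g for g in genes if g in fpkm_by_gene]
    total = sum(fpkm_by_gene[g] for g in genes)
    if total <= 0:
        raise ValueError("sample has no expression over the selected genes")
    return {g: fpkm_by_gene[g] / total * 1e6 for g in genes}
```

If the test data set is also restricted to the shared genes, its TPM values would need the same re-scaling so that both data sets sum to one million over the same gene list.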

The best option that I can think of is to split the TCGA data into training and validation datasets (or, really, validation set #1 and validation set #2). You can test out different strategies on the "training" set (validation set #1) and try to resist the temptation to check the "validation" set (validation set #2) until you start to feel comfortable with an option. The real test comes from testing predictions in completely new samples (and you may decide to use something lower throughput and less expensive than RNA-Seq for that). However, that is currently the best guideline that I can think of.
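
The split described above could be set up along these lines. This is only a sketch: the function name `split_samples`, the 30% hold-out fraction, and the fixed seed are illustrative assumptions. Fixing the seed makes the split reproducible, so the held-out samples stay the same while different strategies are compared on the first set.

```python
import random


def split_samples(sample_ids, holdout_fraction=0.3, seed=0):
    """Split sample IDs into a development set and a held-out set.

    Returns (development_ids, holdout_ids). IDs are de-duplicated and sorted
    before shuffling so the result depends only on the seed, not on the
    input order.
    """
    ids = sorted(set(sample_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = int(round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]
```

If several samples come from the same patient, splitting on patient IDs rather than sample IDs would avoid having one patient in both sets.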

Best Wishes, Charles

written 6 months ago by Charles Warden7.2k