Question

Problem with two different normalization methods in training and test data sets


5.1 years ago

modarzi ▴ 170

I need to use two data sets: one for training and one for testing. My training data set comes from TCGA, and its workflow type is FPKM-UQ (HTSeq-FPKM and HTSeq-count versions are also available in the GDC portal), but my test RNA-seq data were normalized with the transcripts per million (TPM) method. To get reasonable results, should I convert both data sets to a single normalization method, or can I train my model on the FPKM-UQ data and test it on the TPM data?

I would appreciate it if anybody could share their comments with me.

Best Regards,

Mohammad

RNA-Seq FPKM-UQ TPM • 1.1k views


score 1 · Answer 1 · 2019-03-14


5.1 years ago

Charles Warden 8.2k

I can think of at least two possibilities:

1) Request access to controlled access TCGA data. I think higher impact papers often do some re-processing of that raw data. You might still have batch effects between library types that can't completely be corrected, but you can run a program like RSeQC to get an idea about how big a deal the library batch effects may be.

2) Define a signature that doesn't depend on the exact gene expression levels. To some extent, I would guess a lower-throughput method (such as qPCR) is what is actually going to be used in the clinic (so, I would guess that promising features identified in a high-throughput paper will probably be modified at least one more time before going into a clinical trial). Nevertheless, I've seen some signatures robust enough to be reproduced between platforms using BD-Func (comparing the up-regulated gene distribution against the down-regulated gene distribution), or you might be able to find success between platforms using ssGSEA (when directional information is not in your gene set, or you don't have both up- and down-regulated genes).
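As a rough illustration of option 2, here is a minimal sketch of a rank-based signature score. It is not BD-Func or ssGSEA, only a simplified stand-in for the same idea: compare where the up-regulated genes sit against where the down-regulated genes sit within each sample. The function name and the gene lists are hypothetical. Because it uses only within-sample ranks, the score does not change under any per-sample rescaling or log transform. That helps only to the extent that the two pipelines differ by such a transform, which assumes they used the same gene model and gene lengths.

```python
import numpy as np

def rank_signature_score(expr, genes, up_genes, down_genes):
    """Mean percentile rank of up genes minus mean percentile rank of down genes.

    expr       -- expression values for ONE sample (any normalization)
    genes      -- gene identifiers, same order as expr
    up_genes   -- hypothetical list of genes expected to be up-regulated
    down_genes -- hypothetical list of genes expected to be down-regulated
    """
    expr = np.asarray(expr, dtype=float)
    # Rank genes within the sample (0 = lowest); ties are broken by position.
    order = expr.argsort(kind="stable")
    ranks = np.empty(len(expr))
    ranks[order] = np.arange(len(expr))
    pct = ranks / (len(expr) - 1)  # percentile rank in [0, 1]

    index = {g: i for i, g in enumerate(genes)}
    # Genes missing from this data set are skipped.
    up = [pct[index[g]] for g in up_genes if g in index]
    down = [pct[index[g]] for g in down_genes if g in index]
    return float(np.mean(up) - np.mean(down))
```

A score near +1 means the up genes rank far above the down genes in that sample, and a score near -1 means the opposite. Whether such a score carries over between two specific data sets still has to be checked empirically.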



Dear Dr. Warden, I do not have access to controlled-access TCGA data. Basically, I would like to solve my problem with computational methods applied to the data that are available. So I want to know: when my test data set is normalized as TPM, which TCGA workflow type is a good choice for the training data set: FPKM, FPKM-UQ, or HTSeq-count?

I would appreciate it if you could share your comments with me.

Best Regards

ADD REPLY • link 5.1 years ago by modarzi ▴ 170
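One purely computational option, which was not proposed in the thread, follows from the definitions: when computed from the same counts and gene lengths, FPKM and FPKM-UQ values differ from TPM only by a per-sample constant. Rescaling each sample so its values sum to one million therefore puts FPKM-type values on a TPM-like scale. The sketch below assumes a genes x samples matrix already restricted to the genes shared by both data sets, and a nonzero total for every sample. The function name is hypothetical. It does not remove differences caused by alignment, annotation, or library preparation.

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Rescale a genes x samples matrix so each sample (column) sums to 1e6.

    Works for FPKM or FPKM-UQ input, since both are proportional to
    count / gene_length within a sample. Assumes no all-zero column.
    """
    fpkm = np.asarray(fpkm, dtype=float)
    col_sums = fpkm.sum(axis=0, keepdims=True)  # one total per sample
    return fpkm / col_sums * 1e6
```

If the test data set is restricted to the same shared genes, applying the same column rescaling to it keeps both matrices on a common scale. The usual next step would be a log transform such as `np.log2(tpm + 1)` before modelling.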


Hi,

I can't really promise any method will be the best option ahead of time - I would expect some troubleshooting / benchmarking to be necessary for each project.

The best option that I can think of is to split the TCGA data up into training and validation datasets (or, really, validation set #1 and validation set #2). You can test out different strategies on the "training" set or "validation set #1" and try to resist the temptation to check the "validation" set or "validation set #2" until you start to feel comfortable with an option. The real test comes from testing predictions in completely new samples (and you may decide to use something lower throughput and less expensive than RNA-Seq for that). However, that is currently the best guideline that I can think of.

Best Wishes, Charles

ADD REPLY • link 5.1 years ago by Charles Warden 8.2k
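The split described in the reply above could be sketched as follows. This is only an illustration: the function name, the 30% hold-out fraction, and the fixed seed are assumptions, not values from the thread. Fixing the seed keeps the held-out set identical across runs, which makes it easier to leave that set untouched while trying different normalization strategies on the other part.

```python
import numpy as np

def split_samples(sample_ids, holdout_fraction=0.3, seed=0):
    """Randomly split sample IDs into (working_set, held_out_set).

    working_set  -- "training" / "validation set #1": use freely for tuning
    held_out_set -- "validation" / "validation set #2": check only at the end
    """
    rng = np.random.default_rng(seed)  # fixed seed for a reproducible split
    shuffled = rng.permutation(np.array(sample_ids))
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    held_out = shuffled[:n_holdout].tolist()
    working = shuffled[n_holdout:].tolist()
    return working, held_out
```

If several samples come from the same patient, the split would need to be done on patient IDs instead, so that one patient never appears on both sides.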