How can I normalize RPKM data from TCGA (pan-cancer analysis)?
16 months ago

I have a matrix of miRNA RPKM values downloaded from TCGA, covering several TCGA projects (BRCA, LAML, LUAD, etc.). Columns are TCGA barcodes and rows are miRNA identifiers.

To perform a machine learning analysis, how can I normalize these data across the patients in my matrix? I searched all over the web but couldn't find an answer.

I'm a novice in bioinformatics and computational biology, and any advice is greatly appreciated. Thank you very much.

RNA-Seq RPKM pan-cancer TCGA normalization • 629 views
16 months ago


I know, but I meant normalization between the patients, given that I have data from different projects.


You can convert the RPKM values to log scale and then apply vst (variance-stabilizing transformation).


Thank you. Once I have the vst-normalized data (using the DESeq2 package, right?), is it equivalent to count data transformed with the same vst function? For instance, if I have an RPKM dataset that was first log-scaled and then passed through vst, and a counts dataset normalized with the vst function, are the two comparable in terms of normalization? Thank you very much.


@dare_devil, OK, I tried, but the log-scaled RPKM values are negative in some cases, and the vst function doesn't work on negative values. How can I handle this?


You should have a matrix of values greater than or equal to 1 before taking the log. To achieve this, add 1 to the entire data frame and then convert to log scale; this avoids negative values.
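
As a minimal sketch of the pseudocount suggestion above, assuming miRNAs in rows and TCGA barcodes in columns. The matrix `rpkm`, its values, and the barcodes are made up for illustration:

```r
# Hypothetical toy RPKM matrix: rows are miRNAs, columns are TCGA barcodes
rpkm <- matrix(
  c(0, 0.5, 3, 120, 0.2, 15),
  nrow = 3,
  dimnames = list(
    c("hsa-mir-21", "hsa-mir-155", "hsa-let-7a-1"),
    c("TCGA-XX-0001", "TCGA-XX-0002")
  )
)

# Adding a pseudocount of 1 makes every value >= 1, so its log2 is >= 0
log_rpkm <- log2(rpkm + 1)

# Confirm that no negative values remain after the shift
stopifnot(all(log_rpkm >= 0))
```

An RPKM of 0 maps to exactly 0 on the log scale, so zeros stay zeros instead of becoming `-Inf`.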


Thank you.

Now the problem is that I downloaded some data from GEO (tumoral breast vs. normal breast samples), accession GSE68085. I suppose the data are already log2-normalized, and they contain some negative values. I want to use them as a validation dataset (I'm using an SVM classifier): I downloaded the series matrix and used the batch ID information for batch correction with the ComBat function. Should I reverse the log transform and then apply vst?

Thank you very much again.


In this case, I would suggest the nneg function from the NMF package:

library(NMF)
# Read the RPKM values into `exp` beforehand (e.g. as a data frame),
# then convert it to a matrix
d <- as.matrix(exp)
# Replace negative values with 0
data_non_neg <- nneg(d, method = 'pmax')


This will convert all negative values to 0.

You can go through this link for other methods
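
If installing NMF is not an option, the same clipping can be sketched in base R. This assumes that `nneg` with `method = 'pmax'` and its default threshold simply sets negative entries to 0, as described above. The matrix `d` here is a made-up example:

```r
# Hypothetical matrix of values containing some negatives
d <- matrix(c(-1.2, 0.5, 3, -0.1), nrow = 2)

# pmax() compares each entry with 0 and keeps the larger one,
# so negatives become 0 and the matrix shape is preserved
data_non_neg <- pmax(d, 0)

stopifnot(all(data_non_neg >= 0))
```

Note that clipping discards how far below zero a value was, so it is worth checking how many entries are affected before relying on it.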


You can convert log2-scaled data back to the corresponding RPKM values by applying the inverse of the log2 transform. I looked at your dataset, GSE68085, but I don't think the values are log-transformed.
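
A sketch of that inverse step, assuming the values were produced as log2(RPKM + 1). If no pseudocount was used, drop the `- 1`. The vector `log_vals` is illustrative, not taken from GSE68085:

```r
# Hypothetical log2(RPKM + 1) values for one sample
log_vals <- c(0, 1, 3.5, 10)

# 2^x reverses log2; subtracting 1 removes the pseudocount
rpkm_back <- 2^log_vals - 1

# Round trip: transforming again should reproduce the input
stopifnot(isTRUE(all.equal(log2(rpkm_back + 1), log_vals)))

# Inspecting the range can hint at whether data are on a log scale
range(log_vals)
```

The range check is only a heuristic: a small span of values with negatives present is suggestive of some transformation, but it does not identify which one was applied.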


Thank you! OK, but these data are described as "normalized", and I can't work out what type of normalization was applied. Does it just refer to RPKM? And if so, why are there negative values? I read the series matrix and could not find any other useful information. Thanks again.
