How can I normalize RPKM data from TCGA (pan-cancer analysis)?
3.7 years ago

I have a matrix of miRNA RPKM values downloaded from TCGA, covering several TCGA projects (BRCA, LAML, LUAD, etc.). Columns are TCGA barcodes; rows are miRNA identifiers.

To perform a machine learning analysis, how can I normalize these data across the patients in my matrix? I searched the web but couldn't find an answer.

I'm a novice in bioinformatics and computational biology, so any advice is greatly appreciated. Thank you very much.

RNA-Seq RPKM pan-cancer TCGA normalization • 1.4k views
3.7 years ago

RPKM is already normalized.


I know, but I meant normalization between patients, considering that I have data from different projects.


You can convert the RPKM values to log scale and then apply vst.


Thank you. Once I have the vst-normalized data (using the DESeq2 package, right?), is it equivalent to count data transformed with the same vst function? For instance, if I have an RPKM dataset that was first log-scaled and then passed through vst, and a count dataset normalized with vst, are the two comparable in terms of normalization? Thank you very much.


@dare_devil, OK, I tried, but the log-scaled RPKM values are negative in some cases, and the vst function doesn't work on negative values. How can I handle this?


The values should be greater than or equal to 1 before you take the log. To achieve this, add 1 to the entire data frame and then convert to log scale. Since RPKM values are never below 0, log(RPKM + 1) is never negative.
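
To make the add-1 step concrete, here is a minimal base R sketch. The matrix, its row and column names, and the choice of log2 as the base are all made up for illustration; only the idea of adding 1 before taking the log comes from the advice above.

```r
# Toy RPKM matrix (hypothetical values): rows are miRNAs, columns are samples
rpkm <- matrix(c(0, 0.5, 3, 1023),
               nrow = 2,
               dimnames = list(c("mir_a", "mir_b"), c("sample_1", "sample_2")))

# Plain log2 gives -Inf at 0 and negative numbers for values between 0 and 1
log2(rpkm)

# Adding a pseudocount of 1 first keeps every transformed value >= 0
log_rpkm <- log2(rpkm + 1)
log_rpkm
```

The pseudocount compresses differences among very small RPKM values, so it is worth keeping in mind when interpreting lowly expressed miRNAs.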


Thank you.

Now the problem is that I downloaded some data from GEO (tumor vs. normal breast samples), accession GSE68085. I suppose the data are already log2-normalized, and they contain some negative values. I want to use them as a validation dataset (I'm using an SVM classifier): I downloaded the series matrix and used the batch ID information for batch correction with the ComBat function. Should I invert the log transformation (i.e. exponentiate) and then apply vst?

Thank you very much again.


In this case, I would suggest the nneg function from the NMF package:

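# Load the NMF package, which provides nneg()
library(NMF)
# Read the RPKM values (tab-separated; the first column holds the row names)
expr <- read.table("rpkm.txt", header = TRUE, sep = "\t", row.names = 1)
# Convert the data frame to a matrix
d <- as.matrix(expr)
# Replace negative values with 0
data_non_neg <- nneg(d, method = "pmax")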

This converts all negative values to 0.

You can go through this link for other methods
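
If you just want to see what clamping at zero does without installing NMF, here is a small base R sketch on a made-up matrix. I'm assuming that `pmax(d, 0)` gives the same result as the `'pmax'` method of nneg with its default threshold; treat that equivalence as an assumption and check it on your own data.

```r
# Toy log-scale matrix with negative entries (hypothetical values)
d <- matrix(c(-1.2, 0.4, 2.0, -0.3),
            nrow = 2,
            dimnames = list(c("mir_a", "mir_b"), c("sample_1", "sample_2")))

# Element-wise maximum with 0: negatives become 0, other values are unchanged
d_non_neg <- pmax(d, 0)
d_non_neg
```

Note that clamping throws away the differences among the negative values, so it only makes sense if those values carry no signal you need.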


You can convert log2-scaled data back to the corresponding RPKM values by applying the inverse function. I looked at your dataset, GSE68085, but I don't think the values are log-transformed.
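
For reference, a rough sketch of the inverse in base R. The input values are invented, and which formula applies depends on how the data were originally transformed, which is not known for GSE68085.

```r
# Hypothetical log2-scale values, including a negative one
log_values <- c(-1, 0, 3, 10)

# If the data were stored as log2(RPKM), the inverse is 2^x
rpkm_plain <- 2^log_values

# If they were stored as log2(RPKM + 1), also subtract the pseudocount.
# A negative result here means the input cannot have been log2(RPKM + 1).
rpkm_pseudo <- 2^log_values - 1

rpkm_plain
rpkm_pseudo
```

Negative entries in a matrix therefore rule out log2(RPKM + 1), but they are compatible with plain log2(RPKM) or with some centering or scaling step.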


Thank you! OK, but these data are described as "normalized", and I can't tell what type of normalization was applied. Does it just refer to RPKM? And if so, why are there negative values? I read the series matrix and could not find any other useful information. Thanks again.


You can download the raw data and redo the analysis. The raw data are available for download here.

