I am learning to analyse TCGA data. I realized that I can get three kinds of RNA expression data: raw counts, scaled estimates (TPM), and upper-quartile-normalized RSEM count estimates. I am confused about which kind I should choose. For example, if I want to explore the correlation between RNA expression and a clinical feature, which kind of data should be chosen? What can the upper-quartile-normalized RSEM count estimates and the scaled estimates (TPM) each be used for? Thank you!
My own golden rule in bioinformatics and data analysis: Always aim to get the data in its most raw form possible.
Apart from the fact that TPM and upper-quartile normalisation methods have been found to be less than ideal, obtaining the data in its most raw form will give you maximum control over how you analyse it. Granted, under time pressure, this may not be practical. Someone else may chime in here and say that there are 100s of publications where these types of normalised counts have been used, but something being published doesn't attest to its quality at all, even if it's in a top-tier journal. There are 1000s of GWAS studies published, for example, the vast majority of whose results are not reproducible.
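To make the "maximum control" point concrete, here is a minimal sketch of how both normalised forms can be derived from raw counts yourself. It is an illustration, not the TCGA pipeline: the function names, the toy counts and gene lengths are made up, the TPM calculation uses plain gene length rather than an effective length, and the scale factor of 1000 for the upper-quartile version is an assumption.

```python
from statistics import quantiles


def tpm(counts, lengths_bp):
    """Transcripts per million from raw counts and gene lengths in bp."""
    # Reads per kilobase for each gene, then rescale so the sample sums to 1e6.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def upper_quartile(counts, scale=1000):
    """Divide counts by the 75th percentile of the non-zero counts.

    The `scale` multiplier is an assumption made for this sketch.
    """
    nonzero = [c for c in counts if c > 0]
    q75 = quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / q75 * scale for c in counts]


# Made-up raw counts and gene lengths for one sample.
example_counts = [100, 200, 0, 400, 300]
example_lengths = [1000, 2000, 1500, 4000, 1000]

example_tpm = tpm(example_counts, example_lengths)
example_uq = upper_quartile(example_counts)
print(example_tpm)
print(example_uq)
```

Going from raw counts to either form is a few lines; going back from a normalised value to the raw count is not possible without the original totals, which is why starting from raw counts keeps your options open.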
Obtaining raw RNA-seq counts is no longer an issue in terms of data processing either, because we now have super-rapid pseudo-aligners at our disposal, such as Kallisto and Salmon, which can process >500 samples in just a couple of days.
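As a rough sketch of what you would do with the output, the snippet below pulls the estimated read counts out of a Salmon-style `quant.sf` table (tab-separated, with columns `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`). The helper name `read_num_reads` and the two-transcript table are hypothetical and only stand in for a real file. Note that these are transcript-level estimated counts, so you would still need to summarise them to gene level before most downstream analyses.

```python
import csv
import io


def read_num_reads(handle):
    """Map transcript name -> estimated read count from a quant.sf-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


# Tiny made-up table standing in for a real quant.sf file.
demo_quant_sf = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t250000.0\t20.0\n"
    "tx2\t2000\t1800.0\t750000.0\t135.0\n"
)

demo_counts = read_num_reads(demo_quant_sf)
print(demo_counts)
```

With a real run you would pass `open(path_to_quant_sf, newline="")` instead of the in-memory table, once per sample, and combine the per-sample dictionaries into a count matrix.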
So, raw counts.