Dear all

I am very new to this, so apologies in advance.

I am trying to teach myself the basics of R and bioinformatics, and would like to attempt some analysis of TCGA data.

I have downloaded the RNA-seq data from the LUAD TCGA data set using the 'HTSeq-Counts' link from this page:

I would like to do a simple correlation of the expression of gene y with that of gene z.
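To make the correlation step concrete, here is a minimal sketch of a Pearson correlation between two genes across samples. The expression values are made up for illustration, and the function name `pearson` is just a label chosen here; in practice the inputs would be normalised, log-scale expression values for the same samples in the same order.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists of expression values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical log-scale expression of gene y and gene z in five samples.
gene_y = [2.0, 4.0, 6.0, 8.0, 10.0]
gene_z = [1.0, 3.0, 2.0, 5.0, 4.0]

r = pearson(gene_y, gene_z)
print(f"Pearson r = {r:.3f}")
```

A rank-based (Spearman) correlation may be a safer choice if the expression values are skewed or contain outliers.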

I then want to perform Kaplan-Meier analysis of overall survival after dividing patients into high- and low-expression groups for gene y (using the median as the cut-off).
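The median split and the Kaplan-Meier estimate can be sketched as follows. This only illustrates the arithmetic, with made-up patient IDs, expression values, follow-up times and event flags; the function names are hypothetical. A real analysis would normally use an established survival package, which also provides confidence intervals and a log-rank test.

```python
from statistics import median

def median_split(expr):
    """Label each patient 'high' if expression is above the cohort median, else 'low'."""
    cutoff = median(expr.values())
    return {p: ("high" if v > cutoff else "low") for p, v in expr.items()}

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death.

    events: 1 = death observed, 0 = censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t  # deaths and censored patients both leave the risk set
        i += n_at_t
    return curve

# Hypothetical gene y expression per patient.
expr = {"p1": 1.0, "p2": 2.0, "p3": 3.0, "p4": 4.0}
groups = median_split(expr)

# Hypothetical follow-up (months) and event flags for one group.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(groups)
print(curve)
```

Running `kaplan_meier` separately on the 'high' and 'low' groups gives the two curves to compare.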

My questions are about normalising the data before I perform the analyses.

1) The UCSC page explains that the data have been log2(x+1) transformed. **I would like to know whether the raw data were normalised for library size prior to log transformation. If not, is this necessary?**

I obtained the count matrix and back-transformed the counts ((2^x) - 1). I then summed the total counts per sample and obtained a different value for each sample, which makes me assume that the counts were not corrected for library size.
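The check described above can be sketched like this, using a tiny made-up matrix (the sample names and values are hypothetical, not taken from the LUAD data). It inverts log2(x+1), rounds back to integer counts, and sums per sample.

```python
# Hypothetical log2(count + 1) values; each list holds the same genes in the same order.
log_counts = {
    "sample_A": [0.0, 3.0, 10.0],
    "sample_B": [1.0, 4.0, 11.0],
}

def back_transform(values):
    """Invert log2(x + 1) and round to the nearest integer count."""
    return [round(2 ** v - 1) for v in values]

raw_counts = {s: back_transform(v) for s, v in log_counts.items()}
library_sizes = {s: sum(c) for s, c in raw_counts.items()}

print(raw_counts)
print(library_sizes)
```

If the per-sample totals differ widely, as they do here, the values are most likely raw counts that have not been scaled to a common library size.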

2) Finally, assuming that the data have not been normalised for library size (or distribution etc.), what method would you suggest for normalising the data for my analysis? I understand I could use TPM, TMM, or another method such as that used by DESeq2.
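For intuition about the DESeq2-style option, here is a rough sketch of the median-of-ratios idea behind its size factors. It is a simplified illustration on made-up counts, not DESeq2's actual code, and the function name `size_factors` is hypothetical. For real data the package itself should be used. Note also that TPM additionally requires gene lengths, which a raw count matrix alone does not provide.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each usable gene across samples.
    log_geo_means = []
    for g in range(n_genes):
        vals = [counts[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_geo_means.append((g, sum(math.log(v) for v in vals) / len(vals)))
    # Size factor = median ratio of a sample's counts to the gene-wise geometric means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm for g, lgm in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical raw counts: sample_B was sequenced roughly twice as deeply.
counts = {
    "sample_A": [10, 20, 30, 0],
    "sample_B": [20, 40, 60, 5],
}
factors = size_factors(counts)
normalised = {s: [c / factors[s] for c in v] for s, v in counts.items()}
print(factors)
```

After dividing by the size factors, the first three genes have equal values in both samples, which is what correcting for sequencing depth should achieve in this toy case.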

Thanks in advance for any help. I did try to find the answer to q1 elsewhere but couldn't.

Hey Kevin

Huge thanks for this, appreciate it. I'll convert back as you've explained and then try using DESeq2 to normalise before analysis. Will have a go at your tutorial for survival...

p.s. After I messaged, I did a bit more digging and noticed that Xena has 2 'versions' of TCGA for download. One is the GDC version, which I am using, and the other is, I guess, the 'legacy' data. The latter is apparently RSEM normalised and then logged.

https://xenabrowser.net/datapages/?cohort=TCGA%20Lung%20Adenocarcinoma%20(LUAD)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Yes, you can also use the RSEM estimated counts as input to DESeq2 via tximport - I have done this many times with TCGA data. The DESeq2 vignette goes over all of these options.