I am very new to this, so apologies in advance.
I am trying to teach myself the basics of R and bioinformatics and would like to attempt some analysis of TCGA data.
I have downloaded the RNA-seq data from the LUAD TCGA data set using the 'HTSeq-Counts' link from this page:
I would like to do a simple correlation of the expression of gene y with that of gene z.
I then want to perform Kaplan-Meier analysis of overall survival after dividing patients into high and low expression of gene y (using the median to split the cohort).
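To make the plan concrete, here is a minimal sketch in base R of the correlation and the median split, using made-up numbers. The matrix `expr`, the row names `geneY` and `geneZ`, and the sample names are placeholders I invented for illustration, not real TCGA data, and I am assuming the values are already log-transformed.

```r
# Toy log-scale expression matrix: genes in rows, samples in columns.
# All names and values are placeholders, not real TCGA data.
expr <- matrix(
  c(5.1, 6.3, 4.8, 7.2, 6.0, 5.5,
    4.9, 6.8, 4.1, 7.5, 6.2, 5.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneY", "geneZ"), paste0("sample", 1:6))
)

# Correlation of gene y with gene z across samples.
# Spearman is rank-based, so the log transform itself does not change it.
rho <- cor(expr["geneY", ], expr["geneZ", ], method = "spearman")

# Median split on gene y: samples above the median are "high".
cutoff <- median(expr["geneY", ])
group <- factor(ifelse(expr["geneY", ] > cutoff, "high", "low"),
                levels = c("low", "high"))
table(group)
```

The `group` factor is what I would then pass to the Kaplan-Meier step (for example with the `survival` package) once the samples are matched to the clinical data.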
My questions are about normalising the data before I perform the analyses.
1) The UCSC page explains that the data have been log2(x+1) transformed. I would like to know whether the raw counts were normalised for library size prior to the log transformation. If not, is this necessary?
I obtained the count matrix and back-transformed the counts ((2^x) - 1). I then summed the counts per sample and got a different total for each sample, which makes me assume that the counts were not corrected for library size.
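For reference, the check I ran looks roughly like the sketch below. The matrix `log_counts` is a small invented stand-in for the downloaded file (genes in rows, samples in columns), so the gene and sample names and the numbers are assumptions for illustration only.

```r
# Toy matrix standing in for the downloaded log2(count + 1) values;
# the names and numbers are invented for illustration.
log_counts <- matrix(
  c(log2(10 + 1), log2(200 + 1),
    log2(30 + 1), log2(50 + 1),
    log2(0 + 1),  log2(750 + 1)),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB"))
)

# Undo the log2(x + 1) transform to get back to counts.
# round() removes tiny floating-point errors from the back-transform.
counts <- round(2^log_counts - 1)

# Total counts per sample, i.e. the library size.
lib_size <- colSums(counts)
lib_size
```

If the data had been scaled to a common library size before the log transform, I would expect these column totals to be (nearly) identical across samples.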
2) Finally, assuming that the data have not been normalised for library size (or distribution etc.), what method would you suggest for normalising them for my analysis? I understand I could just use TPM, or TMM, or another method such as the one used by DESeq2.
Thanks in advance for any help. I did try to find the answer to question 1 elsewhere but couldn't.