Question: Normalisation of RNAseq data from UCSC Xena Browser
gravatar for oncdoc86
27 days ago by
oncdoc860 wrote:

Dear all

I am very new to this so apologies in advance..

I am trying to teach myself the basics of R and bioinformatics and would like to attempt some analysis of TCGA data

I have downloaded the RNA-seq data from the LUAD TCGA data set using the 'HTSeq-Counts' link from this page:

I would like just do a simple correlation of expression of gene y with gene z

I then want to perform Kaplan-Meier analysis of overall survival after dividing patients into high and low expression of gene y (using the median to split the cohorts)

My questions are about normalising the data, before I perform the analyses

1) The UCSC page explains that the data has been log(x+1) transformed. I would like to know if the raw data was normalised for library size prior to log transformation? If not, is this necessary?

I obtained the count matrix and back-transformed the counts ((2^x) - 1) and I then summed the total counts per sample and obtained different values per sample, which makes me assume that the counts were not corrected for library size

2) Finally, assuming that the data have not been normalised for library size (or distribution etc) what method would you suggest to normalise for my analysis. I understand I could just do TPM or could do TMM or another method such as that used by DESeq2

Thanks in advance for any help, I did try to find the answer to q1 elsewhere but couldn't

rna-seq • 155 views
ADD COMMENTlink modified 26 days ago • written 27 days ago by oncdoc860

Hey Kevin

Huge thanks for this, appreciate it. I'll convert back as you've explained and then try using DESeq2 to normalising before analysis. Will have a go at your tutorial for survival...

p.s. after I messaged I did a bit more digging and noticed that Xena has 2 'versions' TCGA for download. One is the GDC version which I am using and another is, I guess, the 'legacy' data. The latter is RSEM normalised and then logged apparently.

ADD REPLYlink written 26 days ago by oncdoc860

Yes, you can also use the RSEM estimated counts for input to DESeq2 via tximport - I have done this many times with TCGA data. The DESeq2 Vignette goes over all of these options.

ADD REPLYlink written 26 days ago by Kevin Blighe37k
gravatar for Kevin Blighe
26 days ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:


The data HTSeq - Counts (n=585) relates to raw counts that have been incremented and logged (base 2) - these are not normalised (so, there is no library size adjustment). I am honestly not sure why Xena make these raw counts available as logged counts because using these for statistical analyses makes little sense.

Note the other message:

Data from the same sample but from different vials/portions/analytes/aliquotes is averaged

For the other files that relate to FPKM counts, those are normalised for library size but there is no cross sample normalisation; so, those are not suitable for differential expression analysis.

I would continue with the HTSeq counts and convert them back to integer counts via as.integer(((2^x) - 1)), as you have more or less already done. Then, I would use these as input to DESeq2 or EdgeR, where they would then be normalised 'properly'. For the correlation analysis, I would eventually proceed to log-transforming the normalised counts from DESeq2 or EdgeR, and perform the correlation on those logged counts using Pearson correlation. I would use these same log-transformed normalised counts for survival analysis. I have a tutorial here for conducting survival analysis but from microarray data: Survival analysis with gene expression (there are many other tutorials online).


ADD COMMENTlink written 26 days ago by Kevin Blighe37k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 965 users visited in the last hour