Normalisation of RNAseq data from UCSC Xena Browser
1
5
Entering edit mode
3.0 years ago
oncdoc86 ▴ 50

Dear all

I am very new to this so apologies in advance..

I am trying to teach myself the basics of R and bioinformatics and would like to attempt some analysis of TCGA data

I would like just do a simple correlation of expression of gene y with gene z

I then want to perform Kaplan-Meier analysis of overall survival after dividing patients into high and low expression of gene y (using the median to split the cohorts)

My questions are about normalising the data, before I perform the analyses

1) The UCSC page explains that the data has been log(x+1) transformed. I would like to know if the raw data was normalised for library size prior to log transformation? If not, is this necessary?

I obtained the count matrix and back-transformed the counts ((2^x) - 1) and I then summed the total counts per sample and obtained different values per sample, which makes me assume that the counts were not corrected for library size

2) Finally, assuming that the data have not been normalised for library size (or distribution etc) what method would you suggest to normalise for my analysis. I understand I could just do TPM or could do TMM or another method such as that used by DESeq2

Thanks in advance for any help, I did try to find the answer to q1 elsewhere but couldn't

rna-seq • 4.3k views
0
Entering edit mode

Hey Kevin

Huge thanks for this, appreciate it. I'll convert back as you've explained and then try using DESeq2 to normalising before analysis. Will have a go at your tutorial for survival...

p.s. after I messaged I did a bit more digging and noticed that Xena has 2 'versions' TCGA for download. One is the GDC version which I am using and another is, I guess, the 'legacy' data. The latter is RSEM normalised and then logged apparently.

1
Entering edit mode

Yes, you can also use the RSEM estimated counts for input to DESeq2 via tximport - I have done this many times with TCGA data. The DESeq2 Vignette goes over all of these options.

10
Entering edit mode
3.0 years ago

Hey,

The data HTSeq - Counts (n=585) relates to raw counts that have been incremented and logged (base 2) - these are not normalised (so, there is no library size adjustment). I am honestly not sure why Xena make these raw counts available as logged counts because using these for statistical analyses makes little sense.

Note the other message:

Data from the same sample but from different vials/portions/analytes/aliquotes is averaged

For the other files that relate to FPKM 'counts' (expression levels), those are normalised for library size but there is no cross sample normalisation; so, those are not suitable for differential expression analysis.

I would continue with the HTSeq counts and convert them back to integer counts via as.integer(((2^x) - 1)) (edit: March 21st, 2020: to apply it to an entire data-matrix or -frame, do round(((2^x) - 1), 0), as you have more or less already done. Then, I would use these as input to DESeq2 or EdgeR, where they would then be normalised 'properly'. For the correlation analysis, I would eventually proceed to log-transforming the normalised counts from DESeq2 or EdgeR, and perform the correlation on those logged counts using Pearson correlation. I would use these same log-transformed normalised counts for survival analysis. I have a tutorial here for conducting survival analysis but from microarray data: Survival analysis with gene expression (there are many other tutorials online).

Kevin

Edit June 8, 2019:

I contacted UCSC just to confirm this, and here is their response:

We take the data from the GDC and do the transformation that you describe: data from the GDC + 1 and logged (base2).

The GDC has this to say about the counts data: HT-Seq Read Counts: The number of reads aligned to each gene, calculated by HT-Seq https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

If you have any further questions about the GDC data, I would recommend contacting the GDC directly: https://gdc.cancer.gov/support.

1
Entering edit mode

In case anyone else finds this while trying to use Xena's toil recompute data (kallisto) -- it is also log transformed. They do not mention this on the data page so I found out by comparing the total read counts with GDC's expression matrix for the same sample.