Understanding TCGA Pancancer data from cBioPortal
6 weeks ago
James ▴ 20

Hello All,

I'm a machine learning graduate student (PhD) working on a bioinformatics project. I'm a bit of a bioinformatics newbie so sorry if this is a dumb or duplicate question.

I'm working with TCGA PanCancer data and from what I've seen cBioPortal seems to be the easiest to use.

Specifically, I am working with the invasive breast carcinoma data from here on cBioPortal. There are four files with RNA-seq data:

  • data_mrna_seq_v2_rsem.txt
  • data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
  • data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
  • data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt

My hope is to use the data from data_mrna_seq_v2_rsem.txt but I can't find the units used here. The metadata says this is "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" but I can't tell if this is TPM, log_2(normalized_count+1), or raw read counts.

I've seen some sources saying it's log_2(normalized_count+1), but some of the values in data_mrna_seq_v2_rsem.txt, such as 408.076, seem far too high to be log-transformed.

My questions are: how do I find out what these units are? And is there a 'best' way to access TCGA data?

I'm a math-oriented guy and I'm the only person in my lab, so sorry if this question is a pain.

RNA Seq Cancer cBioPortal • 377 views

I personally prefer getting pan-cancer TCGA data from this Xena dataset page:
https://xenabrowser.net/datapages/?dataset=TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
It is clear what they did (they at least try to normalize between samples via DESeq2), and IIRC their code is available somewhere.

In any case, I'm pretty sure the file you listed is simply RSEM-estimated raw read counts. It's definitely not log-transformed: a log2 value of 408.076 would correspond to a count of about 2^408, which is impossible. It's also unlikely to be TPM, because if it were, the values for any given sample would always sum to 1 million across all genes/transcripts.
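If you want to test the TPM idea yourself, here is a minimal sketch of the check. It assumes the file is a tab-separated matrix with one row per gene and one column per sample, plus two identifier columns named Hugo_Symbol and Entrez_Gene_Id (that is an assumption about the layout, so check your header line). The function names are made up for this example, and the 1% tolerance is an arbitrary choice.

```python
import csv
import io
import math


def column_sums(handle, id_columns=("Hugo_Symbol", "Entrez_Gene_Id")):
    """Return ({sample: sum of values}, largest value) for a tab-separated matrix.

    The id_columns names are an assumption about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in id_columns]
    sums = {s: 0.0 for s in samples}
    largest = 0.0
    for row in reader:
        for s in samples:
            try:
                value = float(row[s])
            except (TypeError, ValueError):
                continue  # skip blank cells and markers such as "NA"
            if math.isnan(value):
                continue
            sums[s] += value
            largest = max(largest, value)
    return sums, largest


def looks_like_tpm(sums, tolerance=0.01):
    """True if every per-sample sum is within `tolerance` (relative) of 1e6."""
    return bool(sums) and all(
        abs(total - 1e6) <= tolerance * 1e6 for total in sums.values()
    )


# Toy data standing in for the real file; the values are invented.
toy = io.StringIO(
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\n"
    "A\t1\t408.076\t10\n"
    "B\t2\t0\tNA\n"
    "C\t3\t1000\t5\n"
)
toy_sums, toy_max = column_sums(toy)
print(toy_sums, toy_max, looks_like_tpm(toy_sums))
```

For the real file, replace the toy data with `open("data_mrna_seq_v2_rsem.txt", newline="")`. If the per-sample sums differ from each other and from 1 million, it is not TPM; if the largest value is in the hundreds or above, it is not log2-transformed either.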

6 weeks ago
Ernest Bonat ▴ 10

Hello James,

I ran into the same issue when I looked at this dataset before. Here is the dataset I used for applying classification machine learning algorithms: gene expression cancer RNA-Seq Data Set. I have done many projects applying machine learning to genomics datasets. Let me know how I can help you with it.

