Question

Understanding TCGA Pancancer data from cBioPortal

1


20 months ago

James ▴ 30

Hello All,

I'm a machine learning graduate student (PhD) working on a bioinformatics project. I'm a bit of a bioinformatics newbie so sorry if this is a dumb or duplicate question.

I'm working with TCGA PanCancer data and from what I've seen cBioPortal seems to be the easiest to use.

Specifically, I am working with the invasive breast carcinoma data from cBioPortal. There are four files with RNA-seq data:

data_mrna_seq_v2_rsem.txt
data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt

My hope is to use the data from data_mrna_seq_v2_rsem.txt but I can't find the units used here. The metadata says this is "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)" but I can't tell if this is TPM, log_2(normalized_count+1), or raw read counts.

I've seen some sources saying it's log_2(normalized_count+1), but some of the values in data_mrna_seq_v2_rsem.txt, such as 408.076, seem to be way too high to be log-transformed.

My questions are: how do I find out what these units are? And is there a 'best' way to access TCGA data?

I'm a math-oriented guy and I'm the only person in my lab, so sorry if this question is a pain.

Cancer RNA-Seq cBioPortal • 1.9k views

updated 14 months ago by Ram 43k • written 20 months ago by James ▴ 30

0


I personally prefer getting pan-cancer TCGA data from this Xena dataset page: https://xenabrowser.net/datapages/?dataset=TCGA-GTEx-TARGET-gene-exp-counts.deseq2-normalized.log2&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

It is clear what they did (and they at least try to normalize between samples via DESeq2), and they have their code available somewhere iirc.

But in any case, I'm pretty sure the file you listed is simply RSEM-estimated raw read counts. It's definitely not log-transformed. If it were TPM, the values across all genes/transcripts for any given sample would always sum to 1 million.
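The TPM check described above can be sketched in a few lines of Python. This is only a rough illustration, not cBioPortal's own tooling: it assumes the file is tab-separated with two leading identifier columns (such as Hugo_Symbol and Entrez_Gene_Id) followed by one column per sample, and the function name `summarize_samples` and the toy values are made up for the example.

```python
import csv
import io
import math


def summarize_samples(handle, id_columns=2):
    """Return {sample_id: (column_sum, column_max)} for an expression matrix.

    Assumption: the first `id_columns` columns identify the gene and every
    remaining column holds the values for one sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[id_columns:]
    sums = [0.0] * len(samples)
    maxes = [float("-inf")] * len(samples)
    for row in reader:
        for i, cell in enumerate(row[id_columns:len(header)]):
            try:
                value = float(cell)
            except ValueError:
                continue  # skip "NA" or empty cells
            if math.isnan(value):
                continue  # skip cells that parse as NaN
            sums[i] += value
            maxes[i] = max(maxes[i], value)
    return {s: (sums[i], maxes[i]) for i, s in enumerate(samples)}


# Toy matrix (made-up numbers): S1 looks like counts, S2 sums to one million.
toy = (
    "Hugo_Symbol\tEntrez_Gene_Id\tS1\tS2\n"
    "GENE_A\t1\t408.076\t600000\n"
    "GENE_B\t2\t12.5\t400000\n"
    "GENE_C\t3\tNA\t0\n"
)
stats = summarize_samples(io.StringIO(toy))
for sample, (total, largest) in stats.items():
    tpm_like = math.isclose(total, 1e6, rel_tol=1e-3)
    print(sample, total, largest, "TPM-like" if tpm_like else "not TPM-like")
```

To run it on the real file, pass `open("data_mrna_seq_v2_rsem.txt", newline="")` instead of the `io.StringIO` object. The per-sample maximum is also informative: a log2-scale value of 408 would correspond to an untransformed value of about 2^408, which is not plausible. One caveat is that if the file contains only a subset of genes, TPM values would sum to less than 1 million, so a failed check is suggestive rather than conclusive.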

20 months ago by dsull ★ 5.8k

Ram · Answer 1 · 2022-08-11

0


20 months ago

Ernest Bonat ▴ 10

Hello James,

I ran into the same issue when I looked at this dataset before. Here is the dataset I used to apply classification machine learning algorithms: gene expression cancer RNA-Seq Data Set. I have done many projects applying machine learning to genomics datasets. Let me know how I can help you with it.

updated 20 months ago by Ram 43k • written 20 months ago by Ernest Bonat ▴ 10

score 0 · Answer 2 · 2023-01-31

I guess you already got the answer, but I recommend reading the meta_XXX.txt file in the same repository. For example, you can find a brief description of 'data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt' in 'meta_mrna_seq_v2_rsem_zscores_ref_normal_samples'. It contains the following:

cancer_study_identifier: coadread_tcga_pan_can_atlas_2018
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
stable_id: rna_seq_v2_mrna_median_all_sample_ref_normal_Zscores
show_profile_in_analysis_tab: TRUE
profile_name: mRNA expression z-scores relative to normal samples (log RNA Seq V2 RSEM)
profile_description: Expression z-scores of tumor samples compared to the expression distribution of all log-transformed mRNA expression of adjacent normal samples in the cohort.
data_filename: data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
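If you want to read these meta files programmatically, the `key: value` layout shown above is simple enough to parse by hand. The following is a minimal sketch; the helper name `parse_meta` is hypothetical, and skipping blank lines and lines starting with `#` is a precaution on my part rather than something the file format is known to require.

```python
def parse_meta(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # precaution: ignore blank lines and comment-like lines
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that actually contain a colon
            meta[key.strip()] = value.strip()
    return meta


# A few lines copied from the meta file contents quoted above.
example = """\
cancer_study_identifier: coadread_tcga_pan_can_atlas_2018
genetic_alteration_type: MRNA_EXPRESSION
datatype: Z-SCORE
data_filename: data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
"""
meta = parse_meta(example)
print(meta["datatype"], "->", meta["data_filename"])
```

Splitting on the first colon only means a value that itself contains a colon is kept whole. For a real file, read its text with `open(path).read()` and pass it to the function.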