I'm a machine learning graduate student (PhD) working on a bioinformatics project. I'm a bit of a bioinformatics newbie, so sorry if this is a dumb or duplicate question.
I'm working with TCGA PanCancer data, and from what I've seen, cBioPortal seems to be the easiest source to use.
Specifically, I am working with the invasive breast carcinoma data from here on cBioPortal. There are four files with RNA-seq data:
My hope is to use the data from data_mrna_seq_v2_rsem.txt, but I can't find the units used there. The metadata says this is "mRNA Expression, RSEM (Batch normalized from Illumina HiSeq_RNASeqV2)", but I can't tell whether this is TPM, log_2(normalized_count + 1), or raw read counts.
I've seen some sources saying it's log_2(normalized_count + 1), but some of the values in data_mrna_seq_v2_rsem.txt, such as 408.076, seem far too high to be log-transformed.
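To make that sanity check concrete, here is a rough Python sketch of the reasoning. It is only a heuristic and cannot prove what the units are. The function name `describe_scale`, the cutoff of 30, and every example value other than 408.076 are my own assumptions for illustration, not anything taken from cBioPortal.

```python
import math


def describe_scale(values):
    """Summarise expression values to hint at (not prove) their units."""
    finite = [v for v in values if not math.isnan(v)]
    max_value = max(finite)
    # Raw read counts would normally all be whole numbers.
    all_integers = all(float(v).is_integer() for v in finite)
    # If v = log2(x + 1), then the untransformed x is 2**v - 1.
    # Guard against float overflow for very large v.
    implied_max = 2.0 ** max_value - 1 if max_value < 1000 else math.inf
    return {
        "max": max_value,
        "all_integers": all_integers,
        "implied_max_if_log2": implied_max,
        # Assumed cutoff: log2(x + 1) > 30 would mean x above roughly 1e9.
        "plausibly_log2": max_value <= 30,
    }


# Hypothetical values; only 408.076 comes from the file mentioned above.
example = [0.0, 12.5, 408.076, 3.2]
summary = describe_scale(example)
print(summary)
```

If 408.076 really were log_2(normalized_count + 1), the untransformed count would be about 2^408, which is not a plausible expression value. The non-integer values also argue against raw read counts.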
My question is: how do I find out what these units are? And is there a 'best' way to access TCGA data?
I'm a math-oriented guy and I'm the only person in my lab, so sorry if this question is a pain.