Question

TCGA data analysis, raw_count?!

8.9 years ago

juara ▴ 40

Hello

I would appreciate your help with analyzing the TCGA data. What I have done so far:

Downloaded the PRAD (prostate) RNAseqV2 data, consisting of 550 patients
Downloaded the clinical data for PRAD
Matched the two in Excel using the barcode

Now my question is whether I should use "raw_count" or "scaled_estimate" for my analysis. For example, I want to look at the differential expression of EGFR in the "no tumor" group vs. the "with tumor" group. Can I average the "raw_count" values and compare the two groups? Or should I apply some sort of transformation first? Or is scaled_estimate multiplied by 10E6 more accurate? The scaled_estimate values are very low, around 2-10*10E-5; does that mean the gene is not being transcribed much?

Sorry for being naive in this field. I would be grateful for any ideas and comments.

Thanks

RNA-Seq R TCGA • 4.3k views

Ram · Answer 1 · 2015-06-18

You can read more about the TCGA data types here. But basically, raw_count is the estimated number of reads assigned to that gene, while scaled_estimate is the estimated fraction of transcripts in the sample that come from that gene. Notice that you also have the 'normalized_counts' data, which is the raw data rescaled by the 75th percentile of that sample's column (upper-quartile normalization).
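
To make the relationship between these columns concrete, here is a minimal Python sketch. It rests on a few assumptions that are not stated above: that multiplying scaled_estimate by 1e6 gives a TPM-like value (the usual reading of RSEM output), that the upper-quartile step uses the 75th percentile of the non-zero counts, and that the result is rescaled to 1000. The function names and the toy gene values are made up for illustration, and this is not the exact TCGA pipeline code.

```python
import statistics


def tpm_from_scaled_estimate(scaled_estimate):
    """Convert a per-gene fraction into transcripts per million (assumed convention)."""
    return scaled_estimate * 1e6


def upper_quartile_normalize(raw_counts, target=1000.0):
    """Rescale one sample's raw counts by its 75th percentile.

    raw_counts: dict mapping gene name -> raw_count for ONE sample.
    Assumption: the percentile is taken over non-zero counts and the
    result is multiplied by `target` (1000 here).
    """
    nonzero = [c for c in raw_counts.values() if c > 0]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile
    q75 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: c / q75 * target for gene, c in raw_counts.items()}


# Toy sample with invented numbers (GENE_A..GENE_E are placeholders)
sample = {
    "EGFR": 1200.0,
    "GENE_A": 400.0,
    "GENE_B": 800.0,
    "GENE_C": 1600.0,
    "GENE_D": 2000.0,
    "GENE_E": 0.0,
}

normalized = upper_quartile_normalize(sample)
print(normalized["EGFR"])
print(tpm_from_scaled_estimate(5e-5))
```

Under the TPM assumption, a scaled_estimate of a few times 10^-5 corresponds to tens of transcripts per million, so such small-looking fractions do not by themselves mean the gene is barely transcribed.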

I believe that most people take the normalized counts, log2-transform them, and then compare between samples. Because the normalized counts are already scaled within each sample, you can compare different samples without further normalization.
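
As a rough sketch of that workflow for a single gene such as EGFR, the Python below matches expression values to clinical groups by barcode, applies log2(x + 1), and compares the group means. The barcodes, values, and group labels are invented, and the function names are hypothetical. The pseudocount of 1 is my own choice to avoid log2(0). The only TCGA-specific fact used is that the first 12 characters of a barcode identify the patient.

```python
import math
from statistics import mean


def patient_id(barcode):
    """The first 12 characters of a TCGA barcode identify the patient."""
    return barcode[:12]


def log2_group_means(expression, group_by_patient, pseudocount=1.0):
    """Mean log2(normalized count + pseudocount) per clinical group.

    expression: dict mapping sample barcode -> normalized count for one gene.
    group_by_patient: dict mapping patient barcode -> group label.
    Samples without a clinical match are skipped.
    """
    groups = {}
    for barcode, value in expression.items():
        group = group_by_patient.get(patient_id(barcode))
        if group is None:
            continue
        groups.setdefault(group, []).append(math.log2(value + pseudocount))
    return {group: mean(values) for group, values in groups.items()}


# Invented EGFR normalized counts keyed by made-up sample barcodes
egfr = {
    "TCGA-AA-0001-01A-11R-0000-07": 1023.0,
    "TCGA-AA-0002-01A-11R-0000-07": 2047.0,
    "TCGA-AA-0003-01A-11R-0000-07": 255.0,
    "TCGA-AA-0004-01A-11R-0000-07": 511.0,
    "TCGA-AA-0005-01A-11R-0000-07": 900.0,  # no clinical record, skipped
}

# Invented clinical labels keyed by patient barcode
clinical = {
    "TCGA-AA-0001": "with tumor",
    "TCGA-AA-0002": "with tumor",
    "TCGA-AA-0003": "tumor free",
    "TCGA-AA-0004": "tumor free",
}

means = log2_group_means(egfr, clinical)
log2_fold_change = means["with tumor"] - means["tumor free"]
print(means, log2_fold_change)
```

The difference of mean log2 values is only a descriptive log2 fold change; deciding whether it is significant would still need a statistical test on the per-sample values.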

I would recommend two very useful tools that will save you much time handling the data without tedious spreadsheet work:

https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/ - cancer browser
http://www.cbioportal.org/ - cBioPortal

Both tools allow you to analyze the TCGA data very easily.