Interpreting TCGA .rsem.genes.results and .rsem.genes.normalized_results files.
2
20
7.5 years ago

Hello Members,

I am a complete newbie when it comes to interpreting RNA-Seq information (I'm actually studying computer science), but I have an interest in learning more about scientific research to see if this is a career I'd like to pursue. For this reason, I decided to do a summer internship in a biology lab to learn more about wet-lab techniques and possibly use my computer knowledge to help the lab I'm in.

Currently I'm running into some problems that I'm sure someone more familiar with this data can easily point out. I downloaded an archive for the disease of interest and retrieved the RNASeqV2 data, since I'm interested in determining expression levels. After extracting the archive I have 28 samples, each with the following types of files:

• .junction_quantification.txt
• .rsem.genes.results
• .rsem.isoforms.results
• .rsem.genes.normalized_results
• .rsem.isoforms.normalized_results
• .bt.exon_quantification.txt

I have been told that RNA expression is the most important, so I've been focusing on interpreting the .rsem.genes.results and .rsem.genes.normalized_results files, but I have been having difficulty. The first file has 4 columns labeled gene_id, raw_count, scaled_estimate, and transcript_id.

So I guess my first question is what is meant by the raw_count and scaled_estimate columns?

The second file (i.e. the .rsem.genes.normalized_results file) has only 2 columns, labeled gene_id and normalized_count. What is meant by normalized_count?

Also, the people I work with have told me that having normal cells to act as a control versus the cancer cells is important. Does the normalized results file include this information?

Any information you guys can give me would be greatly appreciated.

RNA-Seq • 28k views
26
7.5 years ago
Mattias Aine ▴ 610

I've recently started looking into TCGA RNAseq data as well, and I think reading the paper by Li and Dewey on RSEM would be a good first step. Also, as a beginner, I think you should limit yourself to the gene-level data at first and make sure you have a good grip on transcription as a whole before going into isoform- or exon-level stuff.

When it comes to the data, as I've understood it the "raw_count" is the estimated number of fragments derived from a given gene and the "scaled_estimate" is the fraction of transcripts made up by a given gene.

The normalized results (normalized_count) are a simple transformation of the "raw_count" that you can do yourself to check. For gene-level estimates you divide all "raw_count" values by the 75th percentile of the column (after removing zeros) and multiply the result by 1000. The normalized file therefore does not take any external factors into account; it simply rescales each sample so that the values are relative to that sample's 75th percentile, with a x1000 adjustment factor.
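To make that transformation concrete, here is a minimal Python sketch. The function name, the input layout (a dict of gene_id to raw_count for one sample), and the toy numbers are all made up for illustration. The exact percentile interpolation used by the TCGA pipeline is an assumption here (this uses the "inclusive" method of `statistics.quantiles`), so small differences from the official file are possible.

```python
from statistics import quantiles

def upper_quartile_normalize(raw_counts, scale=1000.0):
    """Divide each raw_count by the 75th percentile of the nonzero counts, then multiply by `scale`.

    `raw_counts` maps gene_id -> raw_count for ONE sample (hypothetical input layout).
    """
    nonzero = [c for c in raw_counts.values() if c > 0]
    if len(nonzero) < 2:
        raise ValueError("need at least two nonzero counts to compute a percentile")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    # The interpolation method is an assumption, not necessarily what the TCGA pipeline uses.
    q75 = quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: count / q75 * scale for gene, count in raw_counts.items()}

# Toy example with invented gene names and counts.
toy = {"geneA": 0, "geneB": 10, "geneC": 20, "geneD": 30, "geneE": 40, "geneF": 50}
normalized = upper_quartile_normalize(toy)
```

Comparing the output of something like this against the "normalized_count" column of a real sample is a quick way to check whether the description holds for your files.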

The "scaled_estimate" could maybe be used as well, e.g. by multiplying it by 1,000,000 to get "transcripts per million" (TPM), which Li and Dewey state should be more comparable across samples. I had a cursory look at the "scaled_estimate" column and saw that it never sums to one. That could be because of how isoforms are summed or something similar, but in one case the sum was as low as 0.34, and sums under 0.8 were not uncommon. So if it's a measure of the fraction of the transcript pool, that's a bit weird. I'm also not sure how this measure would behave in comparison to using the "raw_count", so it would be great if someone could weigh in on this point!
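A minimal sketch of that conversion and of the sum check, in Python. The function names and the toy fractions are invented for illustration; the input is assumed to be a dict of gene_id to "scaled_estimate" for a single sample.

```python
def scaled_estimate_to_tpm(scaled_estimates):
    """Multiply each scaled_estimate (a fraction) by 1e6 to get a TPM-like value."""
    return {gene: frac * 1_000_000 for gene, frac in scaled_estimates.items()}

def fraction_total(scaled_estimates):
    """Sum of the scaled_estimate column; 1.0 would mean the whole transcript pool is covered."""
    return sum(scaled_estimates.values())

# Toy sample with invented values whose fractions deliberately do not sum to one.
toy = {"geneA": 0.20, "geneB": 0.10, "geneC": 0.04}
tpm = scaled_estimate_to_tpm(toy)
total = fraction_total(toy)
```

If the total for a real sample is well below one, the values obtained this way will not add up to a million either, so they may not be strictly comparable to TPM computed elsewhere.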

Most TCGA papers I've seen use the log-transformed "normalized_count" in their analyses. I'm personally playing around with this right now to get a grip on the data and have done some clustering on log2("normalized_count"+1), median-centered across genes, and the results seem to make sense. When it comes to doing statistics I have stuck to nonparametric tests between groups, but I haven't looked much into other methods yet.
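As a rough Python sketch of that preprocessing: the function name and toy values are invented, and "median-centered across genes" is interpreted here as subtracting each gene's own median (taken over samples), which is an assumption about what was meant.

```python
from math import log2
from statistics import median

def log2_median_center(normalized_by_gene):
    """Log-transform and center each gene on its median across samples.

    `normalized_by_gene` maps gene_id -> list of normalized_count values,
    one per sample, in the same sample order for every gene (hypothetical layout).
    """
    centered = {}
    for gene, values in normalized_by_gene.items():
        # The +1 keeps genes with a count of zero defined, since log2(0) is not.
        logged = [log2(v + 1) for v in values]
        gene_median = median(logged)
        centered[gene] = [x - gene_median for x in logged]
    return centered

# Toy matrix with invented genes and three samples.
toy = {"geneA": [0, 1, 3], "geneB": [7, 7, 7]}
centered = log2_median_center(toy)
```

After centering, each gene's values describe how far a sample sits above or below that gene's typical level, which is what a clustering on relative expression needs.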

Also, when it comes to cancer and "normal cells", the best you can have is usually adjacent histologically normal tissue. This is not provided for most cases, but you can find those samples by the TCGA ID if you want to. If the XX in "TCGA-xx-xxxx-XXx-xxx-xxxx-xx" is "11" you are looking at a solid tissue normal, "01" is tumor, and "10" is blood derived normal. Usually, though, I would say you want to look at between-tumor differences, as the numbers, quality, and representativeness of the normals are not always good.
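For sorting files by sample type, something along these lines should work in Python. The function name and the example barcodes are hypothetical, and only the three codes mentioned above are mapped; any other code is reported as "other".

```python
# Codes taken from the description above; other sample-type codes exist but are not mapped here.
SAMPLE_TYPES = {
    "01": "tumor",
    "10": "blood derived normal",
    "11": "solid tissue normal",
}

def sample_type(barcode):
    """Return the sample type encoded in the fourth hyphen-separated field of a TCGA barcode."""
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError(f"barcode too short to contain a sample type: {barcode!r}")
    code = fields[3][:2]  # the two digits written as XX in TCGA-xx-xxxx-XXx-...
    return SAMPLE_TYPES.get(code, f"other ({code})")

# Hypothetical barcodes, invented for illustration only.
examples = ["TCGA-AB-1234-01A-01R-5678-07", "TCGA-AB-1234-11A-01R-5678-07"]
types = [sample_type(b) for b in examples]
```

Grouping barcodes by patient (the first three fields) and then by this sample type is one way to find tumor/normal pairs, where they exist.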

Hope this helps a bit!

0

Thanks for the information, man! This really helps. When you say that it's better to look at between-tumor differences, how would this information be useful? I've looked around the forum a while and see that lack of controls seems to be a problem many are facing (Tcga Lack Of Controls - Workarounds?). Have you discovered a workaround that involves looking at between-tumor differences?

1

I can't necessarily say it's better to look at between-tumor differences, just that it's probably all you can do. But just as you might look for clues as to what is going on in a tumor by comparing it with adjacent normal tissue, you can do the same by looking at what differentiates a given tumor (or set of tumors) from other tumors. This is basically how all gene expression studies are performed, as adjacent normal tissue is hard to come by and, in the case of my field (bladder cancer, BLCA), is rarely normal. Also, if you have a set of tumors representing all stages and grades of disease, you will have a fairly broad biological spectrum that you can use to tease out the most important themes in the data.

The link you provided discusses methylation data which I feel is more of my home turf. Whether or not you need normals here again depends on the question you want to ask. I'm generally interested in characterizing what differentiates one group of tumors from another, and for that I don't necessarily need normals.

0

Makes sense, thanks again!

0

@Mattias Aine, do you have any new insight on why the sum of fractions is not one and sum <0.8 is not uncommon, please? I have had the same observation.

0

I found that genes don't include all isoforms, which explains why isoform-level scaled estimates sum to 1 while gene-level scaled estimates don't. Details are in https://gitlab.com/zyxue/understanding-firebrowse-data-format/blob/master/confirm-relationship-between-gene-level-and-isoform-level-scaled-estimates.ipynb

0

0

Sorry, I just made it public.

0

Hello Mr Aine,

I am doing prognostic analysis by representing TF activity as a two-gene ratio across samples. This works well on probeset datasets (hgu133a, hgu133plus2). But when I use the normalized count (RSEM) from TCGA, I find that the ratio strategy doesn't work anymore, while the direct difference (geneAcount-geneBcount) gives good results. Do you think it's reasonable to use the difference instead?

Thank you!

0
6.7 years ago
juara ▴ 40

Hello

I have a question related to this thread. I would appreciate it if you could help me analyze the TCGA data. What I have done so far:

-Match these in Excel using the barcode

Now my question is whether I should use the "normalized_count" for my analysis. For example, I want to see the differential expression of EGFR in the no-tumor group vs. the with-tumor group. Can I average the "normalized_count" within each group and compare the two groups? Or should I log-transform first? Also, I did not understand why you use log2(normalized_count + 1). Can you explain a bit?

Also, do you think I should calculate Z-score before comparing and analyzing these data?

Thanks

3

Please post it as a new question.