Question: Sum of TPM values across all genes separates tumors from normals in some TCGA data sets -- what gives?
Asked 5.2 years ago by naxerova (United States):

Hi all,

I have just started playing with some RSEM RNA-seq data from the TCGA. To get to know the data better, I am running some exploratory analyses/sanity checks "for fun". One observation that really surprised me (particularly coming from a microarray world where everything is quantile normalized) is that when I order tumor/normal samples from the same tissue background by their sum of log2(TPM+1) across all genes, the normals will frequently cluster either at the bottom or at the top of the list. This happens in some, but not in all data sets. E.g. the effect is really pronounced for LIHC, but not for BRCA. 

This phenomenon seems a bit disconcerting, and I do not understand its cause. Any ideas/explanations would be much appreciated!

Thanks a lot in advance.

Tags: rna-seq, rsem, tcga

Isn't this essentially showing that the distribution of expression values is different? I'm not sure why that'd be surprising for cancer vs. normal comparisons...we kind of expect expression profiles to be heavily changed in cancer.

Edit: Note that even if the sum(TPM) is the same across samples, sum(log2(TPM+1)) needn't be.
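To make the edit above concrete, here is a minimal sketch with two made-up 4-gene samples (the values are hypothetical, not TCGA data). Both sum to exactly 1e6 TPM, yet their sums of log2(TPM+1) differ, because the log compresses a few very large values much more than many moderate ones.

```python
import math

def sum_log2_tpm(tpm):
    """Sum of log2(TPM + 1) over all genes in one sample."""
    return sum(math.log2(x + 1.0) for x in tpm)

# Two hypothetical 4-gene samples, both summing to 1e6 TPM.
even = [250_000.0] * 4                    # expression spread evenly
skewed = [999_997.0, 1.0, 1.0, 1.0]       # one gene dominates

assert sum(even) == sum(skewed) == 1_000_000.0

# Same TPM total, different log-sum: the evenly spread sample is larger.
print(sum_log2_tpm(even))
print(sum_log2_tpm(skewed))
```

So a sample whose expression is concentrated in a few genes will tend to sort lower by sum(log2(TPM+1)) than one with a flatter profile, even with identical TPM totals.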

written 5.2 years ago (modified 5.2 years ago) by Devon Ryan

I completely agree that you'd expect a large portion of genes to be up- or downregulated in cancer vs. normal. But I wouldn't necessarily expect the sum of all expression values to be different (by the way, this happens independently of whether I look at sum(TPM) or sum(log2(TPM+1))). After all, a normalization for library size has already taken place. In microarray data analysis, such effects would be considered artifacts and normalized away. That's probably also not the best solution. :) But the large distribution differences here do pose some problems -- or at least I think they do. For example, how can you calculate cumulative expression scores for gene sets of interest without worrying that the results will be skewed/dominated by the overall differences?

written 5.2 years ago by naxerova

Yeah, my comment was directed only at the log2 sum. Note that a library size normalization is not robust and doesn't really suffice if you actually need to compare values across samples (the same is the case for RPKM/FPKM for the same reason).
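A small sketch of why this matters, using invented numbers (the `to_tpm` helper and both samples are hypothetical). The two samples are identical except that one gene is 16-fold higher in the "tumor". Because everything is rescaled to the same total, the four unchanged genes appear to drop to a quarter of their "normal" value.

```python
def to_tpm(rates):
    """Rescale length-normalized per-gene rates so they sum to 1e6."""
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical length-normalized read rates for 5 genes.
normal = [100.0, 100.0, 100.0, 100.0, 100.0]
# Same as `normal`, except the last gene is 16-fold higher.
tumor = [100.0, 100.0, 100.0, 100.0, 1600.0]

tpm_normal = to_tpm(normal)
tpm_tumor = to_tpm(tumor)

# Ratio tumor/normal per gene: the unchanged genes look 4-fold "down",
# and the truly changed gene looks only 4-fold "up" instead of 16-fold.
for n, t in zip(tpm_normal, tpm_tumor):
    print(round(t / n, 3))
```

The same composition effect applies to RPKM/FPKM, since they also divide by a per-sample total.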

For looking at gene sets, normally one uses a rank-based method, so only the relative values within a sample would matter.
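As a rough illustration of the rank-based idea, here is a simplified mean-rank score. It is only a sketch, not GSEA or any published method; the function names and the toy sample are made up. Tied values (for example, all genes at 0) share the mean of the ranks they span, and the score does not change if every value in the sample is multiplied by a constant.

```python
def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def gene_set_rank_score(expr, gene_set):
    """Mean within-sample rank of the gene set, divided by the number of genes."""
    genes = list(expr)
    ranks = dict(zip(genes, average_ranks([expr[g] for g in genes])))
    members = [g for g in gene_set if g in ranks]
    if not members:
        raise ValueError("no gene from the set is present in the sample")
    return sum(ranks[g] for g in members) / (len(members) * len(genes))

# Hypothetical sample: two genes at zero, two expressed.
sample = {"A": 0.0, "B": 0.0, "C": 5.0, "D": 10.0}
scaled = {g: v * 4.0 for g, v in sample.items()}  # same sample, different scale

score = gene_set_rank_score(sample, ["C", "D"])
score_scaled = gene_set_rank_score(scaled, ["C", "D"])
print(score, score_scaled)
```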

written 5.2 years ago by Devon Ryan

Good point about the ranks! That would of course solve the issue, but one also loses a lot of information from a sophisticated data set. Also, I am not sure how to rank all the genes with 0 values. I am interested to read more about this normalization problem -- are there any papers you can recommend? 

THANKS so much for all your help! 

written 5.2 years ago by naxerova

TPMs are a relatively recent thing, so I haven't seen much actually published in papers yet (you'll find most information on blogs at this point). Having said that, since the normalization issue is shared with RPKM/FPKM, the edgeR and DESeq papers should contain some discussion of this.

BTW, I think RSEM produces expected counts. I personally prefer to use those. They obviously require normalization, but the typical methods (e.g., from edgeR, or DESeq, or quantile normalization) work fine with them.
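For a feel of what such a normalization does, below is a minimal pure-Python sketch of the median-of-ratios idea described in the DESeq paper. It is not the DESeq or edgeR code, and the `size_factors` function and the counts are invented for illustration. In practice you would use the packages themselves.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of per-gene counts in the same
    gene order. Genes with a zero in any sample are skipped.
    """
    n_genes = len(counts[0])
    log_geo_means = {}
    for g in range(n_genes):
        column = [sample[g] for sample in counts]
        if all(c > 0 for c in column):
            log_geo_means[g] = sum(math.log(c) for c in column) / len(column)
    factors = []
    for sample in counts:
        log_ratios = [math.log(sample[g]) - m for g, m in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical expected counts: the tumor is sequenced twice as deeply
# and its last gene is additionally 16-fold higher.
normal = [100.0, 100.0, 100.0, 100.0, 100.0]
tumor = [200.0, 200.0, 200.0, 200.0, 3200.0]

sf_normal, sf_tumor = size_factors([normal, tumor])
normalized_normal = [c / sf_normal for c in normal]
normalized_tumor = [c / sf_tumor for c in tumor]

# The four unchanged genes now match; only the last gene differs.
print(normalized_normal)
print(normalized_tumor)
```

Because the median ignores the one extreme gene, the unchanged genes line up after scaling, which a plain division by the sample total would not achieve.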

written 5.2 years ago by Devon Ryan