Question

Sum of TPM values across all genes separates tumors from normals in some TCGA data sets -- what gives?

10.1 years ago

naxerova ▴ 20

Hi all,

I have just started playing with some RSEM RNA-seq data from the TCGA. To get to know the data better, I am running some exploratory analyses/sanity checks "for fun". One observation that really surprised me (particularly coming from a microarray world where everything is quantile normalized) is that when I order tumor/normal samples from the same tissue background by their sum of log2(TPM+1) across all genes, the normals will frequently cluster either at the bottom or at the top of the list. This happens in some, but not in all data sets. E.g. the effect is really pronounced for LIHC, but not for BRCA.

This phenomenon seems a bit disconcerting, and I do not understand its cause. Any ideas/explanations would be much appreciated!

Thanks a lot in advance.

rsem RNA-Seq tcga • 4.9k views

updated 2.5 years ago by Ram 45k • written 10.1 years ago by naxerova ▴ 20

Isn't this essentially showing that the distribution of expression values is different? I'm not sure why that'd be surprising for cancer vs. normal comparisons... we kind of expect expression profiles to be heavily changed in cancer.

Edit: Note that even if the sum(TPM) is the same across samples, sum(log2(TPM+1)) needn't be.
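
To make that concrete, here is a minimal sketch with two made-up 4-gene samples (the values are hypothetical, not TCGA data). Both sum to exactly 1e6 TPM, but because the log compresses large values, the sample whose expression is concentrated in one gene gets a much smaller log sum.

```python
import math

def log_sum(tpm):
    """Sum of log2(TPM + 1) over all genes in one sample."""
    return sum(math.log2(x + 1) for x in tpm)

# Two hypothetical samples; each sums to exactly 1e6 TPM.
even_sample = [250_000.0, 250_000.0, 250_000.0, 250_000.0]
skewed_sample = [999_997.0, 1.0, 1.0, 1.0]

total_even = sum(even_sample)
total_skewed = sum(skewed_sample)
log_even = log_sum(even_sample)
log_skewed = log_sum(skewed_sample)

print("sum(TPM):", total_even, total_skewed)
print("sum(log2(TPM+1)):", round(log_even, 2), round(log_skewed, 2))
```

So a tissue in which a handful of genes soak up most of the TPM budget could plausibly sort to one end of the list on the log sum alone.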

updated 2.5 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k

I completely agree that you'd expect a large portion of genes to be up- or downregulated in cancer vs. normal. But I wouldn't necessarily expect the sum of all expression values to be different (by the way, this happens independently of whether I look at sum(TPM) or sum(log2(TPM+1))). After all, a normalization for library size has already taken place. In microarray data analysis, such effects would be considered artifacts and normalized away. That's probably also not the best solution. :) But the large distribution differences here do pose some problems -- or at least I think they do. For example, how can you calculate cumulative expression scores for gene sets of interest without worrying that the results will be skewed/dominated by the overall differences?

updated 2.5 years ago by Ram 45k • written 10.1 years ago by naxerova ▴ 20

Yeah, my comment was directed only at the log2 sum. Note that a library size normalization is not robust and doesn't really suffice if you actually need to compare values across samples (the same is the case for RPKM/FPKM for the same reason).
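
A toy illustration of why scaling to a fixed total isn't robust (all numbers and the `to_tpm` helper are hypothetical): genes A to C have identical underlying abundance in both samples, and only gene D changes. Because every sample is forced to sum to 1e6, the jump in D drags the values of A to C down.

```python
def to_tpm(rates):
    """Rescale length-normalized per-gene rates so they sum to 1e6."""
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical length-normalized rates for genes A, B, C, D.
# A-C are unchanged between samples; only D is strongly up in the tumor.
normal_rates = [10.0, 20.0, 30.0, 40.0]
tumor_rates = [10.0, 20.0, 30.0, 940.0]

normal_tpm = to_tpm(normal_rates)
tumor_tpm = to_tpm(tumor_rates)

for gene, n, t in zip("ABCD", normal_tpm, tumor_tpm):
    print(gene, round(n), round(t))
```

In this sketch A, B and C each look 10-fold lower in the tumor even though nothing about them changed.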

For looking at gene sets, normally one uses a rank-based method, so only the relative values within a sample would matter.
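
As a rough sketch of the idea (this is a simplified mean-rank score made up for illustration, not GSEA or any published method; `average_ranks` and `gene_set_score` are hypothetical helpers): rank the genes within each sample, give tied values such as zeros their shared average rank, and score a gene set by the mean rank of its members.

```python
def average_ranks(values):
    """Rank values from 1 (lowest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def gene_set_score(expression, gene_set):
    """Mean within-sample rank of the set's genes, divided by the gene count."""
    genes = list(expression)
    ranks = dict(zip(genes, average_ranks([expression[g] for g in genes])))
    return sum(ranks[g] for g in gene_set) / (len(gene_set) * len(genes))

# Hypothetical sample with two zero-expression genes.
sample = {"g1": 0.0, "g2": 0.0, "g3": 5.0, "g4": 50.0, "g5": 500.0}
scaled = {g: v * 10 for g, v in sample.items()}  # same sample, globally rescaled

score = gene_set_score(sample, ["g4", "g5"])
score_scaled = gene_set_score(scaled, ["g4", "g5"])
print(score, score_scaled)
```

The score is unchanged by the global rescaling, which is the point: any monotone per-sample shift in the overall distribution drops out.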

updated 2.5 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k

Good point about the ranks! That would of course solve the issue, but one also loses a lot of information from a sophisticated data set. Also, I am not sure how to rank all the genes with 0 values. I am interested to read more about this normalization problem -- are there any papers you can recommend?

THANKS so much for all your help!

updated 2.5 years ago by Ram 45k • written 10.1 years ago by naxerova ▴ 20

TPMs are a relatively recent thing, so I haven't seen much actually published in papers yet (you'll find most information on blogs at this point). Having said that, since the normalization issue is shared with RPKM/FPKM, the edgeR and DESeq papers should contain some discussion of this.

BTW, I think RSEM produces expected counts. I personally prefer to use those. They obviously require normalization, but the typical methods (e.g., from edgeR, or DESeq, or quantile normalization) work fine with them.
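
For a feel of what such a normalization does, below is a simplified pure-Python sketch of the median-of-ratios idea behind DESeq-style size factors. It is not the DESeq or edgeR API; the `size_factors` function and the count matrix are made up for illustration, and in practice you'd use the packages themselves.

```python
import math
from statistics import median

def size_factors(counts):
    """counts[gene][sample] -> one scaling factor per sample (median of ratios)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # a gene with any zero has no usable geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median is robust to a minority of strongly changed genes.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical expected counts: 4 genes x 2 samples. Sample 2 is sequenced
# twice as deeply, and gene 4 is additionally strongly up in sample 2.
counts = [
    [100.0, 200.0],
    [50.0, 100.0],
    [30.0, 60.0],
    [10.0, 2000.0],
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
print(normalized)
```

Here the factors differ by the 2-fold depth difference, genes 1 to 3 come out equal after scaling, and gene 4 still stands out, rather than its increase being smeared across the other genes as with a total-sum scaling.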

updated 2.5 years ago by Ram 45k • written 10.1 years ago by Devon Ryan 105k