Question

Range of TPM values in a dataset

1

4.6 years ago

Nimzo ▴ 10

Hi all,

I've been given an RNA-seq dataset containing TPM-normalized read counts for each sample, and I have two questions about the validity of the data:

1) What range of values is typical for a TPM dataset? In my case, the minimum and maximum of the average expression across samples are 0 and 127,530 respectively (spanning over 6 orders of magnitude), with 5 transcripts showing average expression > 10,000 and 32 transcripts having average expression between 1,000 and 10,000. The remaining features have average expression below 1,000. Is this OK?

2) If I am not mistaken, in a TPM dataset the sum of all transcript abundances within a given sample should always be 1,000,000. However, this is not the case with my data: the per-sample sums range between 500,000 and 700,000. Should I use these data for downstream analysis, or better not?
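The sum check described in question 2 can be sketched as follows. This is only a minimal illustration, assuming the table has already been loaded into a dictionary that maps sample names to lists of TPM values; the function name `flag_off_target_sums`, the 1% tolerance, and the toy numbers are made up for the example and are not taken from the dataset in question.

```python
import math

EXPECTED_TPM_SUM = 1_000_000  # TPM values within one sample should add up to one million


def flag_off_target_sums(tpm_by_sample, rel_tol=0.01):
    """Return {sample: sum} for samples whose TPM sum deviates from 1e6.

    tpm_by_sample: dict mapping a sample name to a list of TPM values
    rel_tol: allowed relative deviation (hypothetical default of 1%)
    """
    flagged = {}
    for sample, values in tpm_by_sample.items():
        total = math.fsum(values)  # fsum limits floating-point rounding error
        if not math.isclose(total, EXPECTED_TPM_SUM, rel_tol=rel_tol):
            flagged[sample] = total
    return flagged


# Toy data (invented): sample_A sums to one million, sample_B does not.
toy = {
    "sample_A": [600_000.0, 399_999.5, 0.5],
    "sample_B": [350_000.0, 200_000.0, 0.0],
}
print(flag_off_target_sums(toy))
```

A sum well below one million may mean that the table is a subset of the originally quantified features or that the values were rescaled after normalization, but the sums alone cannot tell which.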

Thanks in advance for your time.

RNA-Seq sequencing • 4.0k views

1

You should not use TPM for inter-sample comparison. Please use the search function to find out why it is a poor choice; this has been discussed here extensively. Get raw counts and feed them into standard tools such as edgeR or DESeq2. A meaningful differential analysis starts from raw counts, and the vignettes of these tools explain why.

4.6 years ago by ATpoint 82k

0

Thank you for the answer, but I never said I was going to use them for differential expression analysis. I want to correlate them with protein abundances (mass spec data) from matched samples. Any insights on the validity of the data I have?
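One possible way to set up such a correlation is sketched below. The post does not say which correlation measure is intended; a rank-based (Spearman) correlation is used here purely as an assumption, because ranks are unaffected by the wide dynamic range of TPM values. The function names and the toy numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy example (invented): TPM and protein abundance of one gene in five matched samples.
tpm = [12.0, 150.0, 3.5, 980.0, 40.0]
protein = [0.8, 2.1, 0.9, 5.6, 1.7]
print(round(spearman(tpm, protein), 3))
```

Because only ranks enter the calculation, rescaling every TPM value of the gene by the same positive factor would not change the result.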

4.6 years ago by Nimzo ▴ 10

0

Sorry for bumping this post, but my questions remain unanswered. Any fruitful feedback is gladly welcomed and appreciated.

4.6 years ago by Nimzo ▴ 10

1

Yes, TPM should sum to 1 million for every sample. If it doesn't, I would be careful. It is always tricky to accept data that you did not process yourself, for exactly that reason: you have no idea how others processed them and whether the processing was appropriate. If you have no other choice, use these data, but the best option would be to get the raw counts or even the FASTQ files.
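To illustrate why the sum is one million by construction, here is a minimal sketch of the usual TPM calculation: divide each count by the feature length, then rescale the resulting rates so that they add up to 1,000,000. The function name `counts_to_tpm` and the toy numbers are hypothetical, and plain feature length is assumed, whereas quantification tools often use an effective length instead.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts of one sample to TPM.

    counts: read counts per feature
    lengths_bp: feature lengths in base pairs (plain length is an assumption;
                tools may use effective length instead)
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Length-normalized rate per feature (reads per base).
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Rescale so that the rates of this sample sum to one million.
    return [rate / total_rate * 1_000_000 for rate in rates]


# Toy sample (invented): three features with different lengths.
tpm = counts_to_tpm([10, 20, 30], [1000, 2000, 1000])
print(tpm, sum(tpm))

# Dropping a feature after normalization lowers the sum below one million.
print(sum(tpm[:2]))
```

The last line shows one way a TPM table can end up with per-sample sums below one million: values computed on the full feature set and then subset without renormalizing.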

4.6 years ago by ATpoint 82k

0

Thank you very much for the reply. I will try to reanalyze the FASTQ files from scratch.

4.6 years ago by Nimzo ▴ 10

0

Yes, that is the best approach. Be sure to use a more sophisticated normalization method like vst or rlog from DESeq2.

4.6 years ago by ATpoint 82k