Hi all,
I've been given an RNA-seq dataset containing TPM-normalized read counts for each sample, and I have 2 questions regarding the validity of the data:
1) What range of values does a TPM dataset normally have? In my case, the min and max of the average gene expression across samples are 0 and 127,530 respectively (spanning over 6 orders of magnitude), with 5 transcripts showing average expression > 10,000 and 32 transcripts having average expression between 1,000 and 10,000. The remaining features have average expression below 1,000. Is this OK?
2) If I am not mistaken, in a TPM dataset the sum of all transcript abundances within a given sample should always be 1,000,000. However, this is not the case with my data: the per-sample sums range from 500,000 to 700,000. Should I use these data for downstream analysis, or is it better not to?
Thanks in advance for your time.
You should not use TPM for inter-sample comparison. Please use the search function to find out why it is a poor choice; we have discussed this here extensively before. Get raw counts and feed them into standard tools such as edgeR or DESeq2. A meaningful differential analysis starts from raw counts. The vignettes of these tools explain why.
Thank you for the answer, but I never said I am going to use them for differential expression analysis. I want to correlate them with protein abundances (mass spec data) from matched samples. Any insights on the validity of the data I have?
Sorry for bumping this post, but my questions remain unanswered. Any helpful feedback is welcome and appreciated.
Yes, TPM should sum to 1 million for every sample. If it doesn't, I would be careful. It is always tricky to accept data that you did not process yourself, for exactly that reason: you have no idea how others processed them and whether it was appropriate. If you have no other choice, use these data, but the best option would be to get the raw counts or even the fastq files.
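To make the "sums to 1 million" point concrete, here is a minimal Python sketch of the usual TPM calculation for a single sample, plus a simple sum check. The function names (`counts_to_tpm`, `tpm_sum_ok`), the tolerance, and the toy numbers are made up for illustration. The last step shows one possible (unconfirmed for the dataset in question) way a sum can end up below 1,000,000: removing features after normalization.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts for one sample to TPM.

    counts: reads per transcript
    lengths_kb: transcript lengths in kilobases, in the same order
    """
    # Step 1: length-normalize to reads per kilobase (RPK).
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rpk)
    if total == 0:
        raise ValueError("sample has no reads")
    # Step 2: scale so that the values of the sample sum to one million.
    return [r / total * 1e6 for r in rpk]


def tpm_sum_ok(tpm_values, rel_tol=1e-3):
    """Return True if the TPM values of a sample sum to about 1,000,000."""
    return abs(sum(tpm_values) - 1e6) <= rel_tol * 1e6


# Toy example with hypothetical numbers.
counts = [100, 500, 0, 2500]
lengths_kb = [1.0, 2.5, 0.8, 5.0]

tpm = counts_to_tpm(counts, lengths_kb)
print(tpm_sum_ok(tpm))       # full sample: passes the check

# Assumption for illustration: a transcript was dropped after normalization.
filtered = tpm[:-1]
print(tpm_sum_ok(filtered))  # subset no longer sums to one million
```

Running the same kind of check on each sample column of a received table is a quick way to see whether the values are complete, unfiltered TPMs.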
Thank you very much for the reply, I will try to reanalyze the fastq files from scratch.
Yes, that is the best approach. Be sure to use a more sophisticated normalization method like vst or rlog from DESeq2.