Question

What are some references for works that use a TPM expression threshold to filter samples/genes?

4.2 years ago

n,n ▴ 360

I'm struggling to find papers that use transcripts per million (TPM) in their preprocessing steps to filter out non-expressed or very lowly expressed genes. I'm aware that filtering is usually recommended on raw read counts, since they provide more information for the decision; however, sometimes the raw read counts are not available. I'm mostly interested in what authors consider expressed (say, a TPM of at least 1 or at least 5) and what they consider a lowly expressed gene (say, x percent of a gene's TPM values across samples fail the expression criterion). I know of the heuristic that TPM = 5 corresponds to roughly one transcript per cell at any given time, but I haven't seen it mentioned in any citable work.
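To make the "x percent of samples" criterion concrete, here is a minimal sketch in plain Python. The function name `is_expressed`, the default thresholds (TPM of at least 1 in at least 20% of samples), and the toy values are all assumptions chosen for illustration, not values taken from any paper.

```python
def is_expressed(tpms, min_tpm=1.0, min_fraction=0.2):
    """Return True if at least `min_fraction` of samples reach `min_tpm`.

    Both thresholds are illustrative assumptions, not published values.
    """
    if not tpms:
        return False
    n_passing = sum(1 for value in tpms if value >= min_tpm)
    return n_passing / len(tpms) >= min_fraction


# Toy data (made up): gene name -> TPM value in each of five samples.
tpm_by_gene = {
    "geneA": [0.0, 0.2, 0.1, 0.0, 0.3],    # never reaches 1 TPM
    "geneB": [1.5, 2.0, 0.0, 0.2, 0.1],    # reaches 1 TPM in 2 of 5 samples
    "geneC": [12.0, 8.5, 15.2, 9.9, 11.3], # clearly expressed everywhere
}

kept = [gene for gene, tpms in tpm_by_gene.items() if is_expressed(tpms)]
print(kept)
```

With these toy values, only geneA is dropped; tightening `min_fraction` or raising `min_tpm` would remove geneB as well.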

So far I've managed to find this article, which investigates tibial nerve samples from the GTEx project. The authors filter out genes with a median TPM below 0.5 or a maximum TPM below 1 across samples. GTEx is a good example of a situation where you would want to filter by TPM, since the project has already performed high-quality processing of the raw read counts and researchers may pick up the TPMs from the start. Does anyone know of more papers in which filtering is applied directly to TPM values?
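For reference, the median/max rule described above can be sketched in plain Python as follows. This is my reading of the rule, not the article's actual code; the function name `passes_filter` and the toy values are assumptions.

```python
from statistics import median


def passes_filter(tpms, min_median=0.5, min_max=1.0):
    """Keep a gene unless its median TPM is below `min_median`
    or its maximum TPM is below `min_max` across samples."""
    return median(tpms) >= min_median and max(tpms) >= min_max


# Toy data (made up): gene name -> TPM value in each of five samples.
tpm_by_gene = {
    "geneA": [0.0, 0.1, 0.2, 0.0, 0.3],    # low median and low max
    "geneB": [0.6, 0.7, 0.8, 0.9, 0.5],    # median is fine, max stays below 1
    "geneC": [0.0, 0.1, 0.2, 25.0, 30.0],  # high max, but median below 0.5
    "geneD": [2.0, 3.5, 0.9, 4.1, 1.2],    # passes both criteria
}

kept = [gene for gene, tpms in tpm_by_gene.items() if passes_filter(tpms)]
print(kept)
```

Note that geneC is removed despite being highly expressed in two samples, so a rule like this may discard genes that are expressed in only a subset of samples.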

RNA-Seq


If I recall correctly, TPM is one of the main outputs of kallisto. It may be worth looking at papers that use kallisto.

4.2 years ago by curious ▴ 750

Related discussion: TPM values of expressed genes

4.1 years ago by igor 13k

score 1 · Answer 1 · 2020-03-02

In my opinion, it's impossible to reliably determine what is non-expressed or lowly expressed without further experiments (e.g. ones that include spike-ins). We only have heuristics, probably extrapolated from previously published work (often without citation). There is no golden rule, and I can't think of any study that reliably validates these heuristics.

See the following blog post (from the kallisto author) for a discussion of where the idea of using such thresholds might have arisen: https://liorpachter.wordpress.com/2014/04/30/estimating-number-of-transcripts-from-rna-seq-measurements-and-why-i-believe-in-paywall/