I'm struggling to find papers that use transcripts per million (TPM) in their pre-processing steps to filter out non-expressed or very lowly expressed genes. I'm aware that filtering is usually recommended on raw read counts, since they carry more information for making the decision, but sometimes the raw read counts are not available. I'm mostly interested in what authors consider expressed (say, a TPM of at least 1 or at least 5) and what they consider a lowly expressed gene (say, a gene whose TPM fails the expression criterion in x percent of samples). I know of the heuristic that TPM = 5 corresponds to roughly 1 transcript in a cell at any given time, but I haven't seen it mentioned in any citable work.
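To make the question concrete, the kind of rule I have in mind could be sketched as below. This is only an illustration: the function name `is_low_expressed`, the TPM cutoff of 1, and the 80 percent fraction are placeholders I chose for the example, not values taken from any paper.

```python
def is_low_expressed(tpm_values, min_tpm=1.0, failing_fraction=0.8):
    """Flag a gene as lowly expressed when at least `failing_fraction`
    of its per-sample TPM values fall below `min_tpm`.

    Both thresholds are hypothetical defaults, used for illustration only.
    """
    if not tpm_values:
        raise ValueError("tpm_values must contain at least one sample")
    # Count the samples in which the gene does not reach the TPM cutoff.
    failing = sum(1 for value in tpm_values if value < min_tpm)
    return failing / len(tpm_values) >= failing_fraction


# Five of six samples are below TPM 1, so this gene would be flagged.
print(is_low_expressed([0.0, 0.2, 0.5, 0.9, 0.3, 3.0]))
# Only one of four samples is below TPM 1, so this gene would be kept.
print(is_low_expressed([2.0, 5.0, 0.1, 7.0]))
```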
So far I've managed to find this article, which investigates tibial nerve samples available in the GTEx project. The authors filter out genes with a median TPM less than 0.5 or a maximum TPM less than 1 across samples. The GTEx project is a good example of a situation where you would want to filter by TPM, since they have already performed high-quality processing of the raw read counts and researchers may pick up the TPM values from the start. Does anyone know of more papers in which filtering is applied directly to TPM values?
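For reference, the filter described in that article (drop genes with median TPM below 0.5 or maximum TPM below 1 across samples) could be written roughly as follows. This is my own minimal sketch of the stated rule, not the authors' code; the function name `passes_tpm_filter`, the gene names, and the TPM values are made up for the example.

```python
from statistics import median


def passes_tpm_filter(tpm_values, min_median=0.5, min_max=1.0):
    """Keep a gene only if its median TPM is at least `min_median`
    and its maximum TPM is at least `min_max` across samples."""
    return median(tpm_values) >= min_median and max(tpm_values) >= min_max


# Hypothetical per-gene TPM values across four samples.
tpm = {
    "GENE_A": [0.0, 0.2, 0.4, 3.0],  # median 0.3: dropped (median below 0.5)
    "GENE_B": [0.6, 0.7, 0.8, 0.9],  # max 0.9: dropped (max below 1)
    "GENE_C": [0.5, 1.2, 4.0, 9.5],  # median 2.6, max 9.5: kept
}

kept = [gene for gene, values in tpm.items() if passes_tpm_filter(values)]
print(kept)
```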