Hi, after following 4 years of literature based on RNA-Seq studies, I understood that most of the papers arbitrarily define expression threshold i.e, >1 FPKM/RPKM to identify an expressed transcript. But how can one really justify this ?
Our lab uses spike-ins of some known RNA sequences, all at known concentrations. If the spike-in RPKM expression levels make sense, you have some evidence that RPKM for your transcripts at the same level are accurate.
Ambion ERCC spike-in controls is what we use.
Although spike-ins, as mentioned, are best, if you don't have them you could look at this paper: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000598
It outlines a procedure for setting a cutoff based on finding a good compromise between low rates of false positives and false negatives, respectively. The approach compares the observed distribution of FPKMs for transcripts in the sample with FPKMs calculated for a "negative set" of regions that lie close to annotated genes but haven't been observed to be expressed in any published experiments.