Hi, after following four years of literature on RNA-Seq studies, I have noticed that most papers define an expression threshold arbitrarily, e.g., >1 FPKM/RPKM, to identify an expressed transcript. But how can one really justify this?
Our lab uses spike-ins of known RNA sequences, each at a known concentration. If the spike-in RPKM values make sense, you have some evidence that RPKMs for your transcripts at the same expression level are accurate.
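One way to make "make sense" concrete is to check that measured RPKM scales linearly with the known input amount. A minimal sketch, with made-up concentrations and RPKM values (real spike-in sets, e.g. ERCC mixes, have their own documented concentrations):

```python
# Sketch: sanity-check spike-in quantification.
# known_conc: known input amount of each spike-in; rpkm: measured RPKM.
# All numbers below are hypothetical.
import math

known_conc = [0.1, 1.0, 10.0, 100.0, 1000.0]   # known input amounts
rpkm       = [0.08, 0.9, 11.0, 95.0, 1050.0]   # measured RPKM (made up)

# Pearson correlation of log-transformed values; a value near 1.0 means
# RPKM tracks input concentration over this dynamic range.
lx = [math.log10(x) for x in known_conc]
ly = [math.log10(y) for y in rpkm]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
sx = math.sqrt(sum((a - mx) ** 2 for a in lx))
sy = math.sqrt(sum((b - my) ** 2 for b in ly))
r = cov / (sx * sy)
print(f"log-log correlation r = {r:.3f}")
```

The lowest concentration at which the spike-ins still fall on the line gives you a rough idea of where quantification stops being trustworthy.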
One published approach outlines a procedure for setting a cutoff by finding a good compromise between the false-positive and false-negative rates. It compares the observed distribution of FPKMs for transcripts in the sample with FPKMs calculated for a "negative set" of regions that lie close to annotated genes but have not been observed to be expressed in any published experiment.
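The trade-off can be sketched in a few lines: for each candidate cutoff, the false-positive rate is the fraction of the negative set called "expressed", and the false-negative rate is the fraction of observed transcripts called "not expressed". The FPKM lists below are made up for illustration:

```python
# Toy sketch of the FP/FN trade-off idea (all FPKM values invented).
# observed: FPKMs of annotated transcripts in the sample.
# negative: FPKMs of the "negative set" regions (presumed unexpressed).

observed = [0.05, 0.3, 1.2, 2.5, 4.0, 8.0, 15.0, 40.0]
negative = [0.01, 0.02, 0.05, 0.1, 0.2, 0.4, 0.8]

def error_rates(cutoff):
    # False positives: negative-set regions called "expressed".
    fp = sum(x >= cutoff for x in negative) / len(negative)
    # False negatives: observed transcripts called "not expressed".
    fn = sum(x < cutoff for x in observed) / len(observed)
    return fp, fn

# Pick the candidate cutoff that best balances the two error rates.
candidates = [0.1, 0.25, 0.5, 1.0, 2.0]
best = min(candidates, key=lambda c: sum(error_rates(c)))
print(best, error_rates(best))
```

In practice you would scan a fine grid of cutoffs over the real FPKM distributions rather than this short candidate list.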
If a read exists in your RNA-Seq data set that aligns uniquely to a gene, doesn't it mean that the original RNA sample contained a transcript from that gene? The only other way to get such a read would be contamination from genomic DNA. And if you observe more than one read aligning to your gene of interest and they are clearly not PCR duplicates, then your confidence that the gene was active in your original sample would increase. However, in practice, it is very hard to work with these very weakly expressed genes. For example, if you try to assay their expression using qPCR, the Cq values may be so large and variable that you can't get an accurate measurement.
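The "more than one non-duplicate read" idea amounts to collapsing reads with identical alignment coordinates before counting. A toy sketch (gene names and coordinates are invented; on real data you would work from a BAM file and use duplicate flags):

```python
# Toy sketch: count reads per gene after removing likely PCR
# duplicates (reads with identical alignment coordinates).
# Read tuples are (gene, start, end); all data is made up.

reads = [
    ("GENE_A", 100, 150),
    ("GENE_A", 100, 150),   # same coordinates: likely a PCR duplicate
    ("GENE_A", 230, 280),
    ("GENE_B", 500, 550),
]

unique = set(reads)  # collapse identical alignments
counts = {}
for gene, _, _ in unique:
    counts[gene] = counts.get(gene, 0) + 1

print(counts)  # GENE_A is supported by 2 distinct reads, GENE_B by 1
```

Under this view, GENE_A (two distinct reads) earns more confidence than GENE_B (one read), matching the reasoning above.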
On the other hand, if you are doing a more genome-scale analysis, maybe because you are interested in the diversity of genes that are expressed across different sample types (e.g., pollen, roots, leaves, trichomes), then it probably makes sense to apply a cutoff. In that scenario, some libraries might seem to indicate greater diversity of gene expression only because you did more sequencing and there were more chances to observe rare reads arising from less active genes.
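One common way to guard against that depth artifact is to subsample every library to a common read depth before counting expressed genes. A minimal sketch with toy per-read gene assignments (gene labels and read counts are made up):

```python
# Sketch: subsample each library to a common depth before comparing
# how many genes look "expressed", so deeper libraries don't appear
# artificially more diverse. All data below is invented.
import random

random.seed(0)

# Each library is a list of per-read gene assignments (toy data).
lib_a = ["g1"] * 50 + ["g2"] * 30 + ["g3"] * 5               # 85 reads
lib_b = ["g1"] * 400 + ["g2"] * 300 + ["g3"] * 40 + ["g4"] * 2  # 742 reads

depth = 80  # common subsampling depth (<= size of the smallest library)

def genes_detected(reads, depth, min_reads=2):
    sub = random.sample(reads, depth)
    counts = {}
    for g in sub:
        counts[g] = counts.get(g, 0) + 1
    # Require a couple of supporting reads before calling a gene expressed.
    return {g for g, n in counts.items() if n >= min_reads}

print(genes_detected(lib_a, depth))
print(genes_detected(lib_b, depth))
```

After equalizing depth, any remaining difference in the number of detected genes is more plausibly biological rather than an artifact of sequencing effort.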