Question: How Do You Justify Your RNA-Seq Expression Threshold (FPKM/RPKM)?
5.4 years ago by
biorepine1.4k wrote:

Hi, after following four years of RNA-Seq literature, I see that most papers define an arbitrary expression threshold, e.g., >1 FPKM/RPKM, to call a transcript expressed. But how can one really justify this?

rna-seq cutoff fpkm rpkm • 29k views
ADD COMMENTlink modified 3.0 years ago by Malachi Griffith16k • written 5.4 years ago by biorepine1.4k
5.4 years ago by
United States
swbarnes23.6k wrote:

Our lab uses spike-ins of known RNA sequences, all at known concentrations. If the spike-in RPKM values make sense, you have some evidence that the RPKM values for your transcripts at the same levels are accurate.

We use the Ambion ERCC spike-in controls.
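
A minimal sketch of what "make sense" could mean in practice, assuming you have a table of known spike-in concentrations and their measured RPKM. The function names and the toy numbers are hypothetical. It checks that log concentration and log RPKM track each other, and reports the lowest RPKM among spike-ins above the highest concentration that was missed, which is one possible data-driven floor.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spike_in_summary(spikes):
    """spikes: list of (known_concentration, measured_rpkm) pairs.

    Returns (log-log correlation of detected spike-ins,
             lowest RPKM among spike-ins above the highest missed concentration).
    """
    detected = [(c, r) for c, r in spikes if r > 0]
    corr = pearson([math.log2(c) for c, _ in detected],
                   [math.log2(r) for _, r in detected])
    missed = [c for c, r in spikes if r == 0]
    floor_conc = max(missed) if missed else 0
    reliable = [r for c, r in spikes if c > floor_conc]
    return corr, (min(reliable) if reliable else None)

# Toy values only; real concentrations come from the spike-in kit documentation.
spikes = [(0.5, 0.0), (1, 0.3), (2, 0.0), (4, 1.1),
          (8, 2.0), (16, 4.2), (32, 7.9), (64, 16.5)]
r, floor_rpkm = spike_in_summary(spikes)
print(f"log-log correlation: {r:.3f}; reliable above ~{floor_rpkm} RPKM")
```

A high correlation with a clear detection floor supports using that floor as the cutoff; a poor correlation suggests the RPKM values are not trustworthy at any level.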

ADD COMMENTlink written 5.4 years ago by swbarnes23.6k

I think using spike-in controls is key going forward with RNA-Seq experiments. Personally, I was quite irritated by the absurdly low cutoffs ENCODE has been using for calling "novel" RNAs: levels that frankly reflect noise picked up by the depth of sequencing.

ADD REPLYlink written 5.4 years ago by Dan Gaston7.0k

I couldn't agree more. However, using >1 RPKM to discover long non-coding RNAs should be fine, as they are expected to be lowly expressed.

ADD REPLYlink written 5.4 years ago by biorepine1.4k

That depends on what you expect >1 RPKM to work out to in terms of the number of transcripts per cell.
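
A back-of-the-envelope conversion, as a sketch only. It assumes that a transcript's share of the summed RPKM approximates its share of all mRNA molecules, and that you have some estimate of total mRNA molecules per cell. Both input numbers below are placeholders, not measured values.

```python
def rpkm_to_copies_per_cell(rpkm, total_rpkm, mrna_per_cell):
    """Estimate transcript copies per cell from an RPKM value.

    rpkm / total_rpkm is the transcript's fraction of the mRNA pool
    (the same quantity that TPM expresses per million).
    """
    fraction_of_pool = rpkm / total_rpkm
    return fraction_of_pool * mrna_per_cell

# Hypothetical inputs: summed RPKM over all transcripts in the sample,
# and an assumed number of mRNA molecules in one cell.
copies = rpkm_to_copies_per_cell(rpkm=1.0, total_rpkm=500_000, mrna_per_cell=200_000)
print(f"1 RPKM is about {copies:.2f} copies per cell under these assumptions")
```

The answer scales directly with the assumed mRNA content per cell, so the same RPKM cutoff can mean quite different things in different cell types.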

ADD REPLYlink written 5.4 years ago by Dan Gaston7.0k
5.4 years ago by
Center for Geogenetik Københavns Universitet
Gabriel R.2.4k wrote:

If I were you, I would make a density plot of the FPKM values you are getting. Hopefully you will see a distinct distribution and a reliable range for your cutoff.
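
A minimal text-only version of such a plot, assuming the FPKM values are in a plain list. The values are log2-transformed after adding a small pseudocount (a hypothetical choice of 0.01) so that zeros can be shown; the toy data and bin width are illustrative.

```python
import math
from collections import Counter

def log2_histogram(fpkms, bin_width=0.5, pseudocount=0.01):
    """Return {bin_start: fraction_of_genes} for log2(FPKM + pseudocount)."""
    logs = [math.log2(v + pseudocount) for v in fpkms]
    bins = Counter(math.floor(x / bin_width) * bin_width for x in logs)
    n = len(logs)
    return {b: bins[b] / n for b in sorted(bins)}

# Toy FPKM values; replace with one value per gene from your own table.
fpkms = [0.0, 0.02, 0.05, 0.1, 0.3, 2.0, 5.0, 8.0, 12.0, 20.0, 35.0, 150.0]
hist = log2_histogram(fpkms, bin_width=2.0)
for start, frac in hist.items():
    print(f"{start:6.1f} {'#' * round(frac * 40)}")
```

Note that all genes with zero FPKM pile up in the bin containing log2(pseudocount), so that bin should not be mistaken for a biological peak.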

ADD COMMENTlink written 5.4 years ago by Gabriel R.2.4k

Still, isn't the way you choose the cutoff after plotting them kind of arbitrary?

ADD REPLYlink written 5.4 years ago by biorepine1.4k

Arbitrary perhaps, but at least justifiable.

ADD REPLYlink written 5.4 years ago by Gabriel R.2.4k

Based on the density plot, what's your suggestion on where to assign a threshold?

ADD REPLYlink written 4.3 years ago by daniel.bellieny0

Depends on the distribution. If you get a nice bimodal distribution, anything in between the two peaks.
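
One way to make "in between" reproducible, sketched under the assumption that the log2 distribution really is bimodal: bin the values, find the two main peaks, and take the emptiest bin between them. The function name, bin width, pseudocount and minimum peak separation are all hypothetical choices, and the data are simulated.

```python
import math
import random

def valley_cutoff(fpkms, bin_width=1.0, pseudocount=0.01, min_sep=3):
    """Return an FPKM cutoff at the emptiest bin between the two main peaks.

    min_sep is the minimum distance in bins between the peaks (must be >= 2).
    """
    logs = [math.log2(v + pseudocount) for v in fpkms]
    lo = math.floor(min(logs) / bin_width)
    hi = math.floor(max(logs) / bin_width)
    counts = [0] * (hi - lo + 1)
    for x in logs:
        counts[math.floor(x / bin_width) - lo] += 1

    def is_peak(i):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        return counts[i] > 0 and counts[i] >= left and counts[i] >= right

    peaks = [i for i in range(len(counts)) if is_peak(i)]
    first = max(peaks, key=lambda i: counts[i])
    far = [i for i in peaks if abs(i - first) >= min_sep]
    if not far:
        raise ValueError("no second peak found; distribution may not be bimodal")
    second = max(far, key=lambda i: counts[i])
    a, b = sorted((first, second))
    valley = min(range(a + 1, b), key=lambda i: counts[i])
    center = (lo + valley + 0.5) * bin_width  # bin midpoint in log2 units
    return 2 ** center - pseudocount

# Simulated data: a low "noise" mode and a high "expressed" mode.
rng = random.Random(1)
low = [2 ** rng.gauss(-4, 1) for _ in range(300)]
high = [2 ** rng.gauss(4, 1.5) for _ in range(700)]
cutoff = valley_cutoff(low + high)
print(f"suggested cutoff: {cutoff:.2f} FPKM")
```

The result moves with the bin width and pseudocount, so it is worth checking that the cutoff is stable across a few reasonable settings before reporting it.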

ADD REPLYlink written 4.3 years ago by Gabriel R.2.4k

Why anything in between? Do you think the expression levels of genes at the lower peak are not trustworthy? I thought these genes are just expressed at low levels.


ADD REPLYlink written 23 months ago by moushengxu300

Which values should I select to make the distribution graph? I have been trying to do this but failed. Please suggest the simplest way, as I am a beginner in this area. Thank you.

ADD REPLYlink written 13 days ago by BIOTECH.DEEPTI9110

We used RSEM to align and quantify the RNA-seq levels, and used an estimated gene count of 5 as the threshold: if none of the samples has a gene count >= 5, that gene is filtered out and not used for downstream analysis. Unless you have a strong reason not to do so, this filtering method should serve you well, as it has done for us.
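
The filter described above can be sketched as follows, assuming the estimated counts have already been loaded into a dictionary keyed by gene. The gene names and counts are made up.

```python
def filter_low_counts(counts, min_count=5):
    """Keep genes whose estimated count reaches min_count in at least one sample.

    counts: {gene_id: [estimated count in sample 1, sample 2, ...]}
    """
    return {gene: values for gene, values in counts.items()
            if max(values) >= min_count}

# Hypothetical estimated counts for three samples.
counts = {
    "geneA": [120.0, 98.5, 143.2],
    "geneB": [0.0, 4.9, 1.0],    # never reaches 5, so it is dropped
    "geneC": [0.0, 0.0, 5.0],    # reaches 5 in one sample, so it is kept
}
kept = filter_low_counts(counts)
print(sorted(kept))
```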

ADD REPLYlink modified 13 days ago • written 13 days ago by moushengxu300
5.4 years ago by
Mikael Huss4.6k wrote:

Although spike-ins, as mentioned, are best, if you don't have them you could look at this paper:

It outlines a procedure for setting a cutoff that strikes a good compromise between a low false positive rate and a low false negative rate. The approach compares the observed distribution of FPKMs for transcripts in the sample with FPKMs calculated for a "negative set" of regions that lie close to annotated genes but haven't been observed to be expressed in any published experiments.
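
A simplified sketch of this kind of trade-off, not the paper's exact procedure. The working definitions here are assumptions made for illustration: a false positive is a negative-set region at or above the cutoff, and a false negative is an annotated transcript below it (which treats every annotated transcript as truly expressed, a strong simplification). All values are invented.

```python
def choose_cutoff(gene_fpkms, negative_fpkms, candidates):
    """Return (cutoff, fp_rate, fn_rate) minimising fp_rate + fn_rate."""
    best = None
    for t in candidates:
        fp_rate = sum(v >= t for v in negative_fpkms) / len(negative_fpkms)
        fn_rate = sum(v < t for v in gene_fpkms) / len(gene_fpkms)
        if best is None or fp_rate + fn_rate < best[1] + best[2]:
            best = (t, fp_rate, fn_rate)
    return best

# Toy FPKM values for annotated transcripts and for the negative-set regions.
gene_fpkms = [0.05, 0.4, 1.2, 3.0, 8.0, 15.0, 40.0, 120.0]
negative_fpkms = [0.0, 0.0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.9]
cutoff, fp_rate, fn_rate = choose_cutoff(
    gene_fpkms, negative_fpkms, candidates=[0.1, 0.25, 0.5, 1.0, 2.0, 4.0])
print(f"cutoff={cutoff} FPKM, FP rate={fp_rate:.3f}, FN rate={fn_rate:.3f}")
```

Weighting the two error rates equally is itself a choice; if false positives are costlier for your question, the sum can be replaced by a weighted sum.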

ADD COMMENTlink written 5.4 years ago by Mikael Huss4.6k

Just before posting this question, I came across this paper, but I was confused by the way they define false positives/negatives.

ADD REPLYlink written 5.4 years ago by biorepine1.4k
5.4 years ago by
Damian Kao14k wrote:

Using an RPKM of 1 is as arbitrary as using a p-value of 0.05. There are some papers that use intronic/intergenic expression as the baseline threshold. But even that can get complicated and messy.
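
A rough sketch of the intergenic-baseline idea, assuming you have already computed RPKM for a set of intergenic windows. The choice of the 95th percentile and the window values are hypothetical.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Toy RPKM values for 20 intergenic windows (background transcription).
intergenic_rpkm = [0.0, 0.0, 0.0, 0.01, 0.02, 0.03, 0.05, 0.08, 0.1, 0.15,
                   0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95, 2.5]
cutoff = percentile(intergenic_rpkm, 95)
print(f"background-derived cutoff: {cutoff} RPKM")
```

The messy part is choosing the windows: regions near unannotated genes or repeats can inflate the background, so the cutoff is only as good as the negative regions behind it.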

ADD COMMENTlink written 5.4 years ago by Damian Kao14k
4.0 years ago by
Concord NC USA
Ann2.2k wrote:

If a read exists in your RNA-Seq data set that aligns uniquely to a gene, doesn't it mean that the original RNA sample contained a transcript from that gene? The only other way to get such a read would be contamination from genomic DNA. And if you observe more than one read aligning to your gene of interest and they are clearly not PCR duplicates, then your confidence that the gene was active in your original sample would increase. However, in practice, it is very hard to work with these very lowly expressed genes. For example, if you try to assay their expression using qPCR, the Cq values may be so large and variable that you can't get an accurate measurement.

On the other hand, if you are doing a more genome-scale analysis, maybe because you are interested in the diversity of genes that are expressed across different sample types (e.g., pollen, roots, leaves, trichomes), then it probably makes sense to apply a cutoff. In that scenario, some libraries might seem to indicate greater diversity of gene expression only because you did more sequencing and there were more chances to observe rare reads arising from less active genes.
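
A related way to handle the depth effect is to downsample every library to the same number of reads before counting detected genes. This is a small sketch with made-up counts; the function names are hypothetical, and the list-based sampling is only practical for toy data (a real library would need a vectorised sampler).

```python
import random

def downsample_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from {gene: read_count}."""
    rng = random.Random(seed)
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("requested depth exceeds library size")
    sampled = {}
    for gene in rng.sample(pool, depth):
        sampled[gene] = sampled.get(gene, 0) + 1
    return sampled

def genes_detected(counts, min_reads=1):
    """Number of genes with at least min_reads reads."""
    return sum(n >= min_reads for n in counts.values())

# Toy libraries: "deep" was sequenced roughly ten times deeper than "shallow".
deep = {"geneA": 5000, "geneB": 3000, "geneC": 40, "geneD": 3, "geneE": 1}
shallow = {"geneA": 500, "geneB": 310, "geneC": 5}
common_depth = min(sum(deep.values()), sum(shallow.values()))
deep_sub = downsample_counts(deep, common_depth)
print("deep, full depth:", genes_detected(deep))
print("deep, downsampled:", genes_detected(deep_sub))
print("shallow:", genes_detected(shallow))
```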


ADD COMMENTlink written 4.0 years ago by Ann2.2k

I think it can be easy to conflate "expressed" vs "expressed and with noticeable phenotype".

The transcriptional landscape is a stochastic bag of enzymes and molecules. Transcription happens randomly and everywhere. It just so happens that certain places on the genome allow for more transcription. So in terms of expression, one tag mapping to a transcript does mean expression, but does it mean it is affecting some kind of phenotype? That is probably what people want to know to gain some kind of biological insight. Depending on the cellular context, maybe 1 transcript is enough to cause some kind of amplification cascade to affect phenotype; or maybe at least 1 billion transcripts are needed. I don't think a global threshold can really be defined for "expression with phenotype".

ADD REPLYlink written 4.0 years ago by Damian Kao14k

Exactly. There is a lot of transcriptional noise. We know that random non-gene portions of the genome get transcribed at low levels. You have to establish, at the very least, a baseline threshold for clearing that noise level to even begin to say that something is biologically relevant.


ADD REPLYlink written 4.0 years ago by Dan Gaston7.0k

Sorry, I forgot to mention another possible solution or reason to apply a cutoff. You may suspect your sample has some contamination. For example, your method of isolating single cell types might be imperfect. In that case, you could use reads from genes that you expect to be expressed in the contaminating cell type as a way to pick a cutoff. For example, you would not expect photosynthetic genes to be expressed in pollen, and so you use those to calibrate your cutoff. A reviewer of a pollen RNA-Seq paper I wrote suggested this idea. It made good sense to me so I included it in the final version of the paper. So there is at least one example of this idea "working" in a peer review scenario.
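
In code, that calibration might look like the sketch below. The gene names and FPKM values are invented, and taking the maximum over the marker genes is one possible choice (a high percentile would be a gentler alternative).

```python
def contamination_cutoff(fpkm, marker_genes):
    """Highest FPKM seen among genes assumed to come only from contamination."""
    observed = [fpkm[g] for g in marker_genes if g in fpkm]
    if not observed:
        raise ValueError("none of the marker genes are present in the table")
    return max(observed)

# Hypothetical pollen sample; the photosynthesis markers should not be expressed.
fpkm = {
    "pollen_gene_1": 85.0,
    "pollen_gene_2": 2.4,
    "unknown_gene": 0.6,
    "photosyn_marker_1": 0.9,
    "photosyn_marker_2": 0.3,
}
markers = ["photosyn_marker_1", "photosyn_marker_2"]
cutoff = contamination_cutoff(fpkm, markers)
expressed = sorted(g for g, v in fpkm.items() if v > cutoff and g not in markers)
print(cutoff, expressed)
```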

ADD REPLYlink written 4.0 years ago by Ann2.2k

Of course the downside of that is that you may filter out potentially novel unknown functions of other genes simply because you think it is contamination. 

ADD REPLYlink written 4.0 years ago by Dan Gaston7.0k

Even if you do DNase treatment, you will still have some amount of noise from genomic DNA that remains. In addition to genomic DNA contamination, it could also be that the read was misaligned to that region. If the read is within the intron of a gene, it could also be signal from unprocessed RNA. I have observed instances of other contamination from the lab (e.g. genomic DNA or cDNA from another experiment) as well. Finally, as others have mentioned, transcription is a stochastic process. Every base in the genome is transcribed with some probability. Only a subset of this transcription is biologically significant to most researchers doing gene expression assays. If you have RNA-seq reads that span what appears to be a valid splice site, this gives you a bit more confidence, because exon-exon junction sequences usually do not occur in genomic DNA or unprocessed RNA.
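
As a small illustration of the last point: in SAM/BAM alignments from a splice-aware aligner, a read that spans an exon-exon junction carries an `N` (skipped region) operation in its CIGAR string. The sketch below flags such reads; the `min_intron` length filter is a hypothetical parameter, and the CIGAR strings are examples.

```python
import re

# One CIGAR element is a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def spans_junction(cigar, min_intron=20):
    """True if the alignment skips at least min_intron reference bases (op N)."""
    return any(op == "N" and int(length) >= min_intron
               for length, op in CIGAR_OP.findall(cigar))

for cigar in ["50M1200N50M", "100M", "30M2D70M"]:
    print(cigar, spans_junction(cigar))
```

Counting junction-spanning reads per gene, separately from unspliced reads, gives a second line of evidence that is less sensitive to genomic DNA carry-over.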

ADD REPLYlink written 3.0 years ago by Malachi Griffith16k