Question: How do you justify your RNA-Seq expression threshold (FPKM/RPKM)?
8
2.6 years ago by
biorepine990
Spain
biorepine990 wrote:

Hi, after following four years of RNA-Seq literature, I have come to understand that most papers define the expression threshold arbitrarily, e.g. >1 FPKM/RPKM, to call a transcript expressed. But how can one really justify this?

rna-seq cutoff fpkm rpkm • 10.0k views
ADD COMMENTlink modified 12 weeks ago by Malachi Griffith11k • written 2.6 years ago by biorepine990
7
2.6 years ago by
swbarnes22.3k
United States
swbarnes22.3k wrote:

Our lab uses spike-ins of known RNA sequences, all at known concentrations. If the spike-in RPKM values make sense given those concentrations, you have some evidence that the RPKM values for your own transcripts at the same level are accurate.

We use the Ambion ERCC spike-in controls.
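
A minimal sketch of how such a check could look, assuming you have a table of spike-in concentrations and their observed RPKMs. The numbers, the 2-fold tolerance and the function names are hypothetical and only illustrate the idea: fit a line in log-log space on the well-behaved spike-ins, then walk down in concentration until the observed RPKM stops following that line.

```python
import math

def fit_loglog(conc, rpkm):
    """Least-squares fit of log10(RPKM) against log10(concentration).
    All values must be positive. Returns (slope, intercept)."""
    xs = [math.log10(c) for c in conc]
    ys = [math.log10(r) for r in rpkm]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def lowest_reliable_rpkm(conc, rpkm, slope, intercept, max_fold_error=2.0):
    """Walk from the most to the least concentrated spike-in and return the
    observed RPKM of the last one that is within max_fold_error of the fit."""
    lowest = None
    for c, r in sorted(zip(conc, rpkm), reverse=True):
        if r <= 0:
            break  # spike-in not detected at all
        predicted = slope * math.log10(c) + intercept
        fold_error = 10 ** abs(math.log10(r) - predicted)
        if fold_error > max_fold_error:
            break
        lowest = r
    return lowest

# Hypothetical spike-in data (arbitrary concentration units).
spike_conc = [0.1, 1.0, 10.0, 100.0, 1000.0]
spike_rpkm = [1.5, 2.1, 19.0, 210.0, 1950.0]

# Fit only on the higher concentrations, where the signal is assumed clean.
high = [(c, r) for c, r in zip(spike_conc, spike_rpkm) if c >= 1.0]
slope, intercept = fit_loglog([c for c, _ in high], [r for _, r in high])
print(lowest_reliable_rpkm(spike_conc, spike_rpkm, slope, intercept))
```

The RPKM returned is only a candidate floor; which spike-ins to fit on and how much deviation to tolerate are judgment calls.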

ADD COMMENTlink written 2.6 years ago by swbarnes22.3k
1

I think using spike-in controls is key going forward with RNA-Seq experiments. Personally I was quite irritated with the absurdly low cut-offs ENCODE has been using for calling "novel" RNAs, levels that frankly reflect noise picked up by the depth of sequencing.

ADD REPLYlink written 2.6 years ago by Dan Gaston4.2k

I can't agree more. However, using >1 RPKM in discovering long non-coding RNAs should be fine as they are expected to be lowly expressed.

ADD REPLYlink written 2.6 years ago by biorepine990
1

It depends on what you expect >1 RPKM to work out to in terms of the number of transcripts per cell.
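
As a rough worked example of that conversion: under idealized uniform sampling, RPKM for a transcript equals 1e9 times its copy number divided by the total number of mRNA nucleotides in the cell, so copies per cell = RPKM x (mRNA molecules per cell) x (mean transcript length) / 1e9. The defaults below (300,000 mRNA molecules per cell, 2,000 nt mean length) are placeholder assumptions, not measured values; substitute estimates for your own cell type.

```python
def transcripts_per_cell(rpkm, mrna_per_cell=300_000, mean_length_nt=2_000):
    """Approximate copies per cell implied by an RPKM value.

    Assumes reads sample the mRNA pool uniformly. Both defaults are
    hypothetical placeholders for illustration only.
    """
    total_mrna_nt = mrna_per_cell * mean_length_nt
    return rpkm * total_mrna_nt / 1e9

for value in (0.1, 1.0, 10.0):
    print(value, transcripts_per_cell(value))
```

With these particular placeholder numbers, 1 RPKM comes out below one copy per cell, which is why the choice of assumptions matters as much as the threshold itself.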

ADD REPLYlink written 2.6 years ago by Dan Gaston4.2k
4
2.6 years ago by
Gabriel R.1.5k
Germany
Gabriel R.1.5k wrote:

If I were you, I would make a density plot of the FPKM values you are getting. Hopefully you will see a distinct distribution and a reliable range for your cutoff.
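
Beyond eyeballing the plot, the same idea can be sketched programmatically: bin log2(FPKM), look for two modes, and take the lowest point between them. This is a crude, standard-library-only illustration; the function name, the bin count and the smoothing window are assumptions, and on noisy real data a proper kernel density estimate plus a look at the plot is safer.

```python
import math

def valley_cutoff(fpkm, bins=40, smooth=2):
    """Return an FPKM cutoff at the lowest point between the two tallest
    modes of the log2(FPKM) histogram, or None if two modes are not found.
    Zero FPKM values are dropped because their log is undefined."""
    logs = [math.log2(v) for v in fpkm if v > 0]
    if not logs:
        return None
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return None
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in logs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    # Moving-average smoothing to suppress small bumps.
    smoothed = []
    for i in range(bins):
        window = counts[max(0, i - smooth): i + smooth + 1]
        smoothed.append(sum(window) / len(window))
    # Local maxima, with zero padding so edge bins can also be peaks.
    padded = [0.0] + smoothed + [0.0]
    peaks = [i for i in range(bins)
             if padded[i + 1] >= padded[i] and padded[i + 1] > padded[i + 2]]
    if len(peaks) < 2:
        return None
    a, b = sorted(sorted(peaks, key=lambda i: smoothed[i], reverse=True)[:2])
    valley = min(range(a, b + 1), key=lambda i: smoothed[i])
    return 2 ** (lo + (valley + 0.5) * width)
```

If the function returns None, or the two tallest peaks turn out to be bumps on the same mode, the distribution is not cleanly bimodal and this approach does not give a defensible cutoff.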

ADD COMMENTlink written 2.6 years ago by Gabriel R.1.5k
1

Still, isn't the way you choose the cutoff after plotting them kind of arbitrary?

ADD REPLYlink written 2.6 years ago by biorepine990
1

Arbitrary perhaps, but at least justifiable.

ADD REPLYlink written 2.6 years ago by Gabriel R.1.5k

Based on the density plot, what's your suggestion on where to assign a threshold?

ADD REPLYlink written 18 months ago by daniel.bellieny0
1

Depends on the distribution. If you get a nice bimodal distribution, anything in between the two modes.

ADD REPLYlink written 18 months ago by Gabriel R.1.5k
4
2.6 years ago by
Mikael Huss3.9k
Stockholm
Mikael Huss3.9k wrote:

Although spike-ins, as mentioned, are best, if you don't have them you could look at this paper: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000598

It outlines a procedure for setting a cutoff based on finding a good compromise between low rates of false positives and false negatives, respectively. The approach compares the observed distribution of FPKMs for transcripts in the sample with FPKMs calculated for a "negative set" of regions that lie close to annotated genes but haven't been observed to be expressed in any published experiments.
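
To make the trade-off concrete, here is a generic sketch, not the paper's exact procedure. It assumes two hypothetical inputs: FPKMs for a "negative set" of regions presumed unexpressed, and FPKMs for a "positive set" of genes presumed expressed (the positive set is my assumption for illustration). It scans candidate cutoffs and picks the one that minimizes the sum of the false positive and false negative rates.

```python
def best_cutoff(negative_fpkm, positive_fpkm):
    """Return (cutoff, false_positive_rate, false_negative_rate).

    A region or gene is called expressed when FPKM >= cutoff.
    False positives: negative-set regions called expressed.
    False negatives: positive-set genes not called expressed.
    """
    candidates = sorted(set(negative_fpkm) | set(positive_fpkm))
    best = None
    for c in candidates:
        fpr = sum(v >= c for v in negative_fpkm) / len(negative_fpkm)
        fnr = sum(v < c for v in positive_fpkm) / len(positive_fpkm)
        if best is None or fpr + fnr < best[0]:
            best = (fpr + fnr, c, fpr, fnr)
    return best[1], best[2], best[3]

# Hypothetical values for illustration only.
background = [0.0, 0.1, 0.2, 0.3, 1.5]
expressed = [0.25, 2.0, 5.0, 10.0, 50.0]
print(best_cutoff(background, expressed))
```

Weighting the two error rates equally is itself a choice; if false positives are more costly for your question, the sum can be replaced by a weighted one.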

ADD COMMENTlink written 2.6 years ago by Mikael Huss3.9k
2

Just before posting this question, I came across this paper, but I was confused by the way they define false positives/negatives.

ADD REPLYlink written 2.6 years ago by biorepine990
3
2.6 years ago by
Damian Kao12k
UK
Damian Kao12k wrote:

Using an RPKM of 1 is as arbitrary as using a p-value of 0.05. There are some papers that use intronic/intergenic expression as the baseline threshold. But even that can get complicated and messy.

ADD COMMENTlink written 2.6 years ago by Damian Kao12k
3
14 months ago by
Ann1.6k
Kannapolis NC USA
Ann1.6k wrote:

If a read exists in your RNA-Seq data set that aligns uniquely to a gene, doesn't it mean that the original RNA sample contained a transcript from that gene? The only other way to get such a read would be contamination from genomic DNA. And if you observe more than one read aligning to your gene of interest and they are clearly not PCR duplicates, then your confidence that the gene was active in your original sample would increase. However, in practice, it is very hard to work with these very lowly expressed genes. For example, if you try to assay their expression using qPCR, the Cq values may be so large and variable that you can't get an accurate measurement.

On the other hand, if you are doing a more genome-scale analysis, maybe because you are interested in the diversity of genes that are expressed across different sample types (e.g., pollen, roots, leaves, trichomes), then it probably makes sense to apply a cutoff. In that scenario, some libraries might seem to indicate greater diversity of gene expression only because you did more sequencing and there were more chances to observe rare reads arising from less active genes.
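
One way to guard against that depth effect, sketched here under the assumption that you have per-gene read counts for each library, is to downsample every library to the same total depth before counting detected genes. The function names and the min_reads parameter are hypothetical, and expanding reads into a list is only practical for modest library sizes.

```python
import random

def downsample_counts(counts, target_depth, seed=0):
    """Randomly keep target_depth reads (without replacement) from a
    dict of gene -> read count. Returns a new dict of gene -> count."""
    total = sum(counts.values())
    if target_depth >= total:
        return dict(counts)
    rng = random.Random(seed)
    reads = [gene for gene, n in counts.items() for _ in range(n)]
    kept = rng.sample(reads, target_depth)
    out = {gene: 0 for gene in counts}
    for gene in kept:
        out[gene] += 1
    return out

def genes_detected(counts, min_reads=1):
    """Number of genes with at least min_reads reads."""
    return sum(1 for n in counts.values() if n >= min_reads)

# Hypothetical libraries: compare diversity at the depth of the smaller one.
pollen = {"geneA": 500, "geneB": 3, "geneC": 0}
leaf = {"geneA": 5000, "geneB": 40, "geneC": 2}
depth = min(sum(pollen.values()), sum(leaf.values()))
print(genes_detected(downsample_counts(pollen, depth)),
      genes_detected(downsample_counts(leaf, depth)))
```

Repeating the downsampling with several seeds gives a sense of how stable the detected-gene counts are.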


 

ADD COMMENTlink written 14 months ago by Ann1.6k
3

I think it can be easy to conflate "expressed" vs "expressed and with noticeable phenotype".

The transcriptional landscape is a stochastic bag of enzymes and molecules. Transcription happens randomly and everywhere. It just so happens that certain places on the genome allow for more transcription. So in terms of expression, one tag mapping to a transcript does mean expression, but does it mean it is affecting some kind of phenotype? That is probably what people want to know to gain some kind of biological insight. Depending on the cellular context, maybe 1 transcript is enough to cause some kind of amplification cascade to affect phenotype; or maybe at least 1 billion transcripts are needed. I don't think a global threshold can really be defined for "expression with phenotype".

ADD REPLYlink written 14 months ago by Damian Kao12k

Exactly. There is a lot of transcriptional noise. We know that random non-gene portions of the genome get transcribed at low levels. You have to establish, at the very least, a baseline threshold for clearing that noise level to even begin to say that something is biologically relevant.

 

ADD REPLYlink written 14 months ago by Dan Gaston4.2k

Sorry, I forgot to mention another possible solution or reason to apply a cutoff. You may suspect your sample has some contamination. For example, your method of isolating single cell types might be imperfect. In that case, you could use reads from genes that you expect to be expressed in the contaminating cell type as a way to pick a cutoff. For example, you would not expect photosynthetic genes to be expressed in pollen, and so you use those to calibrate your cutoff. A reviewer of a pollen RNA-Seq paper I wrote suggested this idea. It made good sense to me so I included it in the final version of the paper. So there is at least one example of this idea "working" in a peer review scenario.
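
A minimal sketch of that calibration, assuming you have a dict of FPKM per gene and a list of marker genes you expect to come only from the contaminating cell type. The gene names, the values and the nearest-rank quantile rule are all hypothetical choices for illustration.

```python
import math

def cutoff_from_markers(fpkm_by_gene, marker_genes, quantile=0.95):
    """Nearest-rank quantile of the FPKMs of contaminant marker genes.
    Genes above this value exceed what contamination alone produced
    for (most of) the markers."""
    values = sorted(fpkm_by_gene[g] for g in marker_genes if g in fpkm_by_gene)
    if not values:
        raise ValueError("none of the marker genes are present in the data")
    rank = max(1, math.ceil(quantile * len(values)))
    return values[rank - 1]

def expressed_genes(fpkm_by_gene, cutoff):
    """Genes whose FPKM is strictly above the cutoff, sorted by name."""
    return sorted(g for g, v in fpkm_by_gene.items() if v > cutoff)

# Hypothetical pollen sample with photosynthesis-like marker genes.
fpkm = {"photo1": 0.8, "photo2": 0.3, "photo3": 1.2, "pollen_gene": 40.0}
markers = ["photo1", "photo2", "photo3"]
cutoff = cutoff_from_markers(fpkm, markers)
print(cutoff, expressed_genes(fpkm, cutoff))
```

Using a high quantile rather than the maximum keeps a single unusually high marker from setting the cutoff on its own when the marker list is long.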

ADD REPLYlink written 14 months ago by Ann1.6k

Of course, the downside of that is that you may filter out potentially novel, unknown functions of other genes simply because you think their reads are contamination.

ADD REPLYlink written 14 months ago by Dan Gaston4.2k

Even if you do DNase treatment you will still have some amount of noise from genomic DNA that remains. In addition to genomic DNA contamination, it could also be that the read was misaligned to that region. If the read is within the intron of a gene, it could also be signal from unprocessed RNA. I have observed instances of other contamination from the lab (e.g. genomic DNA or cDNA from another experiment) as well. Finally, as others have mentioned, transcription is a stochastic process. Every base in the genome is transcribed with some probability. Only a subset of this transcription is biologically significant to most researchers doing gene expression assays. If you have RNA-seq reads that span what appears to be a valid splice site, this gives you a bit more confidence, because exon-exon junction sequences usually do not occur in genomic DNA or unprocessed RNA.

ADD REPLYlink written 12 weeks ago by Malachi Griffith11k