Question: cutoff for expressed genes in DESeq2
0
gravatar for olechnwin
10 months ago by
olechnwin30
olechnwin30 wrote:

Hello,

I've analyzed my RNA-seq data using DESeq2 and now I am trying to see whether some gene of interest are expressed in our dataset. I found this paper which uses RPKM cut off of 0.3 to defined genes as expressed. I found out that I can use the fpkm function in DESeq2 to convert the counts to fpkm/rpkm. However, reading some posts in here, it seems that rlog or vst normalized counts are recommended instead. Are there any common rule of thumb for the rlog or vst-normalized counts to define expressed genes? Thanks in advance!

rna-seq deseq • 601 views
ADD COMMENTlink modified 10 months ago by swbarnes28.2k • written 10 months ago by olechnwin30

it seems that rlog or vst normalized counts are recommended instead

These are recommended when you need to compare gene expression between conditions. To see whether a gene is expressed within a single condition, RPKM/FPKM are fine IMHO.

ADD REPLYlink written 10 months ago by Carlo Yague5.0k

Why don't you convert them to TPM and use a global threshold? I noticed that "When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample.", when reading this blog post.

ADD REPLYlink written 10 months ago by Mahdiar Sadeghi0

Previous discussion about defining expression cutoffs: TPM values of expressed genes

ADD REPLYlink written 10 months ago by igor11k
1
gravatar for Macspider
10 months ago by
Macspider3.1k
Vienna - BOKU
Macspider3.1k wrote:

This is a very "philosophical" question, but to cut it short, many people in the past have used a minimum threshold of FPKM >= 1. This value is extremely eye-metric and not based on robust statistics.

It will, however, eliminate a lot of noise, and it will be easy to reference why you chose that value (e.g. "because many others did (ref1, ref2, ref3, ..., refn).

ADD COMMENTlink written 10 months ago by Macspider3.1k
1

My rule of thumb is FPKM >= 1, that being said I always generate a density plot of the FPKMs to view the distribution - in R I run plot(density(log2(fpkm))). This (almost always) generates a bimodal distribution and I've found that 1 is very close to the trough of that distribution. However, there are exceptions and in those cases I use the empirical cutoff, which I can defend using that density plot.

ADD REPLYlink written 10 months ago by shawn.w.foley1.2k
1

This might be the best suggestion you'll ever receive about it.

ADD REPLYlink written 10 months ago by Macspider3.1k
1
gravatar for swbarnes2
10 months ago by
swbarnes28.2k
United States
swbarnes28.2k wrote:

The thing about RPKM is it does't mean anything outside of the context of its samples. One sample could have very low read count, and an RPKM of 1 might mean only 1 read, while another sample could have an RPKM of 1, and that represents a hundred reads. 1 read for a gene is noise, 100 reads is not. An RPKM threshold of 0.3 might work for them with their coverage, but that might be very wrong for your samples if your coverage is very different.

TPM is going to have the same problem, it's going to flatten out how many raw counts you actually have for that sample and gene.

Also note that that paper is 10 years old, only published about 2 years after Illumina started putting sequencers in the field. They aligned with Bowtie! The original DESeq paper wasn't published for another year.

rlogs and vsts stay closer to the original raw count. If a sample has a vst corrected value of 5 for a gene, that doesn't mean there are 100 reads. It means there are less than 10.

ADD COMMENTlink written 10 months ago by swbarnes28.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 794 users visited in the last hour