cutoff for expressed genes in DESeq2
2
0
Entering edit mode
4.5 years ago
olechnwin ▴ 60

Hello,

I've analyzed my RNA-seq data using DESeq2 and now I am trying to see whether some gene of interest are expressed in our dataset. I found this paper which uses RPKM cut off of 0.3 to defined genes as expressed. I found out that I can use the fpkm function in DESeq2 to convert the counts to fpkm/rpkm. However, reading some posts in here, it seems that rlog or vst normalized counts are recommended instead. Are there any common rule of thumb for the rlog or vst-normalized counts to define expressed genes? Thanks in advance!

deseq RNA-Seq • 4.8k views
ADD COMMENT
0
Entering edit mode

it seems that rlog or vst normalized counts are recommended instead

These are recommended when you need to compare gene expression between conditions. To see whether a gene is expressed within a single condition, RPKM/FPKM are fine IMHO.

ADD REPLY
0
Entering edit mode

Why don't you convert them to TPM and use a global threshold? I noticed that "When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample.", when reading this blog post.

ADD REPLY
0
Entering edit mode

Previous discussion about defining expression cutoffs: TPM values of expressed genes

ADD REPLY
1
Entering edit mode
4.5 years ago

This is a very "philosophical" question, but to cut it short, many people in the past have used a minimum threshold of FPKM >= 1. This value is extremely eye-metric and not based on robust statistics.

It will, however, eliminate a lot of noise, and it will be easy to reference why you chose that value (e.g. "because many others did (ref1, ref2, ref3, ..., refn).

ADD COMMENT
1
Entering edit mode

My rule of thumb is FPKM >= 1, that being said I always generate a density plot of the FPKMs to view the distribution - in R I run plot(density(log2(fpkm))). This (almost always) generates a bimodal distribution and I've found that 1 is very close to the trough of that distribution. However, there are exceptions and in those cases I use the empirical cutoff, which I can defend using that density plot.

ADD REPLY
1
Entering edit mode

This might be the best suggestion you'll ever receive about it.

ADD REPLY
1
Entering edit mode
4.5 years ago

The thing about RPKM is it does't mean anything outside of the context of its samples. One sample could have very low read count, and an RPKM of 1 might mean only 1 read, while another sample could have an RPKM of 1, and that represents a hundred reads. 1 read for a gene is noise, 100 reads is not. An RPKM threshold of 0.3 might work for them with their coverage, but that might be very wrong for your samples if your coverage is very different.

TPM is going to have the same problem, it's going to flatten out how many raw counts you actually have for that sample and gene.

Also note that that paper is 10 years old, only published about 2 years after Illumina started putting sequencers in the field. They aligned with Bowtie! The original DESeq paper wasn't published for another year.

rlogs and vsts stay closer to the original raw count. If a sample has a vst corrected value of 5 for a gene, that doesn't mean there are 100 reads. It means there are less than 10.

ADD COMMENT

Login before adding your answer.

Traffic: 2710 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6