Question: Selecting Deferentially Expressed Genes in RNASeq data analysis - DESEq2 and Cuffdiff
0
gravatar for Sreeraj Thamban
13 months ago by
Indian Institute of Science Education and Research
Sreeraj Thamban100 wrote:

Hi all, During Differential gene expression analysis of RNASeq data (DESEq2 or Cufdiff) which is best method to filter differentially expressed genes? Should I go with all the genes having adjusted P value < 0.05 or should I filter them based on a log2 Fold change cut-off?

Thank you

rna-seq deseq2 • 993 views
ADD COMMENTlink modified 13 months ago by Kevin Blighe30k • written 13 months ago by Sreeraj Thamban100
1

You can also try an integrated approach like metaseqR with more than one algorithms and combined p-values. In this way you don't have to struggle with comparisons as the method combines the "advantages" of many algorithms towards the optimization of precision-recall tradeoff. Disclaimer: I am the author of that package.

ADD REPLYlink written 13 months ago by Panagiotis Moulos20

Thank you Moulos, I will definitely give metaseqR a try.

ADD REPLYlink written 13 months ago by Sreeraj Thamban100
5
gravatar for Kevin Blighe
13 months ago by
Kevin Blighe30k
Republic of Ireland
Kevin Blighe30k wrote:

Hi Sreeraj,

Traditionally, people usually start the filtering process at FDR Q<0.05 (i.e. 5% FDR or adjusted P<0.05) and also apply a log base 2 fold-change cut-off of absolute 2 (>|2|). The type of false discovery rate used is usually Benjamini-Hochberg. Bonferroni correction, which is very stringent, is not used as much. There are other types of correction for false discovery rate, too.

The use of both adjusted P value and log base 2 fold-change cut-off is important because many genes may have a highly significant P value but very low fold-change difference, or vice-versa. This is mainly due to differences in counts. For example, if we have gene1 and gene2 at mean expression levels of 12 and 6 across our cases and controls, respectively, are they significantly different? What if gene3 and gene4 had expression levels of 78 and 39? The P values in both cases may be different but the fold-changes would be the same. Other more complex scenarios can occur.

If you have used DESeq2, then I assume that your input to it was raw counts (not normalised counts from Cufflinks)? DESeq2 normalises raw counts 'quite well' and produces more credible statistics, from my experience. For one, DESeq2 deals quite well with the variability in counts that occurs from RNA-seq data.

Finally, if you get nothing significant from FDR Q<0.05 and log2 FC>|2|, you will have to consider relaxing to, initially, FDR Q<0.1, which is still somewhat acceptable. After that, I would consider relaxing the log2 FC to |1.5|, and so on.

Hope that this helps.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Kevin Blighe30k

Thank you Kevin this cleared my doubts. I have one more question, I took 1.3 as fold change cutoff to filter DEGs during the analysis, is 1.3 is an acceptable fold change? or should I increase?

Thank you

ADD REPLYlink written 13 months ago by Sreeraj Thamban100
1

Hey Sreeraj,

It's not great but, I mean, I used log2 1.3 in the past. On the linear scale it equates to ~2.5 fold-change difference, which still means that the expression is more than double. Coupled with a good adjusted P value, you can probably justify it. In certain tissues, like blood, getting large fold-change differences is difficult.

ADD REPLYlink written 13 months ago by Kevin Blighe30k

Hi Kevin,

In Deseq2 to get DEGs based on FDR < 0.1 and FC > 1.5

dds <- DESeq(dds)

After this to select siginificant DEGs I have seen two analysis.

First analysis:

res <- results(dds, alpha = 0.1)
res2 <- res[res$log2FoldChange >= 1, ]
summary(res2)
out of 10789 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 148, 1.2%
LFC < 0 (down)     : 92, 0.77%

Second one:

res <- results(dds, lfcThreshold = log2(1.5), alpha = 0.1)
summary(res)
out of 11949 with nonzero total read count
adjusted p-value < 0.1
LFC > 0.58 (up)    : 15, 0.13%
LFC < -0.58 (down) : 5, 0.042%
outliers [1]       : 81, 0.68%
low counts [2]     : 7603, 64%
(mean count < 12)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Among the both analysis which is the right one?

ADD REPLYlink written 3 months ago by Bioinfo270

The main difference is that, in the first one, you do not set a value for lfcThreshold; whereas, in the second, you set a value for it.

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe30k

Yes, you are right. So, if I want to select DEGs based on FC > 1.5 and FDR < 0.1 second analysis is the right one. But in this [Filter by Fold change in DESeq2] e.rempel says that first approach is the right one.

ADD REPLYlink written 3 months ago by Bioinfo270

There is no right or wrong. You set the thresholds to what you want them to be.

ADD REPLYlink written 3 months ago by Kevin Blighe30k
1

res <- results(dds, alpha = 0.1) is equivalent to absolute Log2FC>0 & FDR<0.1

res <- results(dds, lfcThreshold = log2(1.5), alpha = 0.1) is equivalent to absolute Log2FC > 0.585 & FDR<0.1

... ...

res <- results(dds, lfcThreshold = 1.5, alpha = 0.1) is equivalent to absolute Log2FC > 1.5 & FDR<0.1

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe30k

Yes, In this res <- results(dds, lfcThreshold = log2(1.5), alpha = 0.1) is nothing but selecting genes with FC > 1.5 and FDR < 0.1 right?

ADD REPLYlink modified 3 months ago • written 3 months ago by Bioinfo270

If I may get very pedantic here, these are not strictly thresholds in the sense that I think that you believe. These parameters are altering what is the hypothesis being tested. If you look at the info for the results() function:

lfcThreshold: a non-negative value which specifies a log2 fold change
          threshold. The default value is 0, corresponding to a test
          that the log2 fold changes are equal to zero. The user can
          specify the alternative hypothesis using the ‘altHypothesis’
          argument, which defaults to testing for log2 fold changes
          greater in absolute value than a given threshold. If
          ‘lfcThreshold’ is specified, the results are for Wald tests,
          and LRT p-values will be overwritten.

altHypothesis: character which specifies the alternative hypothesis,
          i.e. those values of log2 fold change which the user is
          interested in finding. The complement of this set of values
          is the null hypothesis which will be tested. If the log2 fold
          change specified by ‘name’ or by ‘contrast’ is written as
          beta , then the possible values for ‘altHypothesis’ represent
          the following alternate hypotheses:

            • greaterAbs: |beta| > lfcThreshold , and p-values are
              two-tailed

            • lessAbs: |beta| < lfcThreshold , NOTE: this requires that
              ‘betaPrior=FALSE’ has been specified in the previous
              ‘DESeq’ call.  p-values are the maximum of the upper and
              lower tests.

            • greater: beta > lfcThreshold

            • less: beta < -lfcThreshold

So, lfcThreshold is not strictly a cut-off for differential expression and statistical significance in the sense that we typically think. Reading the text closely, the default for lfcThreshold is zero / nil, which means that the hypothesis being tested is that genes will be equally up- and down-regulated in your samples. Unless you really know what you're doing, I suggest leaving this at the default and just using:

res <- results(dds, alpha = 0.1)

...or, if your intended FDR cut-off is 0.05:

res <- results(dds, alpha = 0.05)

Hope that this helps.

Kevin

PS - if you do want to set lfcThreshold, then it expects a vale on the log2 scale. So, it you set it to log2(1.5), then you are specifying log2(1.5) = 0.585

ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1729 users visited in the last hour