Question

Shrink logFC values after edgeR

1

Entering edit mode

8 months ago

arvind.1 ▴ 10

Good morning,

I am new to RNA-seq data analysis. After performing edgeR , I got logFC values ranging from -20 to 20. Can I manually shrink log fold change (logFC) values by a factor of 0.5 (or any other factor) in the output obtained from edgeR? If not,can you provide me with some functions and packages that can solve this problem? I look forward to hearing from you. Thank you for your time and consideration.

logFC edgeR • 1.6k views

ADD COMMENT • link updated 6 months ago by Gordon Smyth ★ 7.4k • written 8 months ago by arvind.1 ▴ 10

2

Entering edit mode

Please check out the post here in Bioconductor. It has been very nicely explained what you could do or should not do.

ADD REPLY • link 8 months ago by bk11 ★ 3.0k

0

Entering edit mode

Thank you for the reply

ADD REPLY • link 8 months ago by arvind.1 ▴ 10

0

Entering edit mode

Thank you for your reply.

ADD REPLY • link 8 months ago by arvind.1 ▴ 10

score 2 · Answer 1 · 2024-02-02

Why is getting logFC=20 causing a "problem" for you? You would only get such a large logFC from edgeR if the counts are zero in one group and in the hundreds of thousands in the other. Getting a large logFC seems a fair representation in such cases.

Anyway, you can shrink the log-fold-changes by as much or as little as you want in edgeR. Just use a larger value of prior.count when you run glmFit (or glmQLFit). Or you can use the predFC function, which will give the same result as glmFit.

By default, the model fitting functions in edgeR are set to do light shrinkage of the logFCs by setting prior.count = 0.125. If you increase prior.count to 5, that will do quite heavy shrinkage. Note that shrinkage is not the same as simply multiplying the logFC by a scaling factor. Increasing prior.count has much more effect on the logFC for low count genes than on logFCs for highly expressed genes because the logFC for low count genes are less certain. Changing prior.count only affects the estimated logFCs. All the other output including test statistics and p-values remain the same.

The reason why the argument is called prior.count is because the shrinkage corresponds to a Bayesian model where the size of the counts in a prior dataset encapsulates the strength of the prior belief that the logFCs should be close to zero. Long ago (15 years ago) we toyed around with the idea of estimating an optimal value for the prior.count automatically from the data (it's a chapter in Belinda Phipson's thesis) but eventually decided it would be safer and better to use preset values in the public edgeR package.

To illustrate the use of prior.count, here is a toy example where edgeR does return logFC=20:

> counts <- matrix(c(0,1000,135000,2010),2,2)
> colnames(counts) <- c("A","B")
> rownames(counts) <- c("Gene1","Gene2")
> counts
         A      B
Gene1    0 135000
Gene2 1000   2010
> group <- c("A","B")
> design <- model.matrix(~group)
> fit <- glmFit(counts,design,dispersion=0,lib.size=rep(10e6,2))
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient:  groupB 
        logFC  logCPM        LR     PValue        FDR
Gene1 20.0426 12.7207 187149.74 0.0000e+00 0.0000e+00
Gene2  1.0071  7.2355    345.57 3.9103e-77 3.9103e-77

Changing the default setting to prior.count=5 shrinks the first logFC considerably while leaving the second almost unaffected:

> fit <- glmFit(counts,design,dispersion=0,prior.count=5,lib.size=rep(10e6,2))
> lrt <- glmLRT(fit)
> topTags(lrt)
Coefficient:  groupB 
        logFC  logCPM        LR     PValue        FDR
Gene1 14.7207 12.7207 187149.74 0.0000e+00 0.0000e+00
Gene2  1.0036  7.2355    345.57 3.9103e-77 3.9103e-77