Interpretation of logFC in limma-voom contrasts
10 days ago
donan • 0

Hi everyone, I ran limma-voom on RNA-seq data and I am trying to understand the results. I am using contrasts to compare group A versus groups B1, B2, and B3; however, I don't really understand my logFC values. From what I understand, for a gene, a logFC of 1 means that its expression in group A is 2 times its expression in group B1. However, this does not seem to be the case for my data.

First, my data are normalized, but so far nobody can tell me how. The values seem to be on a log2 scale (they are between 0 and 15, and 2^x gives values similar to count data), but I don't know whether any other normalization was applied.

However, I don't think this is the problem (my volcano plot also looks normal). I followed this video and found similar results (most of my logFC values are between -2 and 2), but I don't understand how to interpret the logFC in this case. In the video at 12:50, he says "it looks as if this is a simple minus of each other", and that matches my understanding. Looking at the histogram of a particular deregulated gene (say, group A versus group B1), if most of the values of group A are close to 10 and most of the values of group B1 are close to 6, the logFC will be close to 4, which is 10 - 6. However, logFC = 4 means a fold change of 2^4 = 16, which obviously does not match my data.

Is my reasoning right, or am I missing something? Do you think I should look at my code/data again?

Thanks for your help.

limma logFC limma-voom • 1.2k views

a) Limma-voom input is raw counts

b) add code

c) follow the limma user guide, not random YouTube videos

Cannot comment further because mainly b) is missing.


Thank you for your answer,

a) I know, however when I skip the voom function, while most of the logFC are between -2 and 2 (which is already a lot for my data), the maximum logFC is 5 (i.e. a fold change of 32), which is absolutely not realistic when looking at this gene. However, when using voom, most of the logFC are between -0.5 and 0.5, which is more realistic. For a given gene, 2^logFC really matches with the difference between groups seen on the histogram. I feel like voom makes sense here, I might be wrong but I don't really understand why.

b) count_mat is a count matrix with samples in column and gene in rows.

count_mat_DGE <- DGEList(counts = count_mat, group = condition)
design <- model.matrix(~ condition + batch)
# If using voom
vfit <- voom(count_mat_DGE$counts, design)
fit <- lmFit(vfit, design)
# If not using voom
fit <- lmFit(count_mat_DGE$counts, design)
# In both cases
efit <- eBayes(fit)
tfit <- treat(efit, lfc = 0) # to keep the option of changing the lfc threshold
# As an example, A vs B1
A.vs.B1 <- topTreat(tfit, coef = "conditionB1", n = Inf)
A.vs.B1$logFC <- A.vs.B1$logFC * (-1) # because the ref level here is A, but I want to see how A behaves compared to B1

Without voom, the most deregulated genes in A.vs.B1 have absolute logFC between 2 and 5, which is never the case in reality. With voom, the most deregulated genes have absolute logFC between 0.3 and 1, which is in fact what I see when looking at a histogram.

So I have the answer to my original post, and I do understand the logFC correctly; however, the issue now is that my logFC values only seem correct when using voom, and I don't know why.

c) I've been reading and comparing a lot of documentation (Bioconductor, GitHub, etc.) on limma and voom and how to write the code, in order to have a consistent and well-defined method; I am not just following a random YouTube video.

Edit: Most of the volcano plots I see online show absolute logFC values larger than 0.5. But when I look at my histograms, my guess is that no logFC is larger than 1. Am I still missing something? Or is it just because of the data normalization?


You really need to follow the user guide if you're new. The voom part is wrong because you do not calculate any normalization prior to voom (calcNormFactors), so you're running voom on the raw counts without correcting for sequencing depth and composition, and the "not using voom" part is even more wrong because you are fitting models directly on the raw counts, which makes no sense.
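For reference, the user-guide pipeline the comment above refers to would look roughly like this, assuming `count_mat` really held raw counts (object names taken from the OP's code; a sketch, not a complete analysis):

```r
library(edgeR)  # also loads limma

dge <- DGEList(counts = count_mat, group = condition)
dge <- calcNormFactors(dge)             # TMM: corrects for depth and composition
design <- model.matrix(~ condition + batch)
v <- voom(dge, design)                  # pass the DGEList, not dge$counts
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = "conditionB1", n = Inf)
```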


The OP's data are not raw counts, as noted in the original post:

First, my data are normalized, but for the moment nobody can tell me how.


I think the biggest issue is that I don't know what normalization procedure was applied to my data, as I don't know if calcNormFactors was already used, if they only applied log2, etc.


Missed that. OK, then voom makes no sense. You can try limma-trend if the data are already on a log2 scale; see the user guide.


count_mat is a count matrix

How can that be, since your original post says that you have some sort of normalized log2 quantity, rather than actual counts? What precise quantities did you put into count_mat?

However I don't think this is the problem

This absolutely is the problem!

You have to know, at very least, whether the quantities have been normalized for library size or not. Without that, I can't see how you can approach the data in any sensible way.


Thank you for your answer. Sorry, it was unclear: count_mat is not really a count matrix, but the normalized matrix I was talking about. For each cell, I summed the (normalized) values over all genes, and the sums are close across cells. The coefficient of variation is 0.8%, which makes me think the quantities have been normalized for library size.


There is no point in guessing. A reproducible analysis starts from a matrix of raw counts and documentation of how it was created. Otherwise, what's the point?

1 day ago

Your reasoning is spot-on in identifying the apparent mismatch, but I think the key issue here is a subtle confusion about what those "values close to 10 and 6" in your histogram actually represent on the original count scale. Let me break this down step by step and explain why your logFC of ~4 does correspond to a 16-fold change: the fold change is multiplicative, not additive, when we exponentiate back to counts.

Quick Recap of limma-voom and logFC Interpretation

  • voom transformation: In limma-voom (from the limma package), your input is typically raw or normalized counts (e.g., CPM or similar). voom then applies a log2 transformation (roughly log2(count + offset)) to stabilize variance across expression levels. The downstream model fits on these log2-transformed values, so your fitted means (and thus logFC) are differences on the log scale.
  • logFC meaning: A logFC of 1 means the mean log2-expression in group A is 1 unit higher than in B1, which translates to a fold change (FC) of 2^1 = 2 on the original count scale (i.e., A has twice as many counts as B1, on average).
  • Histogram context: When you plot histograms or boxplots of a gene's values per group (as in that video), you're almost certainly visualizing the log2-transformed data (post-voom). That's why the values cluster around 10 and 6—they're not raw counts (which would be much larger, like thousands). Raw counts for log2=10 would be ~2^10 = 1024 reads, and for log2=6 ~64 reads.

Why Your Example Matches Perfectly

Let's use your numbers:

  • Mean log2 in A ~ 10 -> Mean counts in A ~ 2^10 = 1,024
  • Mean log2 in B1 ~ 6 -> Mean counts in B1 ~ 2^6 = 64
  • logFC ~ 10 - 6 = 4 (difference on log scale)
  • Fold change = 2^4 = 16 (1,024 / 64 = 16, exactly!)

So, it does match: Group A has 16 times more expression (counts) than B1 for that gene. The "simple minus" the video mentions (at ~12:50) is just the logFC calculation itself—it's literally the difference of the fitted log2 means between groups. But to interpret biological change, you always exponentiate: FC = 2^|logFC| (and sign indicates direction).
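To make the arithmetic concrete, here is the same check in R, using the numbers from the example above:

```r
a_mean  <- 10   # mean log2 value in group A
b1_mean <- 6    # mean log2 value in group B1

logFC <- a_mean - b1_mean   # 4: logFC is a subtraction on the log2 scale
2^logFC                     # 16: fold change on the original count scale
2^a_mean / 2^b1_mean        # 16 as well, since 2^(10 - 6) == 2^10 / 2^6
```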

If your histograms were showing raw counts, you'd see massive spreads (e.g., 500–2,000 for A, 20–200 for B1), and the logFC would still compute the same way after transformation. But since voom works on logs, your plots are likely already there.

About Normalization

You mentioned uncertainty on how the data were normalized—fair point, as it can trip people up:

  • Counts 0–15 sound like log2(CPM) or log2(TPM) (common in voom pipelines). The "2^x gives values similar to count data" confirms it's log2-scale.
  • limma-voom often pairs with edgeR's TMM (trimmed mean of M-values) for between-sample normalization, which adjusts for library size/composition without altering the logFC interpretation.
  • Your volcano plot looking "normal" (logFC -2 to 2 range) is a good sign—extreme logFCs (>4–5) are rarer in well-powered RNA-seq, especially for subtle effects.

To double-check:

  • Run plotSA(fit) after eBayes, or plotMDS on the voom output, to confirm voom's transformation looks good (mean-variance trend stabilized).
  • For a gene, extract the raw counts with as.data.frame(dge$counts) (if you have the DGEList object) and plot per-group boxplots pre-voom to see the fold change visually.

Kevin


In limma-voom (from the limma package), your input is typically raw or normalized counts (e.g., CPM or similar).

No, voom can only accept raw counts. It cannot accept input in the form of normalized counts, and certainly not CPM or similar. You may be getting confused with the lmFit pipelines in limma, which can accept logCPM values, but even lmFit cannot accept CPM on an unlogged scale.

voom then applies a log2 transformation (roughly log2(count + offset))

It is correct that voom applies an offset and a log2 transformation, but you are forgetting the need for library size normalization, which is crucial. I already made this point in a comment to the OP above.

voom does use log2(count + offset) internally, without library size correction, to predict the variance trend, but the linear models and downstream analysis is on log2CPM values.
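In code, the logCPM values that the downstream linear models are fitted to look roughly like the following sketch (it mirrors the transformation described above, but is not a substitute for calling voom, which also estimates the precision weights):

```r
library(edgeR)

# Given a raw count matrix `counts` with samples in columns:
nf <- calcNormFactors(counts)            # TMM normalization factors
lib.size <- colSums(counts) * nf         # effective library sizes
logCPM <- t(log2(t(counts + 0.5) / (lib.size + 1) * 1e6))
```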

Counts 0–15 sound like log2(CPM) or log2(TPM) (common in voom pipelines)

Again, I suspect you are thinking of limma-trend with lmFit rather than limma-voom.

For transcript-level analyses, we recommend the use of divided transcript counts instead of log2(TPM), see Baldoni et al (2025):

Baldoni PL#, Chen L#, Li M, Chen Y, Smyth GK (2025). Dividing out quantification uncertainty enables assessment of differential transcript usage with limma and edgeR. Nucleic Acids Research (to appear). bioRxiv https://doi.org/10.1101/2025.04.07.647659


Thank you for the clarifications, Gordon.


Overall, you are right that voom will log-transform the input values, leading to the logFC relationships that you describe. Our worry however is that the OP's data seems to already be log-transformed, and applying a second log-transformation to already logged values, which is what voom would do if input such data, is quite wrong.


Indeed - I am in agreement. We don't always have all info available when we provide answers here. Thanks Prof. Smyth!

18 hours ago
Gordon Smyth ★ 8.5k

Since you have normalized RNA-seq data, apparently on a log2 scale and between 0 and 15, it may be that you have logCPM values, which are output by a number of Bioconductor packages. If that is so, then you can apply the limma-trend pipeline instead of limma-voom. This means that you skip voom but go straight to lmFit and then eBayes with trend=TRUE. Use plotSA to see the fitted variance trend. The purpose of voom is to convert counts into logCPM values with associated precision weights. If you already have logCPM to start with, then obviously you don't want to redo the transformation with voom.

Using limma-trend will result in a much closer relationship between the raw group means and the output logFC, because limma-trend will fit linear models directly to the values that you supply, instead of trying to reprocess them as if they were counts.
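As a minimal sketch, the limma-trend route described above would look something like this, assuming `logcpm_mat` (an illustrative name) holds the normalized log2 values and `condition` and `batch` are the sample covariates from the OP's code:

```r
library(limma)

design <- model.matrix(~ condition + batch)
fit <- lmFit(logcpm_mat, design)   # fit directly on the supplied log2 values
fit <- eBayes(fit, trend = TRUE)   # limma-trend: mean-variance trend in the EB step
plotSA(fit)                        # inspect the fitted variance trend
topTable(fit, coef = "conditionB1", n = Inf)
```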

However, I feel that you really should go back to the people who supplied the data and ask about its provenance.


Thanks Gordon - it's always pleasant and provides a 'ground truth' when the very author themselves respond!

