Question

Limma false positives

0

Entering edit mode

23 months ago

bifi.ftw • 0

Dear Community,

I am still wrestling with limma, trying to apply it to my metabolomics data.

The main problem is the fact that, with my data, limma produces produces false positives, i.e. significant p-values when they clearly should not be significant.

I already had a very similar problem, which could be solved by using a started log-transformation to stabilize the variance.

With my new data set I have the same problem again. I suspect that I failed to stabilize the variance enough, which leads to false positives.

plotSA

I prepared my data the following way:

Normalized raw data
Removed metabolites where Missingness > 10% across test samples
Started log-transformation: log2( x + 0.5e-4 )
Blinded the experiment, because it is unpublished data.

For example for compound 8 enter image description here I get a limma p-value of 0.003, which is clearly wrong.
A Welch's test results in p = 0.4692.
I calculated the ordinary limma p-value and got 0.0035, which makes me think that I do not meet the requirements for linear regression.

Is my suspicion, correct or did I go wrong somewhere else? I would be very thankful, if somebody could point me into the right direction.

I used the following code:

# Load limma
library(limma)

# Load data
m <- as.matrix(read.csv(url('https://pastebin.com/raw/gppyEXyV'), row.names=1))
design <- read.csv(url('https://pastebin.com/raw/PwkP7u81'), row.names=1)
cont_m <- read.csv(url('https://pastebin.com/raw/6p5UNRK3'), row.names=1)

# Limma
fit <- lmFit(m, design)
fit <- contrasts.fit(fit, cont_m)
fit <- eBayes(fit, trend=TRUE)

# Check residuals
plotSA(fit)

# Data of compound in question
m['compound_8',c(10,11,12,45,46,47)]
dotchart(m['compound_8',c(10,11,12,45,46,47)])

# Limma result
topTable(fit, n=Inf) %>%
    tibble::rownames_to_column("unique_name") %>%
    filter(unique_name == 'compound_8')

# Welch's test
t.test(m['compound_8',c(45,46,47)],m['compound_8',c(10,11,12)])

# Ordinary p-value from limma
tstat.ord <- fit$coef / fit$stdev.unscaled / fit$sigma
p.value.ord <- 2 * pt( abs(tstat.ord), df=fit$df.residual, lower.tail=FALSE)
p.value.ord['compound_8',]

Thank you very much for your time!

limma metabolomics • 692 views

ADD COMMENT • link updated 23 months ago by Gordon Smyth ★ 7.0k • written 23 months ago by bifi.ftw • 0

score 0 · Answer 1 · 2022-05-06

Actually, you have no positives at all, let alone false positives. The smallest FDR in your comparison is 25%. You seem to be fishing around for positive results, ignoring adjustments for multiple testing, and that is just not a statistically defensible way to process a multivariate dataset.

The problem with your data is that compound_8 has a massive outlier. It looks very much like compound 8 was actually only detected in only one sample out of 65, and you have set all the non-detected cases to a small expression value far from the detected value. In other words, you have processed the data in such a way that even moderately expressed values are outliers compared to non-detected values.

If I was you, I would filter out all compounds that are detected in fewer than 3 samples. Filtering unexpressed features is a standard procedure in limma pipelines. For your data, that would remove compounds 1, 2, 3, 5, 8, 9 and 13. Compound 2 is never detected at a worthwhile level in any sample.

There's lots of other things you might consider. The data don't seem to be normalized -- some samples have systematically higher or lower expression than others. From an analysis point of view, you could estimate sample quality weights or set robust=TRUE in the call to eBayes.