Limma false positives
1
0
Entering edit mode
23 months ago
bifi.ftw • 0

Dear Community,

I am still wrestling with limma, trying to apply it to my metabolomics data.

The main problem is the fact that, with my data, limma produces produces false positives, i.e. significant p-values when they clearly should not be significant.

I already had a very similar problem, which could be solved by using a started log-transformation to stabilize the variance.

With my new data set I have the same problem again. I suspect that I failed to stabilize the variance enough, which leads to false positives.

plotSA

I prepared my data the following way:

  • Normalized raw data
  • Removed metabolites where Missingness > 10% across test samples
  • Started log-transformation: log2( x + 0.5e-4 )
  • Blinded the experiment, because it is unpublished data.

For example for compound 8 enter image description here I get a limma p-value of 0.003, which is clearly wrong.
A Welch's test results in p = 0.4692.
I calculated the ordinary limma p-value and got 0.0035, which makes me think that I do not meet the requirements for linear regression.

Is my suspicion, correct or did I go wrong somewhere else? I would be very thankful, if somebody could point me into the right direction.

I used the following code:

# Load limma
library(limma)

# Load data
m <- as.matrix(read.csv(url('https://pastebin.com/raw/gppyEXyV'), row.names=1))
design <- read.csv(url('https://pastebin.com/raw/PwkP7u81'), row.names=1)
cont_m <- read.csv(url('https://pastebin.com/raw/6p5UNRK3'), row.names=1)

# Limma
fit <- lmFit(m, design)
fit <- contrasts.fit(fit, cont_m)
fit <- eBayes(fit, trend=TRUE)

# Check residuals
plotSA(fit)

# Data of compound in question
m['compound_8',c(10,11,12,45,46,47)]
dotchart(m['compound_8',c(10,11,12,45,46,47)])

# Limma result
topTable(fit, n=Inf) %>%
    tibble::rownames_to_column("unique_name") %>%
    filter(unique_name == 'compound_8')

# Welch's test
t.test(m['compound_8',c(45,46,47)],m['compound_8',c(10,11,12)])

# Ordinary p-value from limma
tstat.ord <- fit$coef / fit$stdev.unscaled / fit$sigma
p.value.ord <- 2 * pt( abs(tstat.ord), df=fit$df.residual, lower.tail=FALSE)
p.value.ord['compound_8',]

Thank you very much for your time!

limma metabolomics • 692 views
ADD COMMENT
0
Entering edit mode
23 months ago
Gordon Smyth ★ 7.0k

Actually, you have no positives at all, let alone false positives. The smallest FDR in your comparison is 25%. You seem to be fishing around for positive results, ignoring adjustments for multiple testing, and that is just not a statistically defensible way to process a multivariate dataset.

The problem with your data is that compound_8 has a massive outlier. It looks very much like compound 8 was actually only detected in only one sample out of 65, and you have set all the non-detected cases to a small expression value far from the detected value. In other words, you have processed the data in such a way that even moderately expressed values are outliers compared to non-detected values.

If I was you, I would filter out all compounds that are detected in fewer than 3 samples. Filtering unexpressed features is a standard procedure in limma pipelines. For your data, that would remove compounds 1, 2, 3, 5, 8, 9 and 13. Compound 2 is never detected at a worthwhile level in any sample.

There's lots of other things you might consider. The data don't seem to be normalized -- some samples have systematically higher or lower expression than others. From an analysis point of view, you could estimate sample quality weights or set robust=TRUE in the call to eBayes.

ADD COMMENT

Login before adding your answer.

Traffic: 1577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6