Limma model gives weird results
4 months ago
leranwangcs ▴ 150

Hi,

I have been trying to design a limma model for my analysis. I tried two different datasets: one without prevalence filtering, which contains 6333 contigs, and one with 3% prevalence filtering, which has only 182 contigs left. My model design is:

model.design <- model.matrix(~ Dx.Status + Country + month + solid_food + genotype + feeding_type, metadata.clean)
model.fit <- voomLmFit(contig.abundance.table, model.design, plot = FALSE)
model.fit <- eBayes(model.fit)
sig.model.res <- topTable(model.fit, coef = "DiseaseOn", number = Inf) %>% filter(adj.P.Val < 0.05)

However, I got 0 significant contigs.

And for the other dataset, which contains 6333 contigs, I used the same model but got 6280 significant contigs. I checked the result table: over 4000 of the 6280 contigs have exactly the same adjusted p-value.

I wonder whether there is anything wrong with my model design, and how to interpret these two outputs.

Thanks!

model multivariate Limma • 1.0k views

The code that you show cannot be correct, because the topTable coef refers to a term that is not in the design matrix. The code as shown will give an error.

Also, I don't know what you mean by "prevalence filtering". You would need to give more context.

As a more minor comment, you could set p=0.05 in topTable, then you wouldn't need filter().
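As a sketch of that suggestion (reusing the fit object from the question; "Dx.StatusOn" is a placeholder coefficient name that must match a column of the design):

```r
# topTable can apply the adjusted-p-value cutoff directly via p.value,
# so the downstream filter() on adj.P.Val becomes unnecessary.
sig.model.res <- topTable(model.fit, coef = "Dx.StatusOn",
                          number = Inf, p.value = 0.05)
```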


Thanks! There was a typo in the model; it was supposed to be:

model.design <- model.matrix(~ Dx.Status + Country + month + solid_food + genotype + feeding_type, metadata.clean)
model.fit <- voomLmFit(contig.abundance.table, model.design, plot = FALSE)
model.fit <- eBayes(model.fit)
sig.model.res <- topTable(model.fit, coef = "Dx.StatusOn", number = Inf) %>% filter(adj.P.Val < 0.05)

By prevalence filtering, I mean I only keep contigs that are present in more than 3% of the samples. This step caused a dramatic drop in the number of contigs (6333 contigs -> 182 contigs).
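A minimal sketch of such a filtering step, assuming contig.abundance.table is a contigs-by-samples count matrix (the object name is taken from the post):

```r
# Prevalence = fraction of samples in which a contig is observed (count > 0)
prevalence <- rowMeans(contig.abundance.table > 0)
# Keep contigs present in more than 3% of samples
contig.abundance.table <- contig.abundance.table[prevalence > 0.03, ]
```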


However I got 0 significant contig.

That alone does not necessarily indicate a problem. There could in reality be no differences, or noise/unwanted variation may keep you from detecting them.

And for the other datasets that contains 6333 contigs, I used the same model but got 6280 significant contigs.

That strongly suggests a flaw in the code, the dataset, or the normalization. Nearly all genes being significant is incompatible with how omics data work, since the readout is usually relative. This often happens, for example, when people use an intercept-less design and then test the wrong coefficients.
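The intercept point can be illustrated with a toy two-level factor: the coefficient names, and what they mean, differ between the two parameterizations, so testing a coefficient from the wrong one gives a group mean rather than a group difference.

```r
grp <- factor(c("Control", "Control", "Disease", "Disease"))
colnames(model.matrix(~ grp))
# "(Intercept)" "grpDisease"  -> grpDisease is the Disease-vs-Control difference
colnames(model.matrix(~ 0 + grp))
# "grpControl" "grpDisease"   -> each coefficient is a group mean, not a difference
```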

I checked the result table, 4000+ out of 6280 contigs have exactly the same p-adj values.

Ties in adj.P.Val are not worrying on their own, as the BH correction often produces them. This assumes the code and data were handled correctly.
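A quick illustration with base R's p.adjust: the BH step-up procedure takes a running minimum of p*(n/rank), so runs of nearby p-values often collapse to the same adjusted value.

```r
p <- c(0.01, 0.02, 0.03, 0.20)
p.adjust(p, method = "BH")
# -> 0.04 0.04 0.04 0.20  (three ties from four distinct p-values)
```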

I wondered if there is anything wrong with my model design, and how to interpret these two outputs.

Cannot tell: you don't show reproducible code here, and as Gordon Smyth says, the code as shown should error because the coefficient is not in the design.

Summarizing my comments above, I suspect a combination of improper code and model use, normalization, and dataset handling.


Sorry it took so long to come back to this. Let me refine my question:

  1. The abundance table is 2313 viral ORFs by 306 samples, with a sparsity of 95%, after 3% prevalence filtering.
  2. This is a longitudinal cohort, and the metadata contains variables such as patient_ID, disease_status, timepoints (which I set as a continuous variable), gender, country, feeding_type, deliver_mode.
  3. The main question I'm interested in is: are there ORFs that differ significantly between control and disease (disease_status) over timepoints?
  4. I fit a limma model like this:

     linear.model.design <- model.matrix(~ disease_status * timepoints + Country + Sex + feeding_first_year + Delivery.Mode, metadata)
     linear.model.fit <- voomLmFit(orf.abundance.table, linear.model.design, block = metadata$patientID, plot = FALSE)
     linear.model.fit <- eBayes(linear.model.fit)
     sig.model.res <- topTable(linear.model.fit, coef = "disease_statusCONTROL:timepoints", number = Inf) %>% filter(adj.P.Val < 0.05)
  5. sig.model.res gives 2009 significant ORFs, which is over 86% of the total number of ORFs in this cohort.
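One sanity check before interpreting the interaction test (object names as in the post): confirm that the coefficient passed to topTable exactly matches a column of the design, since the interaction column's name depends on which factor level is the reference.

```r
# List the design's coefficient names; topTable's coef must match one exactly.
colnames(linear.model.design)
# Which level is the reference (and hence which interaction column exists)
# depends on the factor's level ordering:
levels(metadata$disease_status)
```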

I'm not sure how to proceed next. Should I trust this high rate of significance? How do I tell whether these ORFs are true positives or false positives caused by the sparsity of my count table? If they are false positives, what can I do to find the truly biologically significant ORFs in my data?

Thanks!!


It is not clear to me how many samples (columns) there are in contig.abundance.table, as the model is fitting 6 coefficients...

"keep those contigs that exist in more than 3% of the samples" may imply that there are a lot of zeros in the table. What is the overall percentage (or count) of zeros after the filtering?
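That percentage can be checked directly (object name taken from the original post):

```r
# Overall fraction of zero entries remaining after prevalence filtering
mean(contig.abundance.table == 0)
# Or as a raw count:
sum(contig.abundance.table == 0)
```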

