Question

limma: model over-specified with an intercept, yet excluding the intercept gives uninterpretable results


8.0 years ago

chrisclarkson100 ▴ 150

I have microarray data for which the columns represent the disease condition status of each individual.

I plan to fit a linear model with limma and then set up contrasts between the columns. My question is: how do I make the most informative comparison when contrasting each column against all of the other disease statuses?

Other <- ifelse(Information_file$Diagnostic.category == "Other diagnoses", 1, 0)

TB <- ifelse(Information_file$Diagnostic.category == "TB", 1, 0)

As you can see above, I am trying to test for a significant difference between TB and all other diagnoses. I need to eliminate any confounding so that I can see whether there are differentially expressed genes that are genuinely due to TB and not just to one of the other diagnoses.

The first attempt that I made at this was using the following design matrix:

head(design)

     TB Intercept Other Sex Age   
[1,]  1         1     0   1 155  
[2,]  1         1     0   2  16   
[3,]  1         1     0   1  22  
[4,]  1         1     0   1 114   
[5,]  1         1     0   2  56  
[6,]  1         1     0   2  47

And the following code produced a nice plot.

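library(limma)

# Fit the linear model, then extract the TB coefficient as the contrast
fit <- lmFit(E.ncRNA1, design)
contrast.matrix <- makeContrasts(TB, levels = design)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)  # moderated t-statistics
volcanoplot(fit, highlight = 10, main = "TB vs Everything")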

[Image: volcano plot, "TB vs Everything"]

However, my supervisor says that this is not statistically robust because (1) including an intercept over-specifies the model, and (2) he'd like to know if there's a non-zero difference between the TB and Other columns, i.e. specifying 1 for the TB covariate and -1 for the Other (OD) covariate.

Hence, let's say my new design matrix is as follows:

head(design)

     TB Other Sex Age
[1,]  1     0   1 155
[2,]  1     0   2  16
[3,]  1     0   1  22
[4,]  1     0   1 114
[5,]  0    -1   2  56
[6,]  0    -1   2  47

And yet excluding the Intercept gives rise to plots that are completely uninformative.

So overall my question is: how do I make the most valuable comparison to test for a non-zero difference between TB and other diseases without including an intercept?

limma • 3.1k views

ADD COMMENT • link 8.0 years ago by chrisclarkson100 ▴ 150


Sorry, I couldn't tell from your description, but are there any rows corresponding to individuals who had neither 'TB' nor 'Other diagnoses' as their Diagnostic.category? From your first design matrix it looks like the intercept column is the sum of the TB and the Other columns, so your design matrix isn't of full rank (which is a different problem from the model overfitting). In my experience limma aborts in that setting.
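
To make the rank point concrete, here is a minimal base R sketch. The toy matrices and their values are invented for illustration and are not the poster's data. It compares the rank of a design with the number of columns, first when every sample is either TB or Other, then when some samples are neither.

```r
# Toy design mirroring the first matrix in the question (values are made up).
TB        <- c(1, 1, 1, 0, 0, 0)
Other     <- c(0, 0, 0, 1, 1, 1)
Intercept <- rep(1, 6)
design_toy <- cbind(TB, Intercept, Other)

# Every row is TB or Other, so Intercept = TB + Other and one column is redundant.
design_rank <- qr(design_toy)$rank
full_rank <- design_rank == ncol(design_toy)
print(c(rank = design_rank, columns = ncol(design_toy)))

# Add two samples that are neither TB nor Other: the intercept now carries
# its own information and the design becomes full rank.
design_three <- rbind(design_toy, c(0, 1, 0), c(0, 1, 0))
rank_three <- qr(design_three)$rank
print(c(rank = rank_three, columns = ncol(design_three)))

# If limma is installed, nonEstimable() names any redundant coefficient(s).
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::nonEstimable(design_toy))
}
```

Running the same check on the real design before calling lmFit would show whether the intercept is redundant for the actual samples.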

ADD REPLY • link 8.0 years ago by russhh 5.7k


Hi there @russhh, the Intercept column just provides a null model, as it is all 1's. I was told to do it this way, but I am new to linear algebra; I am looking up the definition of matrix rank on Wikipedia. If you know of any good tutorials on using design matrices, could you send them my way? Also, yes, there are some individuals who are neither TB nor Other. Thanks.

ADD REPLY • link 8.0 years ago by chrisclarkson100 ▴ 150


Thanks. It sounds to me like your boss thinks there are only 'TB' and 'Other' in the experiment. Double check whether they mean for you to include 'anything other than TB' as your 'Other' category: in this setting, there's no need to specify an intercept.

If, however, you are to keep 'TB', 'Other' and all the other rows, you MUST put in an intercept term so that the baseline expression for individuals who are neither 'TB' nor 'Other' is fitted appropriately.

The final possibility is for those rows that are neither 'TB' nor 'Other diagnoses' to be dropped prior to your analysis. In this setting you can use your second design (although it's a bit odd to use -1 as an indicator; it would make more sense to use 1).
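
As a rough sketch of two of these options (keeping all groups, or dropping the rows that are neither), the designs could be built from a single diagnosis factor rather than hand-made indicator columns. Everything below is hypothetical: the `info` data frame, the "Healthy" label standing in for the samples that are neither TB nor Other, and all values are invented. The TB vs Other difference is then expressed as a contrast (1 for TB, -1 for Other) rather than by putting -1 into the design matrix.

```r
# Hypothetical sample sheet: "Healthy" stands in for samples that are neither
# TB nor "Other diagnoses"; all values are invented for illustration.
info <- data.frame(
  Diagnostic.category = c("TB", "TB", "Other diagnoses", "Other diagnoses",
                          "Healthy", "Healthy"),
  Sex = factor(c("M", "F", "M", "F", "M", "F")),
  Age = c(155, 16, 22, 114, 56, 47)
)

# One factor for diagnosis instead of separate 0/1 (or -1) columns.
group <- factor(info$Diagnostic.category,
                levels = c("Healthy", "Other diagnoses", "TB"),
                labels = c("Healthy", "Other", "TB"))

# Keep all three groups: "~ 0 + group" fits one mean per group, no intercept.
design_all <- model.matrix(~ 0 + group + Sex + Age, data = info)
colnames(design_all) <- sub("^group", "", colnames(design_all))

# TB minus Other, written as a contrast over the columns of design_all.
tb_vs_other <- setNames(rep(0, ncol(design_all)), colnames(design_all))
tb_vs_other[c("TB", "Other")] <- c(1, -1)
print(tb_vs_other)

# Alternative: drop the samples that are neither TB nor Other before fitting.
keep <- group %in% c("TB", "Other")
group_two <- droplevels(group[keep])
info_two <- info[keep, ]
design_two <- model.matrix(~ 0 + group_two + Sex + Age, data = info_two)

# If limma is installed, makeContrasts() builds the same contrast by name.
if (requireNamespace("limma", quietly = TRUE)) {
  print(limma::makeContrasts(TBvsOther = TB - Other, levels = design_all))
}
```

With a real expression matrix, the contrast would be passed to contrasts.fit after lmFit, as in the code in the question.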

Those 155 year old TB patients are doing pretty well though

ADD REPLY • link 8.0 years ago by russhh 5.7k


Haha, yeah, that's measured in months, but I'll predict their life expectancies later for fun ;)

ADD REPLY • link 8.0 years ago by chrisclarkson100 ▴ 150


I would imagine that assigning a value of -1 to the coefficients that might then be used in a subtraction in a contrast would lead to issues...

BTW, I expect the results will end up largely the same with/without the intercept.
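
A small illustration of why the two parameterisations can agree, using plain lm on simulated values for a single gene (not limma, and not the poster's data): when the groups are coded as a factor, the TB coefficient from a model with an intercept is the same quantity as the TB minus Other contrast from a model without one.

```r
set.seed(1)

# Simulated expression for one gene in two groups (values are invented).
group <- factor(rep(c("Other", "TB"), each = 5), levels = c("Other", "TB"))
expr  <- rnorm(10, mean = ifelse(group == "TB", 8, 7))

# With an intercept: the groupTB coefficient is already TB minus Other.
fit_int <- lm(expr ~ group)
diff_int <- unname(coef(fit_int)["groupTB"])

# Without an intercept: one mean per group, difference taken as a contrast.
fit_means <- lm(expr ~ 0 + group)
diff_means <- unname(coef(fit_means)["groupTB"] - coef(fit_means)["groupOther"])

print(c(with_intercept = diff_int, without_intercept = diff_means))
```

The two estimates match up to floating-point error because the models are reparameterisations of each other; this does not hold for the second design in the question, where -1 is used as an indicator inside the design matrix.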

ADD REPLY • link 8.0 years ago by Devon Ryan 104k