Question: Multivariable linear regression using Limma package; Model Matrix design
3
2.2 years ago by
erfan74130
erfan74130 wrote:

I have been struggling with this for a while. I appreciate if someone can help:

I am running a linear model using limma package on a set of micro-array probes and a disease phenotype adjusting for a confounding factor. Ideally I want my model to be like this:

Disease phenotype (continuous) ~ Micro-array probes + confounding factor

But in Limma package, There is no way to have the probe matrix in the middle. It's always the response variable. Whatever I put in my model matrix will act as the independent variable as it is seen here, which is not what I want:

lmFit(probe_matrix, design=model.matrix(~Disease phenotype+confounding factor))

Is there a way around this?

modified 2.2 years ago • written 2.2 years ago by erfan74130
1
2.2 years ago by
erfan74130
erfan74130 wrote:

I think I figured it out myself. Apparently this is the only way:

pvals<-sapply(1: dim(probe_matrix[1], function(x) eBayes(lmFit(Disease_phenotype, design=model.matrix(~t(probe_matrix[x,])+t(confounding factor))))\$p.value)

To elaborate on this; don't do this.

Do instead '0 + mainEffect1 + covariate1)

1
2.2 years ago by
Steven Lakin1.4k
Fort Collins, CO, USA
Steven Lakin1.4k wrote:

Programs like Limma force the gene expression values to be the response variable because that is the correct way to model it:

lmFit(probe_matrix, design = model.matrix( ~ 0 + Disease phenotype + covariate )

This will calculate regression coefficients for disease phenotype and the covariate as a function of the data. You'll get an F-statistic for whether or not your coefficient on disease phenotype is significant. You can pull out the adjusted p-value for whether or not the levels of the disease phenotype factor are different by using contrasts. The limma code examples should be able to walk you through that, something to this effect:

# Not tested
fit <- lmFit(probe_matrix, design = model.matrix( ~ 0 + Disease phenotype + covariate )
fit <- eBayes(fit)
topTable(fit, coef=2)

contrast_matrix <- makeContrasts(contrasts=c("Disease - Control"), levels=c("Disease", "Control"))
fit2 <- contrast.fit(fit, contrast_matrix)
topTable(eBayes(fit2), coef=2)

If you're having trouble understanding why the above works the way it does, you should look into how linear regression is calculated more closely first and then come back to the limma code.

In my opinion a~b+c is a different model from b~a+c and whatever values you get will also be different. I am not sure why you say the correct way is to have array data as response all the time. You should have the flexibility to treat them as response or independent depending on your hypothesis or study design. Could you clarify me on this? is there anything I m missing here? I know the basics of regression etc.

I think I understand what you're trying to do now. Apologies; it is the first time I've seen someone request this in bioinformatics, as it is somewhat strange to model this way. What you will have to do is construct your model matrix like so (this assumes your probe matrix is, by native format, in an M probes by N samples orientation):

model <- cbind(t(probe_matrix), covariate_vector)

Note that this results in a matrix where the dimensions are N samples (rows) by M probes + 1 (columns), where the + 1 is the column vector for whichever covariate you want to include. This will be a huge model matrix if your probes are numerous. You'll then need to call lmFit using the disease phenotype vector as the predicted variable matrix:

lmFit(disease_phenotype_vector, design = model)

I'm not even sure this will work, but it will be interesting for you to try and see. If it doesn't work in limma, I'd recommend using simple lm(), because with the correct dimensions, that will solve correctly for the coefficient vector. What you'll end up with are M + 1 coefficients that describe the importance of every probe as well as the covariate.

I'd have to see the output, but I would imagine that the results from this will be difficult to interpret at best and incorrect at worst due to overfitting. Let us know what you find!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Steven Lakin1.4k

Well this doesn't work. This is like you have one linear model in which all the probes are the co-variables (a million covariables in one model!). Due to co-linearity the model produces NA values and will stop working. But good thinking anyways!!

Right, which is what we product based on standard logic. Let me know if you're interested in the standard experimental designs.

not sure what you mean by standard experimental design. I already had done the analysis by making a loop in r for all the probes. I was interested in limma because it s faster and it moderates standard errors using a bayesain approach. But you gave me a hint. I can use the above code for one probe in the model and make a loop to repeat this analysis for all the probes. That will work, although would be slow again!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by erfan74130
1

Standard regression analysis for count data is based on modeling expression values as a function of explanatory variables that are part of the experimental design:

y ~ 0 + effect1 + effect2 + (1 | effect 3)

For example. Typically, we want to explain the expression values as a function of some systematic effects that we believe drive the biology. What I believed you wanted was to model the disease phenotype as a function of probe values, which, as you stated, produces a model matrix of a million covariates. That is literally the difference between a ~ b +c and b ~ a + c. If you don't grasp that difference, then I suggest going back to the books in terms of regression modeling, since this is basic 101 stuff; we explain some kind of variable (typically the data we measure based on sequencing) as a function of our experimental design.

To elaborate on what most people do with these experimental designs: ~ 0 + DiseasePhenotype+ covariate + (1 | randomCovariate)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Steven Lakin1.4k