Question: Differential expression using microarray data and limma, where am I going wrong?
0
3.9 years ago by
Tom30
Tom30 wrote:

Hi all,

I have a matrix; on the x axis is samples of say 100 humans of different ages, on the y axis is genes, and the cells in the matrix are log transformed microarray gene expression values.

For example:

``````              Age20      Age21       Age30      Age40 ........-> Age90
Gene1          4.3        -2.1        1.2        4.5............. -1.2
Gene2          1.1        1.0         -1.9       2.3.............  3.1
Gene3          1.2        -2.3        3.4        0.4.............  2.3
Gene4         -1.2        -0.2        1.2        1.2.............  1.2
Gene5         -0.9         0.2        0.3        -0.4............  1.4
``````

So basically, I have a matrix prepared; all I want to do is put this into limma, and get back a set of genes whose expression changes with age; so that I can compare between this method and another method I used.

I wrote this code:

``````library(limma)
fit <- lmFit(table) # What should the design be?
fit <- eBayes(fit)
options(digits=3)
writefile = topTable(fit,n=Inf,sort="none", p.value=0.01)
write.csv(writefile, file="file.csv")
``````

The problem: The code runs, but the file is empty. Is it because I didn't add a design parameter to my model? If so, what should the design be? I can find lots of stuff online, but a lot of it does not start from having a gene expression matrix, or involves adding in extra file info. Is it not possible to simply read in a table like above, and tell me which genes' expression are not remaining constant with age?

modified 3.9 years ago • written 3.9 years ago by Tom30
2

The design will be governed by your experiment: what are the sources of inter-sample variation? If age is the only one, then you could set up a linear or quadratic design that models the change in gene expression with age:

``````age <- c(20, 21, 30, 40, ..., 90)

design <- model.matrix(~ age)
``````

However, it's very likely, in a dataset of this size, that there are some batch effects that may need to be accounted for (were all the individuals recruited / samples extracted / arrayed at the same time)

Thank you. I'm just doing a basic analysis now, and then will get into batch effects. Regarding this method, I have a few data sets for different species to check, some are as described as above, but then some, instead of each individual being a different age, there are groups of ages (e.g. I have 5 samples @6 months, 5 samples at 16 months, 5 samples @ 24 months).

This is exact data for another similar data set:

``````Gene    6_month 6_month 6_month 6_month 6_month 16_month    16_month    16_month    16_month    16_month    24_month    24_month    24_month    24_month    24_month
11287   20.09   20.1    20.61   20.35   22.56   20.20   20.14   19.99   19.96   20.10   20.3693325428   20.14   20.08 20.219    20.24
11303   20.15   19.95   19.98   19.94   19.31   20.22   19.98   20.23   20.13   20.07   20.03   20.05   20.04   20.04   19.75
``````

So here I want to know, what genes are differentially expressed with age.

``````library(limma)
age <-c("Gene",6,6,6,6,6,16,16,16,16,16,24,24,24,24,24)
fit <-lmFit(table,design)
Error in lmFit(table, design) :
row dimension of design doesn't match column dimension of data object
dim(table)
6710   16
``````

So as you can see, I think it's saying that the number of items in the "age" list is not the same as the number of columns in the data, but it is? As a test, I transposed the data (table = t(table)), but I see the same error? Also, in the age vector, is it ok to label the first item as "Gene"? (i.e. this is just the gene name, I don't want to include it in maths calculations, even though each gene is a number).

1

The best thing you could do is try to get your data into an ExpressionSet structure (if you carry on along your current path, you'll end up reverse engineering the ExpressionSet)

That 'Gene' column shouldn't be included in your matrix of expression data, but you could (and should) use it to define the rownames of the expression data.

I said you should use model.matrix to define your design, assuming you had some experience with the limma workflow. You should be encoding the sample-specific information in the design matrix, so no: 'Gene' shouldn't be present in the vector `age` - age should be a numerical representation of the ages of your samples.

1

The file is likely empty because you specified `p.value=0.01`, which means that any results with p>=0.01 will not be included. This means you probably don't have any significant changes at that level. At any rate, consider using `write.fit()`, and also, I would follow @russhh's advice on how to include age in your model.

Very interesting thread. I am new to differential gene expression and would like to understand more about it. In this case, what is the meaning of the p-value of each gene generated by the limma model? Thanks!