In a previous question, I asked about how to do "background subtraction" for ChIP-Seq data that has matched input samples. Most of the answers there asserted that my idea of pre-processing my count data by subtracting an appropriately-scaled version of the control (i.e. ChIP-seq input) data was the wrong approach. Rather, I should provide the raw IP counts along with the input counts to edgeR and use its generalized linear modelling features to separate signal from background in a statistically rigorous way. I tend to agree with this idea, but I don't really know how to do it, because once I try to factor in the input-vs-IP contrast along with the rest of my experimental design, things get so complicated that I don't know how to construct the model matrix that describes my experimental design.
So, here is my sample table (with specific identifiers replaced by "Donor1", "Donor2", etc.):
sampledata <- data.frame(donor=rep(rep(c("Donor1", "Donor2", "Donor3", "Donor4"), each=2), 2), celltype=rep(c("memory", "naive"), 8), sampletype=rep(c("IP", "input"), each=8)) rownames(sampledata) <- do.call(sprintf,c("%s-%s-%s", as.list(sampledata)))
The table looks like this:
> print(sampledata) donor celltype sampletype Donor1-memory-IP Donor1 memory IP Donor1-naive-IP Donor1 naive IP Donor2-memory-IP Donor2 memory IP Donor2-naive-IP Donor2 naive IP Donor3-memory-IP Donor3 memory IP Donor3-naive-IP Donor3 naive IP Donor4-memory-IP Donor4 memory IP Donor4-naive-IP Donor4 naive IP Donor1-memory-input Donor1 memory input Donor1-naive-input Donor1 naive input Donor2-memory-input Donor2 memory input Donor2-naive-input Donor2 naive input Donor3-memory-input Donor3 memory input Donor3-naive-input Donor3 naive input Donor4-memory-input Donor4 memory input Donor4-naive-input Donor4 naive input
There are 4 individuals (donors). Two cell types were isolated from each donor, and each sample was ChIPed, and both the IP and the input were sequenced. The result is 4 donor-matched pairs of memory/naive IP samples (rows 1 through 8), plus matched input samples for everything (rows 9-16). I want to find features with significant differences in the non-background between naive and memory in the IP samples, after controlling for inter-donor variation in the IP samples as well as any variation at all among the input samples.
Answers from the other question recommended that I should model the input samples as "Background" and the IP samples as "Background + signal", then allow edgeR to separate the signal from the background and do differential tests between the signals. This all makes perfect sense. However, as I said, I don't know how to construct the model matrix that describes such a setup.
So my question is this. What is the code for generating the appropriate model matrix for my experimental design? Typically the code would appear something like the following:
design <- model.matrix(???, data=sampledata)
??? is the formula that I don't know how to write. Alternatively, you could provide code that directly constructs the model matrix.
I realize that I'm just asking for the answer here to my specific question, but I'm hoping that the answer will be sufficiently illustrative that anyone who wants to use edgeR for ChIP-Seq will benefit from it. If I can get this working, I would like to submit my code to the edgeR developers for them to include as a case study to demonstrate use of edgeR on ChiP-Seq data.
More generally, is there a decent guide to writing complex formulas for the GLM functionality in edgeR and other packages? It seems like the documentation for edgeR and others assumes that you already know how to write such formulas, and simply explains how to plug them into the package using trivial examples with no more than two factors.