Modeling Chip-Seq Background Using Edger'S Glm Functionality?
2
1
Entering edit mode
9.9 years ago
Ryan Thompson ★ 3.6k

In a previous question, I asked about how to do "background subtraction" for ChIP-Seq data that has matched input samples. Most of the answers there asserted that my idea of pre-processing my count data by subtracting an appropriately-scaled version of the control (i.e. ChIP-seq input) data was the wrong approach. Rather, I should provide the raw IP counts along with the input counts to edgeR and use its generalized linear modelling features to separate signal from background in a statistically rigorous way. I tend to agree with this idea, but I don't really know how to do it, because once I try to factor in the input-vs-IP contrast along with the rest of my experimental design, things get so complicated that I don't know how to construct the model matrix that describes my experimental design.

So, here is my sample table (with specific identifiers replaced by "Donor1", "Donor2", etc.):

sampledata <- data.frame(donor=rep(rep(c("Donor1", "Donor2", "Donor3", "Donor4"), each=2), 2),
celltype=rep(c("memory", "naive"), 8),
sampletype=rep(c("IP", "input"), each=8))
rownames(sampledata) <- do.call(sprintf,c("%s-%s-%s", as.list(sampledata)))


The table looks like this:

> print(sampledata)
donor celltype sampletype
Donor1-memory-IP    Donor1   memory         IP
Donor1-naive-IP     Donor1    naive         IP
Donor2-memory-IP    Donor2   memory         IP
Donor2-naive-IP     Donor2    naive         IP
Donor3-memory-IP    Donor3   memory         IP
Donor3-naive-IP     Donor3    naive         IP
Donor4-memory-IP    Donor4   memory         IP
Donor4-naive-IP     Donor4    naive         IP
Donor1-memory-input Donor1   memory      input
Donor1-naive-input  Donor1    naive      input
Donor2-memory-input Donor2   memory      input
Donor2-naive-input  Donor2    naive      input
Donor3-memory-input Donor3   memory      input
Donor3-naive-input  Donor3    naive      input
Donor4-memory-input Donor4   memory      input
Donor4-naive-input  Donor4    naive      input


There are 4 individuals (donors). Two cell types were isolated from each donor, and each sample was ChIPed, and both the IP and the input were sequenced. The result is 4 donor-matched pairs of memory/naive IP samples (rows 1 through 8), plus matched input samples for everything (rows 9-16). I want to find features with significant differences in the non-background between naive and memory in the IP samples, after controlling for inter-donor variation in the IP samples as well as any variation at all among the input samples.

Answers from the other question recommended that I should model the input samples as "Background" and the IP samples as "Background + signal", then allow edgeR to separate the signal from the background and do differential tests between the signals. This all makes perfect sense. However, as I said, I don't know how to construct the model matrix that describes such a setup.

So my question is this. What is the code for generating the appropriate model matrix for my experimental design? Typically the code would appear something like the following:

design <- model.matrix(???, data=sampledata)


where ??? is the formula that I don't know how to write. Alternatively, you could provide code that directly constructs the model matrix.

I realize that I'm just asking for the answer here to my specific question, but I'm hoping that the answer will be sufficiently illustrative that anyone who wants to use edgeR for ChIP-Seq will benefit from it. If I can get this working, I would like to submit my code to the edgeR developers for them to include as a case study to demonstrate use of edgeR on ChiP-Seq data.

More generally, is there a decent guide to writing complex formulas for the GLM functionality in edgeR and other packages? It seems like the documentation for edgeR and others assumes that you already know how to write such formulas, and simply explains how to plug them into the package using trivial examples with no more than two factors.

chip-seq edger • 4.3k views
3
Entering edit mode
9.9 years ago
seidel 9.5k

"It seems like the documentation for edgeR and others assumes that you already know how to write such formulas..."

You can say that again. I find it confusing as well, but am usually able to figure it out. Gordon Smyth is usually very good at answering questions about design matrices on the bioconductor mailing list, and if you search around you'll find several examples to puzzle through. But I've never read a general guide written for non-statisticians.

I think this is what you're after. I re-arranged your matrix because it's easier for me to think about the resulting design matrix.

sampledata <- sampledata[order(sampledata$celltype),] print(sampledata) donor celltype sampletype Donor1-memory-IP Donor1 memory IP Donor2-memory-IP Donor2 memory IP Donor3-memory-IP Donor3 memory IP Donor4-memory-IP Donor4 memory IP Donor1-memory-input Donor1 memory input Donor2-memory-input Donor2 memory input Donor3-memory-input Donor3 memory input Donor4-memory-input Donor4 memory input Donor1-naive-IP Donor1 naive IP Donor2-naive-IP Donor2 naive IP Donor3-naive-IP Donor3 naive IP Donor4-naive-IP Donor4 naive IP Donor1-naive-input Donor1 naive input Donor2-naive-input Donor2 naive input Donor3-naive-input Donor3 naive input Donor4-naive-input Donor4 naive input # create factors celltype <- factor(sampledata$celltype)
sampletype <- factor(sampledata\$sampletype)

# create the design matrix
design <- model.matrix(~celltype+celltype:sampletype)


If you examine the matrix you can see what's going on: "(Intercept)" "celltypenaive" "celltypememory:sampletypeIP" "celltypenaive:sampletypeIP". Your main interest is the interaction between celltype and sample type. The first coefficient is the ratio between cell types (not interesting), but the second and third coefficients are the IP/input ratios for memory and naive cells respectively. So get the values for those coefficients.

# assuming y is your DGEList object
# all norm factored and dispersioned up
fit <- glmFit(y,design)

# get the main coefficients of interest
memory <- glmLRT(y,fit,coef="celltypememory:sampletypeIP")
naive <- glmLRT(y,fit,coef="celltypenaive:sampletypeIP")

# take the contrast between the two to find the differences
mem_nai_contrast <- glmLRT(y,fit,contrast=c(0,0,1,-1))


This is probably the simplest approach because it treats your donors as simply 4 biological replicates, and ignores that fact that they're each matched between IP and input. But it might get you on the map in terms of at least looking at the data.

1
Entering edit mode

Ok, so I tried this approach and I found that I got more significant p-values and FDR values when I left the input samples out of the analysis entirely and just compared IP vs IP directly. In fact, when using the input samples in my analysis, no promoters were significant at a 0.05 FDR. I wonder if my design was wrong somehow, or if there's something inherently wrong with the "difference-of-differences" approach. It doesn't seem plausible that all the significant promoters in the no-input analysis are the result of changes in the input rather than changes in the IP. I'm not really sure what to make of this.

0
Entering edit mode

That's interesting. Have you looked at the results in a genome browser? Specifically, if you create RPM tracks of the IPs and inputs, and just compare them by eye, do the results make sense? You might also try some sanity checks via Marson et al. & Young, Cell 2008 (pmid: 18692474, see the supplement) where you evaluate the enrichment level of your promoters via 25 bp bins, and then compare the enrichments between cell types to see if it's all just scatter, how it might compare to other chosen locations, or if the differences you're seeing make sense. It's a lot of coding in R, but you could get a good picture of the situation.

0
Entering edit mode

Note that you can skip the "create factors" section of your code and instead to the following for creating the matrix:

design <- model.matrix(~celltype+celltype:sampletype, data=sampledata)

0
Entering edit mode

I wonder if it is sufficient to just add +donor to the formula that you give in order to get paired tests?

0
Entering edit mode

9 years later...has this here ever come to a working solution? I wonder how, despite all the modeling approaches, you address the compositional differences between IP and input and therefore a decent (e.g. TMM) normalization.

1
Entering edit mode
9.9 years ago
Neilfws 49k

To answer your last, more general query: the limma user guide (PDF) is an excellent resource. Obviously its main focus is use of the limma package for microarray analysis, but it includes very good discussions of linear models in general (section 7) and model matrix design (section 8), with case studies (section 11).

0
Entering edit mode

+1, chapters 7 and 8 seem quite informative.