Question

Chipseq Background Subtraction For Count-Based Differential Binding Methods?

3

Entering edit mode

11.7 years ago

Ryan Thompson ★ 3.6k

I am experimenting with using edgeR on some ChIP-Seq data to discover regions that are differentially bound between experimental conditions. For now, my regions of interest are annotated gene starts from UCSC +/- 1000 bp, though later I will use a peak caller on the data to define regions of interest. The input to edgeR is a matrix of read counts, with each column being a different sample and each row being a different genomic feature. These counts should be totally un-normalized.

I want to do background subtraction on these counts using the ChIP-Seq input samples that I have, and then feed the background-subtracted counts to edgeR (replacing negatives with zeros). However, naturally the library sizes differ between each ChIP and its corresponding input sample, so in order to do subtraction I would have to normalize one to the other. The most straightforward way I can think of is to normalize each input to have the same total number of mapped reads as the corresponding ChIP and then subtract the input counts from the ChIP counts for each feature. However, I suspect that in this case I would effectively be overstating the precision of the counts that I provide to edgeR, since the subtracted counts are subject to the variation in both the ChIP and the control samples. Furthermore, the counts are not really un-normalized any longer, since they are the subtraction of a normalized control count from a raw ChIP count. Is it still appropriate to feed these count data into edgeR to search for differential binding? Is there a better way to do background subtraction in the context of differential binding when using tools like edgeR that take raw counts as input?

chip-seq edger • 8.1k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 11.7 years ago by Ryan Thompson ★ 3.6k

1

Entering edit mode

EdgeR is designed to deal with transcriptome data, for ChiP-seq there are other Bioconductor packages, from the website:

ChIP-seq and related (e.g., motif discovery, identification of high-coverage segments) activities are facilitated by packages such as CSAR, chipseq, ChIPseqR, ChIPsim, ChIPpeakAnno, DiffBind, iSeq, rGADEM, segmentSeq, BayesPeak, PICS.

ADD REPLY • link 11.7 years ago by JC 13k

1

Entering edit mode

This isn't true; it's generally for count data, which can arise in multiple experimental designs. To quote from their paper: "edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, in proteomics experiments where spectral counts are used to summarize the peptide abundance, or in barcoding experiments where several species are counted."

ADD REPLY • link 11.7 years ago by matted 7.8k

1

Entering edit mode

you're right, they mentioned it in the paper and the vignette, but they only provide examples for transcriptome data

ADD REPLY • link 11.7 years ago by JC 13k

score 3 · Answer 1 · 2012-08-21

3

Entering edit mode

11.7 years ago

seidel 11k

When considering EdgeR or DESeq for ChIP experiments, I don't think you should subtract background. Rather I think of the background population as simply another population for comparison to the IP to answer the following question: which loci in the IP population are different than the input (background) population? Thus you would simply be comparing your start-site windows between IP and input. If you're using edgeR then you could perform a contrast between these comparisons to compare different IPs. The normalization for library sizes would all be taken care of by the regular edgeR pipeline. In other words - treat your count data as if it's counts on genes or counts on start windows (edgeR doesn't know the difference). If it were genes you'd be comparing condition 1 vs condition 2. If it were start windows you'd be comparing IP and input. Another way to proceed would be to ignore the background and simply compare the two IPs, at least in experiments that use the same input. I haven't tried edgeR or DESeq for ChIP Seq (yet), but that's how I understood it.

ADD COMMENT • link 11.7 years ago by seidel 11k

0

Entering edit mode

Would it be possible for you to provide example code for building the model matrix that I would give to edgeR to specify this model? I understand it in theory, but I'm not sure how to properly write an R formula to describe this model.

ADD REPLY • link 11.7 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

This is what I suggested in my answer. Here's an example for making a design matrix:

condition <- factor(c(1,1,2,2,3,3))
batch <- factor(c("IP","control","IP","control","IP","control"))
design <- model.matrix(~condition+batch)
# use it with edgeR:
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design, method="deviance", robust=TRUE, subset=NULL)
fit <- glmFit(y, design)

ADD REPLY • link 11.7 years ago by matted 7.8k

0

Entering edit mode

If I understand correctly, that code should give a list of features that are significantly differentially regulated between control and IP consistently across conditions? In the context of ChIP-seq, this would effectively be a list of features that are bound by the immunoprecipitated protein, correct? If the above is correct, then I don't see how one gets a list of features that are differentially regulated between different conditions.

ADD REPLY • link 11.7 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

You can test the necessity of various coefficients in the model, or the significance of the difference between them. It all depends on what you pass to glmLRT when you do the testing in the fitted model. So you could set up the correct contrast to test if e.g. 1 vs. 2 is a significant difference. Check out section 2.6 of the edgeR user's guide.

ADD REPLY • link 11.7 years ago by matted 7.8k

score 2 · Answer 2 · 2012-08-21

2

Entering edit mode

11.7 years ago

matted 7.8k

This talk of "background subtraction" of count data and truncating subtracted counts to 0 scares me. A better approach is to embrace the count data as counts and think about and use discrete models that work with this data. For example, you might think about Poisson models where your inference task is deciding whether means are the same or different between your conditions (this is a special case of the models that tools like edgeR use).

As I understand your question, you want to detect differences between several experimental conditions, where each condition has a (possibly varying) control. We can think of the counts as arising from this schematic model (I'm ignoring library normalization and replicates here, but edgeR won't):

Control_1_counts ~ Poisson(Condition1_mean)
ChIP_1_counts ~ Poisson(Condition1_mean + Condition1_effect)
Control_2_counts ~ Poisson(Condition2_mean)
ChIPl_2_counts ~ Poisson(Condition2_mean + Condition2_effect)

So you want to test whether Condition1_effect is significantly different from Condition2_effect, which is now valid since the difference in the controls is accounted for in the model.

edgeR (and other tools) provides precisely this functionality, where you can give it a design matrix and test for specific contrasts (whether pairs of coefficients in the model are different). See section 2.6 in the edgeR user's guide, "More complex experiments (glm functionality)" for details.

Overall, my main point is that it's better to specify a model that follows your experimental design and allows you to fit the observed data in its native form (raw counts), without ad-hoc transformations that may break the statistical assumptions of later steps.

ADD COMMENT • link 11.7 years ago by matted 7.8k

0

Entering edit mode

Well, it scares me too. :) That's why I'm asking here if there is a better way. I have experimented a little bit with the glm functionality of edgeR, but I have only copied examples, and I'm not well-versed in it by any means. I'm happy to hear that the glm functionality should be able to handle this.

ADD REPLY • link 11.7 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

I think this sounds reasonable, but I am not sure how to construct the design matrix. It seems that in all the docs that I've read on glm functionality in edgeR and similar packages, the documentation assumes that I already know how to create the model matrices for my experimental design, and so it just explains how to use the design matrix, and the example it provides for creating a design matrix is so trivial that it doesn't help me with my real-world design. Is there a guide somewhere for writing the formulae that one passes to model.matrix? I can come up with a formulat that I think is the right one, and I can run edgeR with it, and it will give me some numbers at the end, but I still don't know if the model matrix that I used really matches my experimental design, because I'm just guessing at the appropriate formula.

So is there a guide somewhere on translating my experimental design into a formula or model matrix? If there is, I haven't seen it.

If you're feeling inclined to just give me the answer for my particular case, here's a suitably scrubbed version of my sample information:

sampledata <- data.frame(donor=rep(rep(c("Donor1", "Donor2", "Donor3", "Donor4"), each=2), 2), celltype=rep(c("memory", "naive"), 8), sampletype=rep(c("IP", "input"), each=8))

There are 4 individuals (donors). Two cell types were isolated from each donor, and each sample was ChIPed, and both the IP and the input were sequenced. The result is 4 donor-matched pairs of memory/naive IP samples (rows 1 through 8), plus matched input samples for everything (rows 9-16). I want to find features with significant differences between naive and memory in the IP samples, after controlling for inter-donor variation in the IP samples as well as any variation at all among the input samples.

My best guess for the model matrix is this:

design <- model.matrix(~sampletype*donor+celltype, data=sampledata)

but it's just a guess and I don't really know how to look at the resulting matrix of zeroes and ones and tell whether it correctly describes my experimental design. Also, based on your answer above I manually constructed the following matrix, which I think matches the "background vs. background+signal" concept:

cbind(memorybg=sampledata$celltype=="memory",
      memorysignal=sampledata$celltype == "memory" & sampledata$sampletype=="IP",
      naivebg=sampledata$celltype=="naive",
      naivesignal=sampledata$celltype == "naive" & sampledata$sampletype=="IP") + 0

but I don't think this manually-constructed matrix takes into account the fact that the samples are matched pairs of cell types from the same donors, and I'm not sure how I would add this. And I still don't really know what the matrix means.

Maybe I should start a new question for this?

ADD REPLY • link 11.7 years ago by Ryan Thompson ★ 3.6k