Question

ChIP-Seq: Calling peaks with replicates

12

10.2 years ago

dariober 15k

In short: Is there a ChIP-Seq peak caller that accounts for replicates? If not, what is your recommended way to use replicates?

A bit more verbose: It seems to me that all the available peak callers call peaks in a single pull-down experiment (see here for some popular programs). They do this using a wide variety of methods and levels of sophistication.

However, if you have (and you should have...) replicates of the same experiment, it remains unclear how to make the best use of the variability between replicates. Unless I've missed it, there is no peak caller designed for that. Very often I see a peak called in one replicate which is missed in another replicate, even if a "bump" is definitely there.

In my opinion/experience, the options available to combine replicates are:

Irreproducible discovery rate (IDR). I've heard some skepticism about it. And does it work for more than two replicates?
Call peaks on individual replicates and use some sort of heuristic to define the final set (e.g. peaks in n out of m replicates and/or combine p-values from different replicates). In the final set, one would like to have for each peak the position, an estimate of significance, enrichment, etc. How to get this information is not obvious.
Just combine the individual input files and call peaks on that. This is the easiest option, but you obviously throw away the information on sample-to-sample variability.
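
As an illustration of the second option, here is a minimal Python sketch of the two heuristics mentioned: an "n out of m" vote and combining per-replicate p-values with Fisher's method. The function names are made up for this example, and it assumes you already have one p-value per replicate for the same region and that those p-values are independent. How to assign a p-value to a replicate where no peak was called is left open.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one region using Fisher's method."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must be in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic X = -2 * sum(ln p) is chi-square distributed with
    # 2k degrees of freedom under the null. We keep X/2 for convenience.
    half_stat = -sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no SciPy is needed.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def passes_n_of_m(pvalues, alpha, n):
    """True if at least n of the replicates are significant at level alpha."""
    return sum(p <= alpha for p in pvalues) >= n

# Hypothetical p-values for one region in three replicates.
region_pvalues = [0.001, 0.02, 0.2]
print(fisher_combined_pvalue(region_pvalues))
print(passes_n_of_m(region_pvalues, alpha=0.05, n=2))
```

The vote is easy to explain but discards effect size. The combined p-value keeps some of it, at the cost of the independence assumption.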

Any thoughts/ideas?

Thanks!

ChIP-Seq • 15k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by dariober 15k

1

The IDR pipeline from Anshul will work on 2+ replicates; you would just have to do a round-robin type of comparison (A vs B, A vs C, B vs C) at one of the steps.
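
A small Python sketch of how the round-robin pairs could be enumerated. It only lists the comparisons. The replicate labels are placeholders, and the IDR run for each pair is not shown.

```python
from itertools import combinations

def round_robin_pairs(replicates):
    """Return every unordered pair of replicates, in input order."""
    return list(combinations(replicates, 2))

# Placeholder labels; in practice these would be per-replicate peak files.
for a, b in round_robin_pairs(["A", "B", "C"]):
    print(f"{a} vs {b}")
```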

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Ying W ★ 4.3k

Ram · Answer 1 · 2014-09-16

6

10.2 years ago

Istvan Albert 101k

I prefer calling peaks on each replicate separately, then using those that are common to all replicates to create a "gold standard" of the most reliable calls. Use these to find motifs or other secondary sources of information that could help refine the remaining calls (distance to TSS, functional annotations of downstream genes, etc.).

Then I'd move to rescuing peaks with one missing call among replicates. For example, the signal is small and under the statistical significance threshold for one replicate, but otherwise the site conforms to the characteristics that the gold standard peaks have. I would count these as well. Of course it is somewhat subjective, but then you always have a reliable base that drives the process.
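
A rough Python sketch of the gold-standard-plus-rescue idea, not the actual code used for this answer. It assumes peaks are (chrom, start, end) tuples with half-open coordinates and that you supply a list of candidate regions to classify (for example, the merged peaks from all replicates). The names `classify_peaks`, `rescue_peaks` and `resembles_gold` are invented for illustration, and the rescue criterion itself (motif present, distance to TSS, ...) is left to a user-supplied function because it depends on the factor.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def classify_peaks(candidates, replicates):
    """Split candidates into 'gold' (called in every replicate) and
    'rescue' (missing a call in exactly one replicate)."""
    gold, rescue = [], []
    m = len(replicates)
    for region in candidates:
        # Count the replicates that have at least one overlapping peak.
        n_called = sum(
            any(overlaps(region, peak) for peak in rep) for rep in replicates
        )
        if n_called == m:
            gold.append(region)
        elif n_called == m - 1:
            rescue.append(region)
    return gold, rescue

def rescue_peaks(rescue, resembles_gold):
    """Keep rescue candidates accepted by a user-supplied predicate."""
    return [region for region in rescue if resembles_gold(region)]

# Toy data: three replicates and three candidate regions.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250)]
rep3 = [("chr1", 120, 180), ("chr1", 550, 650)]
candidates = [("chr1", 100, 250), ("chr1", 500, 650), ("chr1", 900, 1000)]

gold, rescue = classify_peaks(candidates, [rep1, rep2, rep3])
# Placeholder criterion: accept everything. Replace with a real check.
rescued = rescue_peaks(rescue, resembles_gold=lambda region: True)
print(gold, rescued)
```

The overlap test here is a naive all-against-all scan, which is fine for a toy example; for real peak sets an interval tree or a tool such as bedtools would be the more practical choice.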

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Istvan Albert 101k

0

I like the rescue idea. Do you have (or know of) example code to do this in general? Or is it ad-hoc depending on what the ChIPed factor is?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Ryan Dale 5.0k

3

there may be tools for this but we do it via handcrafted, homegrown organic code aged in oak barrels

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Istvan Albert 101k

0

Sounds delicious! I'll have to craft some of my own.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Ryan Dale 5.0k

0

Hi, I also have a question about using peaks from pooled replicates versus individual replicates. I have been reading many blogs to work out which standard to choose, and I found your comment above very logical and relevant. My question is about filtering the common peaks: should common peaks be defined based on the peak summit or on the peak coordinates? Which makes more sense in your view?

ADD REPLY • link 7.8 years ago by #### ▴ 220

Ram · Answer 2 · 2014-09-23

This is a very good discussion.

I will add that the reproducibility of peaks depends on sequencing depth: sometimes the binding is there but the peaks are not detected because there aren't enough reads covering them. So you might get something like a 30% overlap in peaks between replicates for a broad histone mark but 75% for a narrow histone mark at the same read depth. It simply takes more reads to call the marks in the broad case because the marks cover more of the genome.

Underpowered experiments just don't replicate very well.

These guys did some analyses to figure out if you are deep enough and they have some nice discussions of this:

Systematic evaluation of factors influencing ChIP-seq fidelity

So if your main problem is that you don't have the depth to call peaks, I would combine the samples to get the biggest set of peaks and then look in those regions for variability in coverage between replicates. There will be a balance between the variability caused by sequencing depth (Poisson noise) and biological noise. Which one dominates will depend on sequencing depth relative to the amount of genome that the mark covers.
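
One crude way to look at that balance is sketched below in Python, assuming you already have read counts per replicate for each pooled peak and that the replicates were sequenced to similar depth (otherwise the counts would need scaling first, which this sketch does not do). Under pure Poisson sampling the variance of the counts is about equal to their mean, so a variance-to-mean ratio well above 1 hints at extra, possibly biological, variability. The function names and the counts are invented for illustration.

```python
from statistics import mean, median, variance

def dispersion_index(counts):
    """Sample variance divided by mean for one region's replicate counts."""
    m = mean(counts)
    if m == 0:
        return float("nan")  # no reads in any replicate, ratio undefined
    return variance(counts) / m

def median_dispersion(count_table):
    """Median dispersion index over all regions with at least one read."""
    values = [dispersion_index(c) for c in count_table if sum(c) > 0]
    return median(values)

# Invented counts: rows are pooled peaks, columns are replicates.
count_table = [
    [10, 12, 8],     # close to Poisson-like scatter
    [50, 100, 150],  # much more variable than Poisson would predict
    [20, 25, 15],
]
for counts in count_table:
    print(counts, dispersion_index(counts))
print(median_dispersion(count_table))
```

With only two or three replicates the per-peak ratio is very noisy, so it is probably more informative to look at its distribution across many peaks than at any single value.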

Ram · Answer 3 · 2014-09-16

4

10.2 years ago

Ryan Dale 5.0k

IDR is very clunky, but the final results do seem quite robust. I've only used it on 2 replicates.

I've had good results with PePr (paper, github), which is one of the few (only?) peak-callers I've seen built from the ground up with replicates in mind. The differential peak calling mode works quite well, too.

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Ryan Dale 5.0k

0

I did find IDR quite clunky indeed! I'm going to play with PePr and as @matted pointed out, MultiGPS. Thanks for sharing your experience!

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by dariober 15k

Ram · Answer 4 · 2014-09-16

2

10.2 years ago

matted 7.8k

Another peak caller that's designed to handle multiple experiments (and replicates) is MultiGPS. It tries to address some of the concerns that you have, namely that throwing all the reads together loses information and may do the wrong thing if certain nuisance parameters change between runs (e.g. fragmentation distribution or number of reads). It can handle multiple conditions and replicates with arbitrary designs that you can choose. Full disclosure: I'm on the paper...

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by matted 7.8k

0

Very interesting! Thanks for pointing it out, I did miss MultiGPS!

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by dariober 15k

0

Thanks, I missed this too. I'll try it out.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.2 years ago by Ryan Dale 5.0k

0

I missed MultiGPS too. From a quick read through the paper, am I correct in understanding that the underlying model is specifically geared toward replicates of punctate or narrow peaks, as opposed to broad peaks?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by kate • 0

0

Yes, that's right. For multi-condition analysis of broad peaks you could check out

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by matted 7.8k