Question

ChIP-Seq merging peak files

3

7.3 years ago

tkygyn ▴ 30

Hi all,

I am very new to all things ChIP-seq. We have performed multiple experiments, and I now have to analyze multiple files. I was told to essentially merge the replicates and use the mean of the distance for each gene.

Up to this point I agreed, but while I understand merging the replicates, I was also told to merge the broad and narrow peak files.

From everything I've been reading, this sounds like a terrible idea, but I'm the new person. If I'm correct (there is always the chance that I'm completely wrong), what arguments would be best, and what could I use as a reference to support this position?

Thank you

ChIP-Seq • 9.2k views

4

7.3 years ago

mforde84 ★ 1.4k

Hi!

You can merge peaks from distinct biological replicates, though as Michele points out, there's no real follow-up analysis you can do after doing so. After that, you can only really describe the data.

I'd suggest looking closely at the ENCODE TF pipeline (see links below). We are interested in the reproducibility of peaks (whether narrow or broad), and I assume that's close to what you're being asked to do. We generate a pool of the biological replicates, then create pseudoreplicates from the pool and from each individual biological replicate. We then call peaks and perform an IDR analysis across all of these conditions.

This is the official ENCODE writeup page: https://sites.google.com/site/anshulkundaje/projects/idr

This is my GitHub repo for a working pipeline deployed on an AWS-like cloud environment: https://github.com/mforde84/ENCODE_TF_ChIP_pipeline
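To make the pooling and pseudoreplicate step concrete, here is a minimal Python sketch of the idea. It is not code from the ENCODE pipeline or from the repo above: `make_pseudoreplicates` is a hypothetical helper, and the reads are toy `(chromosome, start)` tuples rather than real BAM or tagAlign records.

```python
import random

def make_pseudoreplicates(reads, seed=0):
    """Shuffle reads and split them into two halves of (nearly) equal size."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy stand-ins for aligned reads from two biological replicates.
rep1 = [("chr1", pos) for pos in range(100, 110)]
rep2 = [("chr1", pos) for pos in range(500, 512)]

# Pool the biological replicates, then split the pool at random.
pooled = rep1 + rep2
pooled_pr1, pooled_pr2 = make_pseudoreplicates(pooled)

# Self-pseudoreplicates: split each biological replicate on its own.
rep1_pr1, rep1_pr2 = make_pseudoreplicates(rep1)
rep2_pr1, rep2_pr2 = make_pseudoreplicates(rep2)
```

Peaks would then be called on each of these sets separately, and IDR run on the pairs (rep1 vs rep2, the two pooled pseudoreplicates, and each replicate's own pseudoreplicates).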

Have fun, ChIP is a lot of fun to work with!

M

1

7.3 years ago

apa@stowers ▴ 600

I'm not sure why you would be merging "narrow" and "broad" peak calls. Generally you call one type of peak or the other, and only one of these is appropriate, depending on what you ChIPped: for instance, a transcription factor (narrow) or a histone mark (broad). The exception might be an enzyme such as Pol2, which can have mixed behaviors that may require multiple peak-calling strategies.

I also definitely recommend using IDR, and I use it all the time. BUT: be sure to use version 2. Version 1 had serious bugs!

score 6 · Accepted Answer · 2017-01-05

First, are these A) technical or B) biological replicates? That is, is it the same biological sample run several times with the same antibody (and the same lot, if polyclonal) and protocol, or different biological samples run the same way with the same protocol?

If it is A, it may be reasonable to merge them for some analyses, such as just annotating peaks. I would merge the BAM alignment files and then call peaks on the merged file, rather than merging the peak calls.

However, first you have to analyze your replicates to check that they all perform the same. We did a lot of performance comparisons here: https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-016-0100-6
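As a rough sketch of "merge the alignments, then call", the Python below only builds the two command lines. The answer names no tools, so `samtools merge` and `macs2 callpeak` are my assumptions, as are the input/control BAM, the human genome size (`-g hs`), and the hypothetical helper `merge_then_call_commands`.

```python
def merge_then_call_commands(replicate_bams, control_bam, name, broad=False):
    """Build (not run) commands to merge replicate BAMs and call peaks once."""
    merged_bam = f"{name}.merged.bam"
    # samtools merge <out.bam> <in1.bam> <in2.bam> ...
    merge_cmd = ["samtools", "merge", merged_bam, *replicate_bams]
    # A single peak call on the merged alignments (assumes human, -g hs).
    call_cmd = [
        "macs2", "callpeak",
        "-t", merged_bam,
        "-c", control_bam,
        "-f", "BAM",
        "-g", "hs",
        "-n", name,
    ]
    if broad:
        call_cmd.append("--broad")  # pick narrow OR broad for a given target
    return merge_cmd, call_cmd

merge_cmd, call_cmd = merge_then_call_commands(
    ["rep1.bam", "rep2.bam"], "input.bam", "my_tf"
)
print(" ".join(merge_cmd))
print(" ".join(call_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the file names match your data.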

You can steal those ideas, especially using the ENCODE segmentation tracks if it's human and they have tracks for something like your cell type. But just counting the reads in bins and then doing a correlation is pretty informative.
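The bin-and-correlate check can be sketched in a few lines of plain Python. The read start positions, chromosome length, and bin size below are made-up toy values, and `bin_counts` and `pearson` are illustrative helpers, not functions from the paper.

```python
from math import sqrt

def bin_counts(read_starts, chrom_length, bin_size):
    """Count reads per fixed-width bin along one chromosome."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start in read_starts:
        counts[start // bin_size] += 1
    return counts

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy read start positions for three replicates on a 6 kb "chromosome".
rep_a = [10, 20, 1500, 1600, 1700, 5200]
rep_b = [15, 1550, 1650, 1750, 5100, 5300]
rep_c = [3100, 3200, 3300, 4100, 4200, 4900]  # the odd one out

counts = {name: bin_counts(reads, 6000, 1000)
          for name, reads in [("a", rep_a), ("b", rep_b), ("c", rep_c)]}
r_ab = pearson(counts["a"], counts["b"])
r_ac = pearson(counts["a"], counts["c"])
print(f"a vs b: {r_ab:.3f}   a vs c: {r_ac:.3f}")
```

A replicate that correlates poorly with the others, like `rep_c` here, is the one to look at before merging anything. On real data you would get the counts from the BAM files and use much larger bins across the whole genome.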

But even in our data, and we used a robot and do it a lot, one of our technical replicates behaved strangely. See supplemental figure S6.

If it is B, biological replicates, you almost certainly don't want to merge them. You will lose the information about how much biological variance is present. If you are looking at something like differential peaks between conditions, DESeq and really all reputable programs will want some sort of replicates, almost always biological. In general, if you want to compute a p value on anything, you need separate (not merged) replicates.

If you are just annotating peaks you don't need a p value.
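A toy illustration of why merging throws away the variance, using made-up read counts for two peaks across three biological replicates (`summarize` is a hypothetical helper, and this is not what DESeq does internally):

```python
import statistics

def summarize(counts):
    """Merged total, mean, and sample variance of per-replicate counts."""
    return {
        "merged": sum(counts),
        "mean": statistics.mean(counts),
        "variance": statistics.variance(counts),  # needs at least 2 replicates
    }

# Made-up read counts under two peaks in three biological replicates.
replicate_counts = {
    "peak_1": [48, 52, 50],  # consistent across replicates
    "peak_2": [10, 90, 50],  # wildly variable across replicates
}

summaries = {peak: summarize(c) for peak, c in replicate_counts.items()}
for peak, s in summaries.items():
    print(peak, s)

# After merging there is one number per peak, so variance cannot be estimated.
try:
    statistics.variance([summaries["peak_1"]["merged"]])
    merged_has_variance = True
except statistics.StatisticsError:
    merged_has_variance = False
```

Both peaks have the same merged count, so after merging they look identical, yet only the separate replicates show that one of them is reproducible and the other is not.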