Question

Paired-end ChIP-Seq Peak Calling

9.9 years ago

simon.pearce ▴ 20

Hi,

I have some ChIP-Seq data for transcription factor binding in Arabidopsis thaliana (the model plant). The data is paired-end, with two replicates and a control (total input). I have trimmed and aligned the data and now have sorted, indexed BAM files (or BED files). Reads are 100 bp each, with average DNA fragment sizes of 300-450 bp depending on the sample.

Viewing the reads in IGV, I can see some regions (for two genes that we think are targets of the TF) that are highly enriched across the whole gene (rather than just the promoter region), as well as various bits of noise where both the ChIP and the Input control have large peaks.

When I try using MACS, I get a huge list of peaks that includes those two genes. But when I look at the other "peaks" in IGV, the plots are almost exactly the same shape in the ChIP and the Input. They are sometimes different sizes (presumably due to read count), but on visual inspection they look almost identical. My call to MACS is something like:

macs -t TF_3ul_P_sorted.bam -c TF_Input_P_sorted.bam -f BAM -g 111755668 -n TF_3ul -B -s 100 -S --bw=350

I've been looking for peak-calling algorithms that are designed for paired-end reads, and I'm struggling. Many of the options only accept paired-end data in ELAND format, whatever that is, or I can't manage to install them. I'm using a Windows 7 machine with a VirtualBox VM running Ubuntu. My Linux skills are fairly basic, which causes problems when installing some of the tools I find. Others only work on human/mouse data, not Arabidopsis, which makes them useless to me.

Can anyone suggest a peak-calling algorithm that takes paired-end data and successfully removes peaks that are the same shape in the Input control sample?

Thanks!

ChIP-Seq • 6.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.9 years ago by simon.pearce ▴ 20

What do you mean by shape? Enrichment relates to the coverage, not the shape. If the coverage in the ChIP is substantially higher than in the Input, then the peaks are valid.

ADD REPLY • link 9.9 years ago by Istvan Albert 100k

The coverage in these peaks is often lower in the ChIP sample than the Input, prior to total read normalisation at least. The image that I've (hopefully) attached in this comment shows an example of such a 'peak', with the ChIP sample as the first row and the Input control as the second row. The scales go up to 28,985 and 26,471 respectively (so slightly higher in the ChIP).
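To make the "prior to total read normalisation" point concrete, here is a minimal Python sketch of how the two tracks could be compared after scaling for sequencing depth. The library totals are made-up placeholders (I haven't given the real ones here), and the two scale maxima from above are used as stand-ins for read counts purely for illustration.

```python
def scaled_fold_enrichment(chip_count, input_count, chip_total, input_total,
                           pseudocount=1.0):
    """Fold enrichment of ChIP over Input after scaling Input to ChIP depth.

    chip_total and input_total are the total mapped reads per library.
    The pseudocount avoids division by zero in regions with no Input reads.
    """
    # Scale the Input count to what it would be at the ChIP sequencing depth.
    scaled_input = input_count * chip_total / input_total
    return (chip_count + pseudocount) / (scaled_input + pseudocount)


# Hypothetical library sizes; only the two region values come from the post.
fold = scaled_fold_enrichment(
    chip_count=28985, input_count=26471,
    chip_total=20_000_000, input_total=15_000_000,
)
print(f"fold enrichment after depth scaling: {fold:.2f}")
```

With a deeper ChIP library than Input library, as assumed here, a region like this one comes out below 1, i.e. not enriched.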

ADD REPLY • link updated 2.6 years ago by Ram 43k • written 9.9 years ago by simon.pearce ▴ 20

Well, the data for this region looks identical. This is no longer an issue of peak detection: there is no differential coverage over this area, so no peaks should be called here.

If you think that there should be differential enrichment, then it might be sample mislabeling or some other error.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Istvan Albert 100k

I agree with you, this looks identical and is therefore not a peak. My problem is that features such as this are being called as peaks.

ADD REPLY • link 9.9 years ago by simon.pearce ▴ 20

Ram · Answer 1 · 2014-05-29

9.9 years ago

Chris Fields ★ 2.2k

MACS v2 and SiPES might be your best bets (it looks as if you are running MACS v1). You may want to look at recent commits for MACS v2 to get a more recent version, as releases aren't tagged. Are you expecting these to be broad or punctate marks? MACS v2 is supposed to handle both. Also, what are the FDR values you get back from the original run?
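For paired-end BAM files, MACS v2 has a `BAMPE` input format that uses the real fragment spans from the read pairs instead of building a shifting model. A minimal sketch of the equivalent call is below, written as a small Python helper that only assembles the command. It assumes `macs2` is on your PATH; the file names and genome size are copied from the question, and the q-value cutoff of 0.05 is just a starting point, not a recommendation.

```python
import shlex


def build_macs2_command(treatment_bam, control_bam, name, genome_size,
                        qvalue=0.05):
    """Assemble a `macs2 callpeak` argument list for paired-end BAM input."""
    return [
        "macs2", "callpeak",
        "-t", treatment_bam,      # ChIP sample
        "-c", control_bam,        # Input control
        "-f", "BAMPE",            # paired-end BAM: use observed fragment sizes
        "-g", str(genome_size),   # effective genome size
        "-n", name,               # prefix for output files
        "-B",                     # also write bedGraph pileup tracks
        "-q", str(qvalue),        # FDR (q-value) cutoff
    ]


cmd = build_macs2_command(
    "TF_3ul_P_sorted.bam", "TF_Input_P_sorted.bam", "TF_3ul", 111755668
)
print(shlex.join(cmd))
# To actually run it (requires MACS v2 to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```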

You could also try downsampling so that the input is always higher. The read enrichment in the input is troubling but that is seen in human samples as well (can be open chromatin or regions of known copy number variation). Do they have a 'blacklist' set for Arabidopsis?

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 9.9 years ago by Chris Fields ★ 2.2k

I don't know of a blacklist set for Arabidopsis, but I thought that human had these kinds of areas with enriched data. What I may end up doing is masking anything that is highly enriched in the Input before doing the peak calling.
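A rough sketch of what I have in mind is below. Note that it is a simpler variant: it drops called peaks that overlap a home-made mask after peak calling, rather than masking the reads beforehand. The coordinates and the `remove_masked_peaks` helper are hypothetical; the mask would come from regions with very high Input coverage.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_masked_peaks(peaks, mask):
    """Return the peaks that do not overlap any masked region.

    peaks and mask are lists of (chrom, start, end) tuples, BED-style.
    """
    # Group masked regions by chromosome so each peak is only compared
    # against regions on its own chromosome.
    mask_by_chrom = {}
    for chrom, start, end in mask:
        mask_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in peaks:
        regions = mask_by_chrom.get(chrom, [])
        if not any(overlaps(start, end, m_start, m_end)
                   for m_start, m_end in regions):
            kept.append((chrom, start, end))
    return kept


# Made-up example: one high-Input region on Chr1 removes the first peak.
peaks = [("Chr1", 1000, 1400), ("Chr1", 5000, 5300), ("Chr2", 200, 600)]
mask = [("Chr1", 1200, 2000)]
kept = remove_masked_peaks(peaks, mask)
print(kept)
```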

I'm expecting mostly sharp peaks for promoter binding, although there are a couple of genes that are extremely enriched for reads in the sample (10,000+) across the entire gene. I don't know whether that is usual, but they are genes that we were fairly sure this TF regulates from prior knowledge.

I had tried to install MACS v2.0.10 but was getting an error message that I couldn't work out how to fix immediately. Just fixed it now and will see if that gives me any better results after the weekend.

ADD REPLY • link 9.9 years ago by simon.pearce ▴ 20

Ram · Answer 2 · 2014-08-03

You may want to try GEM. It specifically models the shapes (i.e. the spatial distribution of the reads) of the binding events. It uses a binomial test of IP vs. Control data for statistical significance testing, and it filters events based on their shape (whether an event is dissimilar to the expected ChIP-seq shape). In addition, it is Java software, so no installation is required beyond a Java runtime. It can also take paired-end BAM files as input.
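To illustrate the idea of a binomial test of IP vs. Control, here is a minimal Python sketch. It is not GEM's actual implementation, just the general form of the test: given the reads in a region from both samples, it asks how likely it is to see at least this many IP reads if IP and Control were equally covered. The `p_ip` parameter is an assumption: 0.5 corresponds to equal library sizes, and otherwise it would be set to the IP share of total reads.

```python
from math import comb


def binomial_enrichment_pvalue(ip_count, control_count, p_ip=0.5):
    """One-sided p-value P(X >= ip_count), X ~ Binomial(ip + control, p_ip).

    Exact summation, so only intended for small illustrative counts.
    """
    n = ip_count + control_count
    return sum(
        comb(n, k) * p_ip ** k * (1 - p_ip) ** (n - k)
        for k in range(ip_count, n + 1)
    )


# Equal coverage in IP and Control: no evidence of enrichment.
p_equal = binomial_enrichment_pvalue(20, 20)
# Three times as many reads in IP as in Control: small p-value.
p_enriched = binomial_enrichment_pvalue(30, 10)
print(f"equal coverage: p = {p_equal:.3f}")
print(f"3x IP coverage: p = {p_enriched:.4f}")
```

A region where the IP and Control pileups look the same, like the one discussed in the question, would fall into the first case and not pass a significance cutoff.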