Question

Why are there duplicate peaks in ENCODE ATAC-seq output?

0

Entering edit mode

5.1 years ago

sswanson ▴ 20

I've installed and run the current ENCODE-DCC ATAC-seq pipeline from GitHub. I have 2 replicates of ATAC-seq data. When I run the pipeline, regardless of which (blacklist-filtered) peak file I look at (rep1, rep2, or pooled), about 6% of the peak loci are repeated (but with different scores). Most are repeated two or three times, but the worst-case has 7 entries!

Here's an example:

chr1    629086  630068  Peak_1  1000    .   5.22734 7335.73975  7327.66357  727
chr1    629086  630068  Peak_53 1000    .   1.69177 307.40005   301.90186   79
chr1    629086  630068  Peak_6  1000    .   3.27104 2558.00244  2551.49731  291

Does anyone have any idea what's going on here? & what I should do with these "extra" peaks?

atac-seq encode • 2.7k views

ADD COMMENT • link updated 5.1 years ago by ATpoint 89k • written 5.1 years ago by sswanson ▴ 20

0

Entering edit mode

We never figured this out - we just merged the peaks in the end, we don't use the scores after an initial filter anyway.

ADD REPLY • link 5.1 years ago by i.sudbery 22k

0

Entering edit mode

open an issue at the github page of the ATAC-seq pipeline

ADD REPLY • link 5.1 years ago by Friederike 9.0k

score 3 · Accepted Answer · 2020-10-01

3

Entering edit mode

5.1 years ago

ATpoint 89k

Could this be related to the call-summits option of macs2? Same intervals but different summits ($10)?

See https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/src/encode_task_macs2_atac.py#L64, that option seems to be activated.

From the manual of macs2:

--call-summits

MACS will now reanalyze the shape of signal profile (p or q-score depending on the cutoff setting) to deconvolve subpeaks within each peak called from the general procedure. It's highly recommended to detect adjacent binding events. While used, the output subpeaks of a big peak region will have the same peak boundaries, and different scores and peak summit positions.

You can simply merge the intervals as Ian suggests given that you do not care about the summit statistics which (from what I know) most people don't care about. If you are open to alternatives I recommend the Genrich peak caller, gives good results in my hand, but more cumbersome to use because you have to sort reads by name. It can accept replicates though which is nice. https://github.com/jsh58/Genrich

ADD COMMENT • link 5.1 years ago by ATpoint 89k

2

Entering edit mode

Found this long-ago version of my question, specifically referencing MACS2 rather than ATAC-seq:

Why does MACS2 report multiple records for the same peak region?

ADD REPLY • link 5.1 years ago by sswanson ▴ 20

0

Entering edit mode

uhhh, that's a very hot lead, indeed!

ADD REPLY • link 5.1 years ago by Friederike 9.0k

0

Entering edit mode

Based on a private communication with a colleague, I believe that, "You are correct, sir!"

At the moment, I'm just trying to run the default Encode pipeline correctly, before I start messing about with custom peak-callers, read-trimmers, and aligners.

Thanks!

ADD REPLY • link 5.1 years ago by sswanson ▴ 20