Question: Why are there duplicate peaks in ENCODE ATAC-seq output?
0
gravatar for sswanson
25 days ago by
sswanson20
sswanson20 wrote:

I've installed and run the current ENCODE-DCC ATAC-seq pipeline from GitHub. I have 2 replicates of ATAC-seq data. When I run the pipeline, regardless of which (blacklist-filtered) peak file I look at (rep1, rep2, or pooled), about 6% of the peak loci are repeated (but with different scores). Most are repeated two or three times, but the worst-case has 7 entries!

Here's an example:

chr1    629086  630068  Peak_1  1000    .   5.22734 7335.73975  7327.66357  727
chr1    629086  630068  Peak_53 1000    .   1.69177 307.40005   301.90186   79
chr1    629086  630068  Peak_6  1000    .   3.27104 2558.00244  2551.49731  291

Does anyone have any idea what's going on here? & what I should do with these "extra" peaks?

atac-seq encode • 134 views
ADD COMMENTlink modified 25 days ago by ATpoint40k • written 25 days ago by sswanson20

We never figured this out - we just merged the peaks in the end, we don't use the scores after an initial filter anyway.

ADD REPLYlink written 25 days ago by i.sudbery9.4k

open an issue at the github page of the ATAC-seq pipeline

ADD REPLYlink written 25 days ago by Friederike6.4k
3
gravatar for ATpoint
25 days ago by
ATpoint40k
Germany
ATpoint40k wrote:

Could this be related to the call-summits option of macs2? Same intervals but different summits ($10)?

See https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/src/encode_task_macs2_atac.py#L64, that option seems to be activated.

From the manual of macs2:

--call-summits

MACS will now reanalyze the shape of signal profile (p or q-score depending on the cutoff setting) to deconvolve subpeaks within each peak called from the general procedure. It's highly recommended to detect adjacent binding events. While used, the output subpeaks of a big peak region will have the same peak boundaries, and different scores and peak summit positions.

You can simply merge the intervals as Ian suggests given that you do not care about the summit statistics which (from what I know) most people don't care about. If you are open to alternatives I recommend the Genrich peak caller, gives good results in my hand, but more cumbersome to use because you have to sort reads by name. It can accept replicates though which is nice. https://github.com/jsh58/Genrich

ADD COMMENTlink modified 25 days ago • written 25 days ago by ATpoint40k
2

Found this long-ago version of my question, specifically referencing MACS2 rather than ATAC-seq:

Why does MACS2 report multiple records for the same peak region?

ADD REPLYlink written 24 days ago by sswanson20

uhhh, that's a very hot lead, indeed!

ADD REPLYlink written 25 days ago by Friederike6.4k

Based on a private communication with a colleague, I believe that, "You are correct, sir!"

At the moment, I'm just trying to run the default Encode pipeline correctly, before I start messing about with custom peak-callers, read-trimmers, and aligners.

Thanks!

ADD REPLYlink written 24 days ago by sswanson20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1252 users visited in the last hour