Question

ENCODE ATAC-seq pipeline peak calling

1

8.2 years ago

igor 13k

I am looking at the ENCODE ATAC-seq pipeline: https://www.encodeproject.org/pipelines/ENCPL035XIO/

They have two different steps:

"call nuclease accessible regions using FSeq" (in PDF) or "open chromatin region identification" (on diagram)
"call nuclease accessible peaks using Homer" (in PDF) or "peak calling" (on diagram)

Regardless of the tool used, what is the difference between "regions" and "peaks"? I would think those are the same thing (in this context, a set of loci where the reads accumulate).

atac-seq • 8.6k views

ADD COMMENT • link updated 7.0 years ago by Simply Bioinformatics ▴ 200 • written 8.2 years ago by igor 13k

0

As I understand it, F-Seq generates the signal file (for the UCSC Genome Browser) and HOMER does the peak calling (e.g. for differential peak analysis).

ADD REPLY • link 8.2 years ago by GouthamAtla 12k

0

By signal file, do you mean a wiggle file? If it's just that, how is it different from a generic bigWig generated from a BAM file?

ADD REPLY • link 8.2 years ago by igor 13k

1

It's not just normalised counts at each base.

From F-Seq website:

To intuitively summarize and display individual sequence data as an accurate and interpretable signal, we developed F-Seq, a software package that generates a continuous tag sequence density estimation allowing identification of biologically meaningful sites whose output can be displayed directly in the UCSC Genome Browser

As I said before, this is only what I understand.
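
To make the distinction concrete, here is a minimal sketch contrasting raw per-base counts with a smoothed density estimate. It assumes a Gaussian kernel and uses made-up tag positions and a made-up bandwidth; the function names are hypothetical and this is not F-Seq's actual code or parameters, just an illustration of the general idea of a continuous density signal.

```python
import math

def per_base_counts(tag_positions, length):
    """Raw signal: number of tags starting at each base (what a plain coverage track shows)."""
    counts = [0] * length
    for pos in tag_positions:
        if 0 <= pos < length:
            counts[pos] += 1
    return counts

def gaussian_density(tag_positions, length, bandwidth=50.0):
    """Smoothed signal: every tag contributes a Gaussian bump centred on its position."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    density = [0.0] * length
    for pos in tag_positions:
        for x in range(length):
            z = (x - pos) / bandwidth
            density[x] += norm * math.exp(-0.5 * z * z)
    return density

# Hypothetical toy data: three tags clustered together and one isolated tag.
tags = [100, 105, 110, 300]
counts = per_base_counts(tags, 400)
density = gaussian_density(tags, 400, bandwidth=20.0)
```

With raw counts, base 107 scores zero even though it sits in the middle of a cluster of tags. The density estimate is non-zero there and is higher around the cluster than around the isolated tag, which is the kind of continuous signal that can then be thresholded into regions.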

ADD REPLY • link 8.2 years ago by GouthamAtla 12k

0

Do you know what this output actually looks like?

ADD REPLY • link 8.2 years ago by igor 13k

0

7.0 years ago

Simply Bioinformatics ▴ 200

This pipeline is now deprecated and has been replaced by this one:

https://github.com/kundajelab/atac_dnase_pipelines

ADD COMMENT • link 7.0 years ago by Simply Bioinformatics ▴ 200

score 6 · Accepted Answer · 2016-09-12

6

8.1 years ago

igor 13k

I received a very helpful clarification after emailing ENCODE directly:

Nuclease accessible regions tend to be long, e.g. 10 kb or longer. This was clear even in the early papers on DNase sensitivity (mid-to-late 1970's; Groudine and Weintraub). These accessible regions can contain entire genes or even clusters of genes. Within the nuclease accessible regions, some localized DNA segments are so readily cleaved that double-strand breaks are generated at that position in a substantial fraction of the cells in the population. These are the DNase-hypersensitive sites (DHSs) first mapped by Carl Wu (late 1970's). I see the Fseq "regions" as the equivalent of nuclease accessible regions, and the Homer "peaks" as the equivalent of DHSs.

If you look at the signal track for DNase-seq or ATAC-seq, you see broad regions of signal that are significantly above the background. Within those regions, you see localized peaks, often many peaks per region. Fseq calls the broad regions, and we use Homer to call the localized peaks. MACS can be used for peak calling as well; Anshul Kundaje is doing that. You can see similar analyses in the work from John Stamatoyannopoulos for DNase-seq. I think Hotspots are like regions, and DHSs are peaks confined to a defined length.
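
The "many peaks per region" relationship can be checked on real output by nesting the two call sets. The following is a rough sketch, assuming both call sets have already been parsed into BED-style `(chrom, start, end)` tuples; the function name and the coordinates are invented for illustration and do not come from the ENCODE pipeline.

```python
def peaks_per_region(regions, peaks):
    """Map each broad region to the narrow peaks that lie entirely inside it.

    Intervals are (chrom, start, end) tuples with half-open coordinates.
    """
    nested = {region: [] for region in regions}
    for region in regions:
        r_chrom, r_start, r_end = region
        for peak in peaks:
            p_chrom, p_start, p_end = peak
            if p_chrom == r_chrom and r_start <= p_start and p_end <= r_end:
                nested[region].append(peak)
    return nested

# Hypothetical broad regions (~10 kb) and localized peaks (a few hundred bp).
regions = [("chr1", 10000, 25000), ("chr1", 40000, 52000)]
peaks = [
    ("chr1", 11000, 11300),
    ("chr1", 18000, 18250),
    ("chr1", 41000, 41200),
    ("chr1", 60000, 60200),  # falls outside every region
]
nested = peaks_per_region(regions, peaks)
```

The nested loop is quadratic and only meant for small examples; for genome-wide call sets an interval tool such as bedtools would be the more practical choice.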

ADD COMMENT • link 8.1 years ago by igor 13k

0

It might be worth looking into the Danpos2 suite and/or iNPS for peak calling of DNase-seq data, if what I'm reading here makes sense. Those peak callers are designed for MNase-seq data, but it seems they may apply in this case.

ADD REPLY • link 8.1 years ago by Sinji ★ 3.2k

0

I've never worked with MNase, but shouldn't all those peaks be ~150bp (size of a single nucleosome)?

ADD REPLY • link 8.1 years ago by igor 13k

0

Yes, and, coincidentally, we now realise that this size (~154bp I believe) also corresponds roughly to the mean fragment length of circulating free DNA (cfDNA) in blood plasma. In fact, recent research indicates that we can analyse nucleosome positioning in cfDNA and infer its tissue of origin. This has utility in identifying, for example, the tissue of origin of circulating tumour DNA fragments, and thus which organ may be showing early signs of cancer.