Understanding ATAC-seq data
4.2 years ago

Hi guys,

I've started dealing with ATAC-seq data and I have a couple of questions that hopefully some of you have pondered over, too.

My understanding is as follows:

• Tn5 will cut wherever there is no nucleosome, so regions of generally open chromatin yield smallish fragments of about 40 bp to ca. 100 bp -- I assume the exact sizes are simply a function of the presence of certain motifs that the transposase prefers?
• fragments of 150-300 bp represent events where the transposase managed to cut on both sides of a single nucleosome (of course, this can go on, as the transposase may also cut out two nucleosomes, resulting in fragments of 2x150 bp + x bp of linker DNA)

All publications that I've seen so far show the characteristic distribution of fragment lengths, with high peaks for very short fragments (< 150 bp) and smaller bumps for fragment lengths that are representative of mono-, di-, and tri-nucleosomes. In fact, in the data I've looked at, we barely see any fragments greater than 300 bp, which does not surprise me given Illumina's inherent preference for short fragments.
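To make the fragment-length reasoning above concrete, here is a minimal sketch that bins fragment lengths into size classes. The boundaries (150, 300 and 500 bp) are assumptions loosely based on the numbers in this post, not an established standard, and `classify_fragment` and `summarize` are hypothetical helper names. The lengths would typically come from the template-length field of properly paired alignments.

```python
from collections import Counter

# Assumed size classes in bp (half-open intervals); illustrative only.
SIZE_CLASSES = [
    (0, 150, "sub-nucleosomal"),
    (150, 300, "mono-nucleosome"),
    (300, 500, "di-nucleosome"),
]

def classify_fragment(length):
    """Return the assumed size class for one non-negative fragment length in bp."""
    for low, high, label in SIZE_CLASSES:
        if low <= length < high:
            return label
    return "longer"

def summarize(lengths):
    """Fraction of fragments falling into each size class."""
    # Template lengths can be negative for the reverse mate, hence abs().
    counts = Counter(classify_fragment(abs(n)) for n in lengths)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy input, not real data.
example = [45, 60, 80, 95, 180, 210, -190, 340, 620]
print(summarize(example))
```

Comparing these fractions between samples is a quick sanity check, but it says nothing about where in the genome the fragments come from.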

So, here are my questions:

1. What is the background signal in ATAC-seq?

Since closed chromatin will lead to larger fragments, these will never be seen in the same quantities as the short fragments even if they are there (and surely, the majority of the genome is, in fact, not nucleosome-free as the ATAC-seq histograms may suggest - or is it?). Have you seen ATAC-seq data where the entire genome is somewhat uniformly covered with strong enrichments around the promoters?

2. What is the peak calling meant to achieve?

I come from the ChIP-seq world where peak calling is your best shot at zooming into regions that are not just open chromatin, but actually binding sites of the transcription factor you were trying to precipitate. Peak callers like MACS try to understand what the background signal is (i.e., the majority of reads covering most of the genome) and then pinpoint regions that are at the extremes of that background model. For ATAC-seq, since I'm not sure what the background signal is supposed to be (closed chromatin is definitely under-represented), what is the peak calling really meant for? And is MACS actually an appropriate means to that end, given that there's no real uniform coverage?
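The idea of "a background model plus regions at its extremes" can be sketched with a single genome-wide Poisson rate. This is a deliberate simplification for illustration and not MACS's actual algorithm, which is more elaborate (for example, it also estimates the background locally). The function names, the toy counts and the `alpha` cutoff are all assumptions.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def call_enriched_windows(counts, alpha=1e-3):
    """Indices of windows whose read count is extreme under one global Poisson rate."""
    lam = sum(counts) / len(counts)   # background rate = mean count per window
    return [i for i, c in enumerate(counts) if poisson_sf(c, lam) < alpha]

# Toy read counts per fixed-size window; window 4 is the only clear outlier.
counts = [2, 3, 1, 2, 40, 3, 2, 1, 2, 3]
print(call_enriched_windows(counts))
```

Note that the global mean is itself inflated by the enriched window, which is exactly the concern raised above: if most reads sit in peaks, the "background" estimate depends heavily on how it is defined.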

3. Is ATAC-seq more similar to RNA-seq than to ChIP-seq?

Following this line of thought, should one think of ATAC-seq more in terms of RNA-seq analysis than of ChIP-seq analysis? After all, it seems to me that ATAC-seq peaks may be equivalent to "expressed genes" (because unexpressed genes are also usually missing from RNA-seq), and the analysis should really focus on the differential read counts for the same region in two samples. If that is the case, this opens a whole other can of worms (e.g. defining the regions, normalizing read counts, number of replicates etc.) that should probably be discussed in a different thread.
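The "RNA-seq-like" view can be sketched as follows: count reads in a fixed set of regions for each sample, normalize for library size, and compare. This is only a toy illustration under stated assumptions: library size is taken as the sum of counts over the listed regions, the pseudocount of 1 is arbitrary, and the function names are hypothetical. A real analysis would need replicates and a proper statistical model, as in dedicated count-based packages.

```python
import math

def counts_per_million(counts):
    """Scale raw region counts by library size (here: total counts in this sample)."""
    total = sum(counts.values())
    return {region: 1e6 * n / total for region, n in counts.items()}

def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """log2(B / A) per shared region after counts-per-million normalization."""
    cpm_a = counts_per_million(sample_a)
    cpm_b = counts_per_million(sample_b)
    return {
        region: math.log2((cpm_b[region] + pseudocount) / (cpm_a[region] + pseudocount))
        for region in cpm_a
        if region in cpm_b
    }

# Toy counts: the treated sample was sequenced deeper and gains signal at peak1.
control = {"peak1": 100, "peak2": 100, "peak3": 800}
treated = {"peak1": 600, "peak2": 200, "peak3": 1200}
for region, lfc in log2_fold_changes(control, treated).items():
    print(region, round(lfc, 2))
```

Because the normalization uses the total over these regions, a strong gain in one region pushes the others down slightly, which is one of the normalization questions mentioned above.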

I appreciate any insights and critical comments!

Cheers,

Friederike

sequencing ATAC-seq peak calling • 18k views
4.2 years ago
jotan ★ 1.2k

ATAC-seq provides a map of open chromatin, much like a map of DNase hypersensitive sites. The analysis is similar to ChIP-seq. The interpretation is a hybrid between ChIP-seq and RNA-seq.

Q1. The background signal in ATAC-seq represents the same thing as the background signal in ChIP-seq. Random stochastic noise.

Q2. This means the peak calling also serves the same purpose as in ChIP-seq. As far as the analysis goes, peak calling allows discrimination between signal and noise.

Q3. It depends on what you're trying to do. There are multiple ways to use ATAC-seq data. Most rely on the assumption that open chromatin reflects gene expression and/or meaningful binding. In the original paper, the authors performed ATAC-seq on a small number of primary cells (ChIP-seq is not possible on such a small number of cells). Then they overlapped a composite of transcription factor ChIP-seq data obtained from other samples and experiments, and used this overlap to predict TF binding in the small sample profiled by ATAC-seq.

So the workflow was:

1) Use ATAC-seq to identify regions of open chromatin.
2) Overlap with existing ChIP-seq datasets.
3) Predict TF binding in the ATAC-seq sample (based on the assumption that TF binding will only occur at sites of open chromatin).
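The three steps above boil down to an interval overlap, which can be sketched as below. This is not the original paper's code: the function names, the TF labels and the coordinates are made up for illustration, and intervals are assumed to be half-open (chrom, start, end) tuples. The all-against-all comparison is fine for a toy example; real peak sets are usually intersected with dedicated interval tools.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def predict_tf_binding(atac_peaks, chip_peaks_by_tf):
    """Map each ATAC-seq peak to the TFs whose ChIP-seq peaks overlap it."""
    predictions = {}
    for peak in atac_peaks:
        predictions[peak] = sorted(
            tf
            for tf, chip_peaks in chip_peaks_by_tf.items()
            if any(overlaps(peak, chip) for chip in chip_peaks)
        )
    return predictions

# Hypothetical open-chromatin regions and ChIP-seq peaks from other experiments.
atac = [("chr1", 1000, 1500), ("chr1", 5000, 5400), ("chr2", 200, 700)]
chip = {
    "TF_A": [("chr1", 1400, 1600)],
    "TF_B": [("chr1", 900, 1100), ("chr2", 650, 900)],
}
for peak, tfs in predict_tf_binding(atac, chip).items():
    print(peak, tfs)
```

An ATAC-seq peak with an empty list is open chromatin with no supporting ChIP-seq evidence in the chosen datasets, which is not the same as "no TF bound".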

3
Entering edit mode

Thanks for sharing!

The background signal in ATAC-seq represents the same thing as the background signal in ChIP-seq. Random stochastic noise.

Where is that coming from, though? In ChIP-seq, DNA is most commonly fragmented using sonication, and fragments are size-selected prior to sequencing. While this is not completely random, we tend to see virtually the entire genome covered, which indicates to me that the sonication eventually manages to break even nucleosomal DNA apart. For ATAC-seq there's no size selection, and my perhaps naive impression was that the transposase is not really going to unravel nucleosomal DNA, so while it can cut in closed regions, the resulting fragments will be so long that they will hardly be sequenced. Are you saying that, at least for bulk ATAC-seq, the transposase seems to be able to integrate in generally closed chromatin regions (that may be open stochastically in individual cells), therefore generating short fragments from closed regions that will show up in the sequenced reads?

I can see how your workflow makes total sense. In my case, however, people are not necessarily interested in specific TFs, but just want to see whether their experimental perturbations lead to changes in chromatin accessibility. In that regard I would also be interested to know whether you think that actual changes in peak height (after somewhat accounting for differences in sequencing depth) are meaningful (in ChIP-seq, I would be very hesitant to interpret them because the enrichment depends on so many technical factors). Now that I'm thinking about it - why _are_ the promoters (and enhancers) so dramatically enriched anyway? Does that imply that the gene bodies are never as "open" as the promoters although they need to accommodate the entire transcription machinery?

4.2 years ago
Charles Plessy ★ 2.7k

Regarding the background, in single-cell ATAC-seq libraries I have seen some cells displaying very broad regions (dozens of loci) with dense and uniform coverage. I do not know if it has biological meaning or if the cells in question were starting to die. In any case, one can suspect that in bulk ATAC-seq libraries, this would create some uniform background that is not related to the usual peaks that we see near promoters and elsewhere. In addition, I have seen regions where the aligner (bwa sampe in my case) was mapping dozens if not hundreds of pairs, which is obviously incorrect in paired-end single-cell data (after PCR deduplication), where we expect a density of coverage between 0 and 4. Thus, one can also suspect that in bulk libraries, some peaks are mapping artifacts.

Edit: I need to tone down the statements above, which were based on a run made with cells that we suspect to be damaged. In higher-quality runs, broad regions of uniform coverage are rare (upon visual inspection), see my reply below. This may still account for noise in bulk libraries. Also, regions with over-coverage do not have hundreds of pairs, but still definitely more than 4. Typical regions are rRNA repeats.
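The "more than 4" criterion lends itself to a simple check. Below is a minimal sketch that scans the deduplicated fragments of one cell and reports stretches where the depth exceeds a threshold. The threshold of 4 comes from the post above; the input format (half-open (chrom, start, end) tuples), the function name and the toy coordinates are assumptions for illustration.

```python
from collections import defaultdict

def overcovered_regions(fragments, max_expected=4):
    """Return (chrom, start, end) stretches where fragment depth exceeds max_expected."""
    events = defaultdict(list)
    for chrom, start, end in fragments:
        events[chrom].append((start, 1))    # fragment begins
        events[chrom].append((end, -1))     # fragment ends (half-open)
    regions = []
    for chrom, chrom_events in events.items():
        depth = 0
        region_start = None
        # At equal positions, ends (-1) sort before starts (+1).
        for pos, delta in sorted(chrom_events):
            depth += delta
            if depth > max_expected and region_start is None:
                region_start = pos
            elif depth <= max_expected and region_start is not None:
                regions.append((chrom, region_start, pos))
                region_start = None
    return regions

# Toy deduplicated fragments from a single cell; five of them pile up on chr1.
fragments = [
    ("chr1", 100, 200), ("chr1", 105, 210), ("chr1", 110, 220),
    ("chr1", 120, 230), ("chr1", 130, 240), ("chr1", 500, 600),
]
print(overcovered_regions(fragments))
```

Regions flagged this way in many cells (for example repeats) could then be masked before calling peaks in bulk data.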

1
Entering edit mode

in single-cell ATAC-seq libraries I have seen some cells displaying very broad regions (dozens of loci) with dense and uniform coverage

and what would the fragment sizes of the corresponding reads be? your interpretation is that these are fairly broad regions of open chromatin?

1
Entering edit mode

Just looking by eye, there are many fragments that can be as large as 500 bp or even longer. In one of our best-quality runs, there is only one cell that displays this broad coverage, but over extremely large portions of the genome (which may explain the large size of the fragments, as the quantity of Tn5 may be limiting if suddenly there are too many open regions). Please also see the edit to my comment above, which I needed to correct (sorry for this).

1
Entering edit mode

are these normal cells or cancer cells? just thinking whether these, too, could be mapping artifacts due to rearrangements or the like.

have you ever compared different conditions (e.g. drug treatments) with scATAC-seq? as I asked above, I am also trying to understand whether the height of the peak in bulk ATAC-seq can be used as a proxy for how open the region is. in bulk ATAC-seq I would expect that the signal height is somewhat correlated to the number of cells that have open chromatin - is that confirmed by scATAC-seq?

1
Entering edit mode

At the moment I am still at the stage of technical controls, as I had some worries about possible cross-contamination (which would impact the answer to your question as well). The data I am talking about in this thread come from C1 runs loaded with a human-mouse mixture of hepatocyte cell lines. I took the time to produce these control runs because, to my knowledge, there is no such data available yet in the sequence databanks. I hope to make it public soon, before publication, and I would be pleased if you could have a look as well! I will post about it once it is done; please do not hesitate to ping me if nothing happens within 10 days.

1
Entering edit mode

that sounds like a great data set!
