I've started dealing with ATAC-seq data and I have a couple of questions that hopefully some of you have pondered over, too.
My understanding is as follows:
- Tn5 will cut anywhere there is no nucleosome, so regions of generally open chromatin yield smallish fragments between 40 bp and ca. 100 bp. I assume this is simply a function of the presence of certain motifs that the transposase prefers?
- fragments of 150-300 bp represent events where the transposase managed to cut on both sides of a single nucleosome (of course, this can go on, as the transposase may also cut out two nucleosomes, resulting in fragments of 2 x 150 bp + x bp of linker DNA)
All publications that I've seen so far show the characteristic distribution of fragment lengths, with high peaks for very short fragments (< 150 bp) and smaller bumps at lengths representative of mono-, di-, and tri-nucleosomes. In fact, in the data I've looked at, we barely see any fragments greater than 300 bp, which does not surprise me given Illumina's inherent preference for short fragments.
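To make the size classes concrete, here is a minimal Python sketch of how I would bin fragment lengths. The cutoffs (150 bp and 300 bp) are just the rough ranges from my description above, not established thresholds, and the function names and example lengths are made up for illustration.

```python
from collections import Counter


def classify_fragment(length):
    """Assign a fragment length (bp) to a rough size class.

    The 150 bp and 300 bp cutoffs are assumptions taken from the
    approximate ranges in the post, not validated thresholds.
    """
    if length < 150:
        return "sub-nucleosomal"
    if length <= 300:
        return "mono-nucleosomal"
    return "multi-nucleosomal"


def summarize(lengths):
    """Return the fraction of fragments falling into each size class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}


# Made-up fragment lengths, purely for illustration
example_lengths = [45, 60, 80, 95, 120, 180, 200, 250, 340, 410]
fractions = summarize(example_lengths)
```

In real data the lengths would come from the insert sizes of properly paired alignments, and I would expect the first class to dominate.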
So, here are my questions:
1. What is the background signal in ATAC-seq?
Since closed chromatin will lead to larger fragments, these will never be seen in the same quantities as the short fragments even if they are there (and surely, the majority of the genome is, in fact, not nucleosome-free as the ATAC-seq histograms may suggest - or is it?). Have you seen ATAC-seq data where the entire genome is somewhat uniformly covered with strong enrichments around the promoters?
2. What is the peak calling meant to achieve?
I come from the ChIP-seq world, where peak calling is your best shot at zooming into regions that are not just open chromatin, but actual binding sites of the transcription factor you were trying to precipitate. Peak callers like MACS try to estimate the background signal (i.e., the bulk of reads covering most of the genome) and then pinpoint regions that lie at the extremes of that background model. For ATAC-seq, I'm not sure what the background signal is supposed to be, since closed chromatin is definitely under-represented. So what is the peak calling really meant for? And is MACS actually an appropriate means to that end, given that there's no real uniform coverage?
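To spell out what I mean by a background model, here is a rough Python sketch of the general idea: assume reads fall uniformly across the genome, derive an expected count per window, and ask how surprising the observed count is under a Poisson distribution. This is a deliberate simplification and not MACS's actual implementation (MACS estimates the rate locally rather than from a single genome-wide value); the function names and all numbers are hypothetical.

```python
import math


def expected_reads(total_reads, window_bp, genome_bp):
    """Expected reads per window if reads were spread uniformly.

    Uniform coverage is exactly the assumption being questioned
    for ATAC-seq; it is used here only to illustrate the idea.
    """
    return total_reads * window_bp / genome_bp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    # Guard against tiny negative values from floating-point rounding
    return max(0.0, 1.0 - cdf)


# Hypothetical numbers: 20 million reads, 200 bp windows, 2 Gb genome
lam = expected_reads(20_000_000, 200, 2_000_000_000)
p_value = poisson_upper_tail(20, lam)  # window with 20 observed reads
```

If closed chromatin contributes almost no reads, the uniform rate overestimates the true background in those regions, which is exactly why I am unsure the model fits.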
3. Is ATAC-seq more similar to RNA-seq than to ChIP-seq?
Following this line of thought, should one think of ATAC-seq more in terms of RNA-seq analysis than of ChIP-seq analysis? After all, it seems to me that ATAC-seq peaks may be equivalent to identifying "expressed genes" (because unexpressed genes are also usually missing from RNA-seq), and that the analysis should really focus on the differential read counts for the same region in two samples. If that is the case, this opens a whole new can of worms (e.g., defining the regions, normalizing read counts, number of replicate samples, etc.) that should probably be discussed in a different thread.
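As a toy Python illustration of what I mean by comparing read counts for the same region in two samples: scale each sample to counts per million, then take a log2 ratio per region. Library-size scaling and the pseudocount are assumptions on my part, the region names and counts are made up, and this ignores replicates and variance modeling entirely.

```python
import math


def counts_per_million(counts):
    """Scale raw per-region read counts to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads in any region")
    return {region: 1e6 * n / total for region, n in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per shared region; the pseudocount avoids log(0)."""
    shared = cpm_a.keys() & cpm_b.keys()
    return {
        region: math.log2(
            (cpm_b[region] + pseudocount) / (cpm_a[region] + pseudocount)
        )
        for region in shared
    }


# Made-up counts for three hypothetical regions in two samples
sample_a = {"region_1": 100, "region_2": 300, "region_3": 600}
sample_b = {"region_1": 400, "region_2": 600, "region_3": 1000}

lfc = log2_fold_change(
    counts_per_million(sample_a), counts_per_million(sample_b)
)
```

Here region_2 doubles in raw counts but has the same CPM in both samples, so its log2 fold change is zero, which is the kind of normalization question I have in mind.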
I appreciate any insights and critical comments!