Question

How to define open or close gene in ATAC-seq?

0

Entering edit mode

2.4 years ago

gglim ▴ 140

May I ask how do you use your ATAC-seq data for post-alignment analysis?

Here's what I'm doing and I have a few problems:

For example, there are two groups of samples, WT and KO. In order to define the peak status, I merge all bam files (all replicates of two groups) and call a TOTAL peak using MACS2.
Then in R, WT and KO peak file is used to define peaks status by ChIPpeakAnno::findOverlappingPeaks, respectively. if there's overlap between WT and TOTAL, this TOTAL peak is defined as "Open in WT".
ChIPpeakAnno::annoPeaks do the annotation part with parameters (bindingType = "startSite", bindingRegion = c(-3000,3000)).

Problem: A gene can be annotated by many peaks, which show different status, eg. "OpenToClose", "CloseToOpen","BothOpen","NeitherOpen".

Does it make sense if I use these genes for further analysis? such as GO enrichment, their expression levels in RNAseq. If not, how to do it in a better way?

ChIPpeakAnno ATAC-seq MACS2 • 1.1k views

ADD COMMENT • link 2.4 years ago by gglim ▴ 140

score 2 · Answer 1 · 2021-11-29

In my opinion the only truely relevant region to define a gene as "open" or "closed" is the proximal promoter region which typically is approximated as the 500bp upstream of the annotated transcription start site(s). For a confident statement of peak status between conditions you would need to perform a proper differential analysis. For this you could look into the DiffBind or csaw packages, or alternatively merge all peaks, make a count matrix (samples are columns, regions are rows) and then plug this into differential analysis frameworks such as DESeq2. Just checking whether a peak is present in one condition but not in the other is a naive approach that will miss many peaks that are detectable in both conditions, yet with very different read counts, so you only probably get the very top differential peaks with such as approach.