I would like to create a signal-to-noise plot for ATAC-seq data (or any similar NGS technique) as shown below. It should center all TSS. I intend to extract the signal over the TSS in a 4kb window as signal and the signal over some random regions as background, and then divide these by each other for a signa-to-noise ratio per basepair or bin.
I was thinking about randomly selecting background regions like this: same total number of regions as the TSS, same length, similar GC-content, good mappability. Before I now start putting code together myself for that entire task, is there a tool that can do it or somewhat assist in it? Also, do you think extracting only one set of background regions is sufficient or should the process be repeated multiple times to get a better background representation (averaging signal of multiple sets)? I would appreciate if you could share your experiences.
EDIT (28.2.17): After neglecting this for a while, I now came up with this solution, which worked well for me:
-As reference regions, I took all refSeq TSS with a 500bp window around it and removed all regions, which either intersected with the standard ENCODE blacklist, a custom blacklist for ATAC-seq (Buenrostro 2015), or the complement of empirically mappable regions (https://www.encodeproject.org/comparative/chromatin/#mappability). => resulting in 24533 regions
-As background, I extracted the same number of regions from hg19, using BEDtools
shuffle, with the 24533 regions as -i, and the merge of the ENCODE/ATAC blacklists + the complement of the mappable regions as -excl (using the mappable regions as -incl is unusably slow in bedtools 2.26.0 as issued several times in GitHub, therefore I used a joint -excl list). Additionally, background regions were required to be on the same chromosome as the reference and without overlaps to other background regions (-chrom -noOverlapping). I did not further check the GC content to be similar between reference and background, but when using ~25000 regions, it probably does not really make any difference.
For plotting the read counts in a 4kb window around the center of the regions, I used
ScoreMatrix() from the Bioconductor package genomation, producing a matrix, in which each column represents one of the 4000 nucleotides in the window. The readcounts (rows) can then be quickly averaged by the
colMeans function or something appropriate from the matrixStats package. Dividing reference by background then gives the values to be plotted as "fold enrichment over input".