Question: Signal-to-Noise Plots for NGS Data; Selection of Background Regions
gravatar for ATpoint
2.5 years ago by
ATpoint17k wrote:

Hello there,

I would like to create a signal-to-noise plot for ATAC-seq data (or any similar NGS technique) as shown below. It should center all TSS. I intend to extract the signal over the TSS in a 4kb window as signal and the signal over some random regions as background, and then divide these by each other for a signa-to-noise ratio per basepair or bin.

I was thinking about randomly selecting background regions like this: same total number of regions as the TSS, same length, similar GC-content, good mappability. Before I now start putting code together myself for that entire task, is there a tool that can do it or somewhat assist in it? Also, do you think extracting only one set of background regions is sufficient or should the process be repeated multiple times to get a better background representation (averaging signal of multiple sets)? I would appreciate if you could share your experiences.


EDIT (28.2.17): After neglecting this for a while, I now came up with this solution, which worked well for me:

-As reference regions, I took all refSeq TSS with a 500bp window around it and removed all regions, which either intersected with the standard ENCODE blacklist, a custom blacklist for ATAC-seq (Buenrostro 2015), or the complement of empirically mappable regions ( => resulting in 24533 regions

-As background, I extracted the same number of regions from hg19, using BEDtools shuffle, with the 24533 regions as -i, and the merge of the ENCODE/ATAC blacklists + the complement of the mappable regions as -excl (using the mappable regions as -incl is unusably slow in bedtools 2.26.0 as issued several times in GitHub, therefore I used a joint -excl list). Additionally, background regions were required to be on the same chromosome as the reference and without overlaps to other background regions (-chrom -noOverlapping). I did not further check the GC content to be similar between reference and background, but when using ~25000 regions, it probably does not really make any difference.

For plotting the read counts in a 4kb window around the center of the regions, I used ScoreMatrix() from the Bioconductor package genomation, producing a matrix, in which each column represents one of the 4000 nucleotides in the window. The readcounts (rows) can then be quickly averaged by the colMeans function or something appropriate from the matrixStats package. Dividing reference by background then gives the values to be plotted as "fold enrichment over input".


atac-seq chip-seq ngs • 1.5k views
ADD COMMENTlink modified 2.3 years ago • written 2.5 years ago by ATpoint17k
gravatar for Designerpot
2.5 years ago by
Designerpot0 wrote:

When I was doing this type of analysis, I would use the region just outside the 4kb window to normalize against. In other words, I would calculate the signal in a 8kb window around the TSS, and then divide your signal in the center 4kb window by the signal in the outer two 2kb windows.

The big advantage of this is that it's very easy to code - you don't have to look around the genome for GC-content, mappability and other stuff after removing TSS-regions from the genome. You already have the code you need.

That ease comes with a price, of course, namely that the outer 2kb windows may not be as well-behaved as you'd like. I'd deal with that by seeing how similar the outer windows are to the 4kb TSS-windows in terms of GC-content, mappability etc.

ADD COMMENTlink written 2.5 years ago by Designerpot0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 665 users visited in the last hour