Question: ATAC-seq sample normalization
gravatar for Flo
14 months ago by
Flo100 wrote:

Hi everyone, first of all Iam a biologist so sorry for the (potentially) very basic question.

We performed ATAC-seq (treatment vs. control in duplicates) and processed the data through the Encode pipeline. The QC (especially the TSS enrichment) shows for both control samples a excellent result whereas both treatment samples are borderline ( but not so bad to trash it). Also the visual inspection of the peaks looks okay (very similar peak pattern between control and treatment). However we noticed that the peaks in the treatment samples are overall smaller compare to the control (bigwig TPM normalized) so we concluded that we have a potential enrichment bias. Since we do not knock down any global factor it is most likely a technical artefact. We continued to identify differential accessible regions using the csaw package ("trended" normalization). The resulting regions are highly enriched for binding motives of biological meaningful transcription factors (identified with HOMER) and also close to relevant genes.

So far so good... however now I have problems to visualise those regions. Since the TPM normalization of the bigwig files does not account for the global enrichment bias (correct ?) the plotted heatmaps of the differential accessible regions do not reflect the result from csaw (control has always higher signal compare to treatment). Please correct me when Iam wrong but in this case we could perform instead of the TPM normalization a quantile normalization right ? Is there a easy way to do this either with bam or bigwig files ? Otherwise I could get a count matrix from deeptools and feed it into the CQN package. However the cqn function requires a covariate and to be honest Iam not sure what this is in my data (sequencing depths ?). After the normalization I could convert the resulting file back into a bigwig file and us it for plotting.

Thanks for any suggestions.

ADD COMMENTlink modified 6 months ago by ATpoint45k • written 14 months ago by Flo100
gravatar for ATpoint
14 months ago by
ATpoint45k wrote:

What you describe seems to be a difference in signal-to-noise ratio which is not uncommon.
This is where more elaborate normalization methods such as TMM from edgeR or RLE from DESeq2 come into play.

See the following links on why these methods are superior. The videos talk about RNA-seq but simply consider genes as "regions" in any assay such as ATAC-seq or ChIP-seq:

Also, this is where more simple methods such as TPM/RPKM/FPKM which only correct for total library read count
(=sequencing depth) typically make limited sense as they fail to correct for the systematic differences in library composition due to the differences in either the peak landscape or signal/noise ratio. I cannot comment on Quantile Normalization as I have no experience with it but I had good results with both TMM and RLE. Therefore my recommendation to routinely use something like edgeR to calculate scaling factors which you then use to divide the column 4 of your bedGraph/bigwig by.

BedGraph/Bigwig creation:

Among other tools this could be done by bamCoverage from deeptools or genomecov from bedtools. Both tools start from BAM files and offer plenty of options for customization.

Scaling Factors:

Here is a code suggestion in R to calculate scaling factors to scale the bedGraph/bigwig files. It assumes you have already created a count matrix with rows = regions and columns = samples, e.g. based on a list of peaks called from the underlying experiment. This could be done with tools such as macs2 or Genrich followed by featureCounts. It is assumed that the matrix of raw counts is now loaded in R as object raw.counts:

## edgeR:: calcNormFactors
NormFactor <- calcNormFactors(object = raw.counts, method = "TMM")

## if you prefer to use the DESeq2 strategy use method="RLE" instead

## raw library size:
LibSize <- colSums(raw.counts)

## calculate size factors:
SizeFactors <- NormFactor * LibSize / 1000000

## Reciprocal, please read section below:   
SizeFactors.Reciprocal <- 1/SizeFactors

SizeFactors will now contain a factor for every sample which can be used to divide the 4th colun of a bedGraph/bigwig by. Both the aforementioned tools (bamCoverage and genomecov) have options though to directly scale the output by a factor (--scaleFactor or --scale respectively). !! Note though that these options will multiply rather than divide the 4th column with the factor, therefore you would need to provide the reciprocal as mentioned in the code chunk above.


Here is a picture I made a while ago addressing exactly this problem. It shows two biological replicates with different signal-to-noise ratios (replicate two has worse SN therefore overall lower peak counts as many reads fall into non-peak regions) for a region which I know is not differential. Using CPM (so per-million scaling based on only the read counts) shown in the top panel fails to correct for the systematic difference while TMM on the counts in peak regions (bottom panel) manages to correct for it. The middle panel is a bin-based normalization with edgeR suggested in the csaw package which might in certain situations be beneficial if library composition is very different between samples (check csaw vignette for details if you want), but lets not further discuss this here as it does not add to the topic right now.

enter image description here

Bigwigs in R:

If you want to visualize bigwigs in R one might want to check this handy tool that was recently posted here which seems to be easy to use yet powerful and visually appealing:

trackplot: Fast and minimal dependency standalone R script to generate IGV style locus tracks from bigWig files

ADD COMMENTlink modified 12 weeks ago • written 14 months ago by ATpoint45k

Sorry I have to come back again with two question regarding the bin based normalization (I guess you mean the loess-based normalization right ?). Since I using also this method to get my differential peaks which works quite well I would also like to use this method to scale my bedgraph/bigwig files just to keep everything the same across the analysis (and to avoid any nasty reviewer comment ;) ). However I have two questions:

  1. Before running normOffsets its recommended to filter windows which is just fine in order to identify differential peaks however for a signal track I want to scale all windows. Therefore should I perform normOffsets with all windows (so dont perform the filterWindows function) or should I calculated the scaling factors with the filtered file and set the scaling factor to "1" for all bins which got kicked out in filterWindows (although this seems to be a bit arbitrary) ?

  2. After calculating the scaling factor which is the best way to use it ? Make a bedgraph file using deeptools with same bin sizes and then multiply each row with the scaling factor calculated by normOffsets ?

Thanks again, Flo

ADD REPLYlink written 13 months ago by Flo100

Sorry for the super late reply, I missed the comment.

I do not really understand 1). What I do (in the rare case I really use bin-based normalization) is simply to use the full count matrix based on the 10kb windows (generated with bedtools makewindows and then featureCounts and feed this into the above code snipped. I do not use the code provided in the csaw manual but create the countmatrix externally.

For 2) you have different options. The simplest is to feed to norm. factors into deeptools bamCoverage --scaleFactor as described above. Alternatively you can make a bedGraph and divide the column $4 (so the coverages) by the norm. factor. Here is a Bioc thread where I asked the csaw developer about this to be sure I got it right:

ADD REPLYlink written 11 months ago by ATpoint45k

Hello, Thank you for your reply, it really helped me, I also have a question for the ChIP-seq data, I have two bam file (one is sample, the other is input), how can I normalize the data? how can I get the scale factor? May I get the bw file through bamCompare first ? and then get the rawcount file and use your TMM method ? or other correct way?

ADD REPLYlink written 11 months ago by zhuobaowen10

Input and IP samples are fundamentally different towards library composition and violate the assumptions of the normalization. I would therefore not try to process them in the described way. I have no hands-on experience with this as I never use inputs beyond the peak calling step. deeptools implements the SES normalization for your scenario afaik, maybe check their documentation.

ADD REPLYlink written 11 months ago by ATpoint45k

OKļ¼Œthank you very much

ADD REPLYlink written 11 months ago by zhuobaowen10

Hello, I have got the scale factor according to your method, but I'm not sure which is right. can you help me to judge which is right? thank you! here is my code

bamCoverage --bam -o --binSize 50 --scaleFactor 0.04000290 --numberOfProcessors 4

bamCoverage --bam -o --binSize 50 --normalizeUsing RPGC --effectiveGenomeSize 2864785220 --scaleFactor 0.04000290 --numberOfProcessors 4
ADD REPLYlink modified 11 months ago • written 11 months ago by zhuobaowen10

I use the first approach.

ADD REPLYlink written 11 months ago by ATpoint45k

you are right, thank you so much!

ADD REPLYlink written 11 months ago by zhuobaowen10

Impressive results, @ATpoint! Can we combine this approach with ERCC spike-in? I'd like to normalise RNA-seq bigwigs. Many thanks.

ADD REPLYlink written 10 months ago by horvath.attila44440

I guess you can: How does edgeR do ERCC spike-in normalization?

ADD REPLYlink written 6 months ago by ATpoint45k

This is great! How would I make a standard count matrix? Using featureCounts?

ADD REPLYlink written 6 months ago by hkitano0

I use featureCounts indeed.

ADD REPLYlink written 6 months ago by ATpoint45k
gravatar for Malcolm.Cook
13 months ago by
kansas, usa
Malcolm.Cook1.2k wrote:

If you want to use quantile normalization with csaw, try this:

library(preprocessCore)       # which implements normalize.quantiles
normOffsetsNQ<- function (object, ..., = "counts") {
  mat <- assay(object, i =, withDimnames = FALSE)
  mat.cpm <- edgeR::cpm(mat, log = TRUE, ...)
  assay(object, "offset") <- (NQ - mat.cpm)  / log2(exp(1))

...and use it instead of normOffsets in your csaw recipe.

To visualize the effect of normOffsetsNQ, you are best off to visualize the logFC of each group (you do have replicates, right?) for each contrast of interest at each window. Do this after performing the glmQLFTest step in your recipe, but before mergeResults. A most effective visualization for IGV genome browser is using an .igv format file with rows for each filtered window and columns of logFC for each contrast displayed as a heatmap.

Like this

This IGV snapshot of ATAC-Seq results:

  • displays .bw as line-plots which are overlayed by group using the brand new "Tracks > Overlay > Overlay by > group" allowing to eye-ball the within and between group variances quite nicely.
  • shows heatmap of logFC in each window at each contrast of interest (in my case, the first track is contrasted with each of the remaining 5 tracks).
  • shows a .bed file for each merged region at each contrast of interest.

I think it is a fool's errand to try and create a view of your TPM scaled bigwigs correspondingly quantile normalized. Alas perhaps some nasty reviewers will take you for a fool.

Good luck!

BTW - I intend soon to offer a pull request to the csaw project for this function.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Malcolm.Cook1.2k

Our of interest, is there a reference for normalizing per-million scaled data with QN rather than raw counts? From what I understand, e.g. here, one starts typically from integers, no? Or doesn't it make a difference? Not a statistician myself at all as you can see, so correct me if wrong.

ADD REPLYlink modified 13 months ago • written 13 months ago by ATpoint45k

normOffsetsNQ as written above actually normalizes the log of cpm (per the log=TRUE on the edgeR::cpm call). It then computes the offsets and converts them back into "natural base". In this I am following the advice of one of the author of csaw, Aaron Lun, private communication.

I am interested in references to argue for this approach. Here is an 2005 opinion from the other author of csaw, Gordon Smyth "There exist no clear results on whether it is best to carry out quantile normalization on the raw or log scale ..." discussed in the context of micro-array analysis.

ADD REPLYlink written 13 months ago by Malcolm.Cook1.2k

I tired to use rawcount once but later resorted to HOMER for all the analysis, I want to know how do you get raw counts from ATAC seq data ? I tried deeptools but it takes a bit of time not sure why...apart from deeptools I tired to use feature count but not able to do it.It is bit more complicated than rna seq where I give bam file as input and get raw counts. In case or ATAC or chip seq I have to give a saf format etc.So anyother methods for taking out raw counts which can be easily used downstream with library like deseq2 for differential accessibility analysis .

DIffbind I tried it uses lots of resources and time

ADD REPLYlink written 13 months ago by krushnach80890

It is actually quite simple to use featurecounts once you figure out how, see my comment here on how to do that: It is blazingly fast due to multithreading.

C: ATAC-seq DE analysis

ADD REPLYlink written 13 months ago by ATpoint45k

I asked my doubt on that thread ..

ADD REPLYlink written 13 months ago by krushnach80890

This discussion is primarily about the interplay of normalization and visualization in the context of the tool csaw which I am now using for ATAC-seq and find quite performant and flexible and all-in-one I recommend you read the paper & manual.

ADD REPLYlink written 13 months ago by Malcolm.Cook1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2197 users visited in the last hour