Question: ATAC-seq sample normalization
7
gravatar for Flo
10 months ago by
Flo70
Flo70 wrote:

Hi everyone, first of all Iam a biologist so sorry for the (potentially) very basic question.

We performed ATAC-seq (treatment vs. control in duplicates) and processed the data through the Encode pipeline. The QC (especially the TSS enrichment) shows for both control samples a excellent result whereas both treatment samples are borderline ( but not so bad to trash it). Also the visual inspection of the peaks looks okay (very similar peak pattern between control and treatment). However we noticed that the peaks in the treatment samples are overall smaller compare to the control (bigwig TPM normalized) so we concluded that we have a potential enrichment bias. Since we do not knock down any global factor it is most likely a technical artefact. We continued to identify differential accessible regions using the csaw package ("trended" normalization). The resulting regions are highly enriched for binding motives of biological meaningful transcription factors (identified with HOMER) and also close to relevant genes.

So far so good... however now I have problems to visualise those regions. Since the TPM normalization of the bigwig files does not account for the global enrichment bias (correct ?) the plotted heatmaps of the differential accessible regions do not reflect the result from csaw (control has always higher signal compare to treatment). Please correct me when Iam wrong but in this case we could perform instead of the TPM normalization a quantile normalization right ? Is there a easy way to do this either with bam or bigwig files ? Otherwise I could get a count matrix from deeptools and feed it into the CQN package. However the cqn function requires a covariate and to be honest Iam not sure what this is in my data (sequencing depths ?). After the normalization I could convert the resulting file back into a bigwig file and us it for plotting.

Thanks for any suggestions.

ADD COMMENTlink modified 8 weeks ago by ATpoint40k • written 10 months ago by Flo70

Sorry for my late reply but I finally had time to try your approach and it worked :). Thanks a lot for your suggestion ! I will nevertheless try also the bin approach but so far the results look already good :).

ADD REPLYlink modified 9 months ago • written 9 months ago by Flo70

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLYlink written 9 months ago by genomax91k

Glad to help. I always try both approaches in parallel and then check both on the genome browser and using MA-plots which makes more sense.

ADD REPLYlink written 9 months ago by ATpoint40k
19
gravatar for ATpoint
10 months ago by
ATpoint40k
Germany
ATpoint40k wrote:

What you describe seems to be a difference in signal-to-noise ratio which is not uncommon.
This is where elaborate normalization methods such as TMM from edgeR or RLE from DESeq2 come into play.

See the following links on why these methods are superior. The videos talk about RNA-seq but simply consider genes as "regions" in any assay such as ATAC-seq or ChIP-seq:

Also, this is where more simplistic methods such as TPM/RPKM/FPKM which only correct for read count
(=sequencing depth) typically make little sense as they fail to correct for the systematic differences in library composition due to the differences in either the peak landscape or signal/noise ratio. I cannot comment on Quantile Normalization as I have no experience with it but I had good results with both TMM and RLE. Therefore my recommendation to routinely use something like edgeR to calculate scaling factors which you then use to divide the column 4 of your bedGraph/bigwig by. Once you obtained the scaling factors you can use tools such as bamCoverage from deepTools to create and scale your browser tracks (bigwig(bedGraph).

Here is a code suggestion in R. It assumes you have a standard count matrix with rows = regions and columns = samples with the raw counts as object raw.counts.

## edgeR:: calcNormFactors
tmp.NormFactors <- calcNormFactors(object = raw.counts, method = "TMM")

## if you prefer to use the DESeq2 strategy use method="RLE" instead

## raw library size:
tmp.LibSize <- colSums(raw.counts)

## calculate size factors:
SizeFactors <- tmp.NormFactors * tmp.LibSize / 1000000

## !!  NOTE!! if you are going to use these factors with the 
##     --scaleFactor option in bamCoverage from deepTools it must be:

SizeFactors.forDeepTools <- 1/SizeFactors

##     because the tool multiplies values with the provided factor
##     while the factors as calculated above are intended for division 
##

SizeFactors will now contain a scaling factor for every sample. You can now divide the count column (raw counts) of your bedGraph for each sample by its scaling factor to get normalized browser tracks. This could by done directly from a BAM file with bamCoverage from deepTools using the --scaleFactor option. Note though that --scaleFactor will multiply the counts with the provided factor, but the edgeR factors are meant for division so you would need to provide the reciprocal value to --scaleFactor.


Here is a picture I made a while ago addressing exactly this problem. It shows two biological replicates with different signal-to-noise ratios (replicate two has worse SN therefore overall lower peak counts as many reads fall into non-peak regions) for a region which I know is not differential. Using CPM (so per-million scaling based on only the read counts) shown in the top panel fails to correct for the systematic difference while TMM on the counts in peak regions (bottom panel) manages to correct for it. The middle panel is a bin-based normalization with edgeR suggested in the csaw package which might in certain situations be beneficial if library composition is very different between samples (check csaw vignette for details if you want), but lets not further discuss this here as it does not add to the topic right now.

enter image description here

ADD COMMENTlink modified 8 weeks ago • written 10 months ago by ATpoint40k

Sorry I have to come back again with two question regarding the bin based normalization (I guess you mean the loess-based normalization right ?). Since I using also this method to get my differential peaks which works quite well I would also like to use this method to scale my bedgraph/bigwig files just to keep everything the same across the analysis (and to avoid any nasty reviewer comment ;) ). However I have two questions:

  1. Before running normOffsets its recommended to filter windows which is just fine in order to identify differential peaks however for a signal track I want to scale all windows. Therefore should I perform normOffsets with all windows (so dont perform the filterWindows function) or should I calculated the scaling factors with the filtered file and set the scaling factor to "1" for all bins which got kicked out in filterWindows (although this seems to be a bit arbitrary) ?

  2. After calculating the scaling factor which is the best way to use it ? Make a bedgraph file using deeptools with same bin sizes and then multiply each row with the scaling factor calculated by normOffsets ?

Thanks again, Flo

ADD REPLYlink written 9 months ago by Flo70

Sorry for the super late reply, I missed the comment.

I do not really understand 1). What I do (in the rare case I really use bin-based normalization) is simply to use the full count matrix based on the 10kb windows (generated with bedtools makewindows and then featureCounts and feed this into the above code snipped. I do not use the code provided in the csaw manual but create the countmatrix externally.

For 2) you have different options. The simplest is to feed to norm. factors into deeptools bamCoverage --scaleFactor as described above. Alternatively you can make a bedGraph and divide the column $4 (so the coverages) by the norm. factor. Here is a Bioc thread where I asked the csaw developer about this to be sure I got it right: https://support.bioconductor.org/p/124180/

ADD REPLYlink written 7 months ago by ATpoint40k

Hello, Thank you for your reply, it really helped me, I also have a question for the ChIP-seq data, I have two bam file (one is sample, the other is input), how can I normalize the data? how can I get the scale factor? May I get the bw file through bamCompare first ? and then get the rawcount file and use your TMM method ? or other correct way?

ADD REPLYlink written 7 months ago by zhuobaowen10

Input and IP samples are fundamentally different towards library composition and violate the assumptions of the normalization. I would therefore not try to process them in the described way. I have no hands-on experience with this as I never use inputs beyond the peak calling step. deeptools implements the SES normalization for your scenario afaik, maybe check their documentation.

ADD REPLYlink written 7 months ago by ATpoint40k

OKļ¼Œthank you very much

ADD REPLYlink written 7 months ago by zhuobaowen10

Hello, I have got the scale factor according to your method, but I'm not sure which is right. can you help me to judge which is right? thank you! here is my code

bamCoverage --bam sample147.final.bam -o sample147.TMM.bw --binSize 50 --scaleFactor 0.04000290 --numberOfProcessors 4

bamCoverage --bam sample147.final.bam -o sample147.TMM.bw --binSize 50 --normalizeUsing RPGC --effectiveGenomeSize 2864785220 --scaleFactor 0.04000290 --numberOfProcessors 4
ADD REPLYlink modified 7 months ago • written 7 months ago by zhuobaowen10

I use the first approach.

ADD REPLYlink written 7 months ago by ATpoint40k

you are right, thank you so much!

ADD REPLYlink written 6 months ago by zhuobaowen10

Impressive results, @ATpoint! Can we combine this approach with ERCC spike-in? I'd like to normalise RNA-seq bigwigs. Many thanks.

ADD REPLYlink written 6 months ago by horvath.attila44440

I guess you can: How does edgeR do ERCC spike-in normalization?

ADD REPLYlink written 10 weeks ago by ATpoint40k

This is great! How would I make a standard count matrix? Using featureCounts?

ADD REPLYlink written 11 weeks ago by hkitano0

I use featureCounts indeed.

ADD REPLYlink written 10 weeks ago by ATpoint40k
0
gravatar for Malcolm.Cook
9 months ago by
Malcolm.Cook1.1k
kansas, usa
Malcolm.Cook1.1k wrote:

If you want to use quantile normalization with csaw, try this:

library(preprocessCore)       # which implements normalize.quantiles
normOffsetsNQ<- function (object, ..., assay.id = "counts") {
  mat <- assay(object, i = assay.id, withDimnames = FALSE)
  mat.cpm <- edgeR::cpm(mat, log = TRUE, ...)
  NQ<-normalize.quantiles(mat.cpm)
  assay(object, "offset") <- (NQ - mat.cpm)  / log2(exp(1))
  return(object)
}

...and use it instead of normOffsets in your csaw recipe.

To visualize the effect of normOffsetsNQ, you are best off to visualize the logFC of each group (you do have replicates, right?) for each contrast of interest at each window. Do this after performing the glmQLFTest step in your recipe, but before mergeResults. A most effective visualization for IGV genome browser is using an .igv format file with rows for each filtered window and columns of logFC for each contrast displayed as a heatmap.

Like this

This IGV snapshot of ATAC-Seq results:

  • displays .bw as line-plots which are overlayed by group using the brand new "Tracks > Overlay > Overlay by > group" allowing to eye-ball the within and between group variances quite nicely.
  • shows heatmap of logFC in each window at each contrast of interest (in my case, the first track is contrasted with each of the remaining 5 tracks).
  • shows a .bed file for each merged region at each contrast of interest.

I think it is a fool's errand to try and create a view of your TPM scaled bigwigs correspondingly quantile normalized. Alas perhaps some nasty reviewers will take you for a fool.

Good luck!

BTW - I intend soon to offer a pull request to the csaw project for this function.

ADD COMMENTlink modified 9 months ago • written 9 months ago by Malcolm.Cook1.1k

Our of interest, is there a reference for normalizing per-million scaled data with QN rather than raw counts? From what I understand, e.g. here, one starts typically from integers, no? Or doesn't it make a difference? Not a statistician myself at all as you can see, so correct me if wrong.

ADD REPLYlink modified 9 months ago • written 9 months ago by ATpoint40k

normOffsetsNQ as written above actually normalizes the log of cpm (per the log=TRUE on the edgeR::cpm call). It then computes the offsets and converts them back into "natural base". In this I am following the advice of one of the author of csaw, Aaron Lun, private communication.

I am interested in references to argue for this approach. Here is an 2005 opinion from the other author of csaw, Gordon Smyth "There exist no clear results on whether it is best to carry out quantile normalization on the raw or log scale ..." discussed in the context of micro-array analysis.

ADD REPLYlink written 9 months ago by Malcolm.Cook1.1k

I tired to use rawcount once but later resorted to HOMER for all the analysis, I want to know how do you get raw counts from ATAC seq data ? I tried deeptools but it takes a bit of time not sure why...apart from deeptools I tired to use feature count but not able to do it.It is bit more complicated than rna seq where I give bam file as input and get raw counts. In case or ATAC or chip seq I have to give a saf format etc.So anyother methods for taking out raw counts which can be easily used downstream with library like deseq2 for differential accessibility analysis .

DIffbind I tried it uses lots of resources and time

ADD REPLYlink written 9 months ago by krushnach80850
2

It is actually quite simple to use featurecounts once you figure out how, see my comment here on how to do that: It is blazingly fast due to multithreading.

C: ATAC-seq DE analysis

ADD REPLYlink written 9 months ago by ATpoint40k

I asked my doubt on that thread ..

ADD REPLYlink written 9 months ago by krushnach80850
1

This discussion is primarily about the interplay of normalization and visualization in the context of the tool csaw which I am now using for ATAC-seq and find quite performant and flexible and all-in-one I recommend you read the paper & manual.

ADD REPLYlink written 9 months ago by Malcolm.Cook1.1k

Did you submit this as a pull request? Currently looking for a method that can deal with non-linear biases and produces offsets compatible with edgeR. Any update on this? Thanks!

ADD REPLYlink written 6 weeks ago by ATpoint40k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1765 users visited in the last hour