ATAC-seq sample normalization
2
17
Entering edit mode
22 months ago
Flo ▴ 170

Hi everyone, first of all Iam a biologist so sorry for the (potentially) very basic question.

We performed ATAC-seq (treatment vs. control in duplicates) and processed the data through the Encode pipeline. The QC (especially the TSS enrichment) shows for both control samples a excellent result whereas both treatment samples are borderline ( but not so bad to trash it). Also the visual inspection of the peaks looks okay (very similar peak pattern between control and treatment). However we noticed that the peaks in the treatment samples are overall smaller compare to the control (bigwig TPM normalized) so we concluded that we have a potential enrichment bias. Since we do not knock down any global factor it is most likely a technical artefact. We continued to identify differential accessible regions using the csaw package ("trended" normalization). The resulting regions are highly enriched for binding motives of biological meaningful transcription factors (identified with HOMER) and also close to relevant genes.

So far so good... however now I have problems to visualise those regions. Since the TPM normalization of the bigwig files does not account for the global enrichment bias (correct ?) the plotted heatmaps of the differential accessible regions do not reflect the result from csaw (control has always higher signal compare to treatment). Please correct me when Iam wrong but in this case we could perform instead of the TPM normalization a quantile normalization right ? Is there a easy way to do this either with bam or bigwig files ? Otherwise I could get a count matrix from deeptools and feed it into the CQN package. However the cqn function requires a covariate and to be honest Iam not sure what this is in my data (sequencing depths ?). After the normalization I could convert the resulting file back into a bigwig file and us it for plotting.

Thanks for any suggestions.

ChIP-Seq atac-seq normalization • 6.7k views
36
Entering edit mode
22 months ago
ATpoint 55k

What you describe seems to be a difference in signal-to-noise ratio which is not uncommon.
This is where more elaborate normalization methods such as TMM from edgeR or RLE from DESeq2 come into play.

See the following links on why these methods are superior. The videos talk about RNA-seq but simply consider genes as "regions" in any assay such as ATAC-seq or ChIP-seq:

Also, this is where more simple methods such as TPM/RPKM/FPKM which only correct for total library read count
(=sequencing depth) typically make limited sense as they fail to correct for the systematic differences in library composition due to the differences in either the peak landscape or signal/noise ratio. I cannot comment on Quantile Normalization as I have no experience with it but I had good results with both TMM and RLE. Therefore my recommendation to routinely use something like edgeR to calculate scaling factors which you then use to divide the column 4 of your bedGraph/bigwig by.

BedGraph/Bigwig creation:

Among other tools this could be done by bamCoverage from deeptools or genomecov from bedtools. Both tools start from BAM files and offer plenty of options for customization.

Scaling Factors:

Here is a code suggestion in R to calculate scaling factors to scale the bedGraph/bigwig files. It assumes you have already created a count matrix with rows = regions and columns = samples, e.g. based on a list of peaks called from the underlying experiment. This could be done with tools such as macs2 or Genrich followed by featureCounts. It is assumed that the matrix of raw counts is now loaded in R as object raw.counts:

## edgeR:: calcNormFactors
NormFactor <- calcNormFactors(object = raw.counts, method = "TMM")

## if you prefer to use the DESeq2 strategy use method="RLE" instead

## raw library size:
LibSize <- colSums(raw.counts)

## calculate size factors:
SizeFactors <- NormFactor * LibSize / 1000000

SizeFactors.Reciprocal <- 1/SizeFactors


SizeFactors will now contain a factor for every sample which can be used to divide the 4th colun of a bedGraph/bigwig by. Both the aforementioned tools (bamCoverage and genomecov) have options though to directly scale the output by a factor (--scaleFactor or --scale respectively). !! Note though that these options will multiply rather than divide the 4th column with the factor, therefore you would need to provide the reciprocal as mentioned in the code chunk above.

Illustration:

Here is a picture I made a while ago addressing exactly this problem. It shows two biological replicates with different signal-to-noise ratios (replicate two has worse SN therefore overall lower peak counts as many reads fall into non-peak regions) for a region which I know is not differential. Using CPM (so per-million scaling based on only the read counts) shown in the top panel fails to correct for the systematic difference while TMM on the counts in peak regions (bottom panel) manages to correct for it. The middle panel is a bin-based normalization with edgeR suggested in the csaw package which might in certain situations be beneficial if library composition is very different between samples (check csaw vignette for details if you want), but lets not further discuss this here as it does not add to the topic right now.

Bigwigs in R:

If you want to visualize bigwigs in R one might want to check this handy tool that was recently posted here which seems to be easy to use yet powerful and visually appealing:

trackplot: Fast and minimal dependency standalone R script to generate IGV style locus tracks from bigWig files

0
Entering edit mode

Sorry I have to come back again with two question regarding the bin based normalization (I guess you mean the loess-based normalization right ?). Since I using also this method to get my differential peaks which works quite well I would also like to use this method to scale my bedgraph/bigwig files just to keep everything the same across the analysis (and to avoid any nasty reviewer comment ;) ). However I have two questions:

1. Before running normOffsets its recommended to filter windows which is just fine in order to identify differential peaks however for a signal track I want to scale all windows. Therefore should I perform normOffsets with all windows (so dont perform the filterWindows function) or should I calculated the scaling factors with the filtered file and set the scaling factor to "1" for all bins which got kicked out in filterWindows (although this seems to be a bit arbitrary) ?

2. After calculating the scaling factor which is the best way to use it ? Make a bedgraph file using deeptools with same bin sizes and then multiply each row with the scaling factor calculated by normOffsets ?

Thanks again, Flo

0
Entering edit mode

Sorry for the super late reply, I missed the comment.

I do not really understand 1). What I do (in the rare case I really use bin-based normalization) is simply to use the full count matrix based on the 10kb windows (generated with bedtools makewindows and then featureCounts and feed this into the above code snipped. I do not use the code provided in the csaw manual but create the countmatrix externally.

For 2) you have different options. The simplest is to feed to norm. factors into deeptools bamCoverage --scaleFactor as described above. Alternatively you can make a bedGraph and divide the column \$4 (so the coverages) by the norm. factor. Here is a Bioc thread where I asked the csaw developer about this to be sure I got it right: https://support.bioconductor.org/p/124180/

1
Entering edit mode

Hello, I have got the scale factor according to your method, but I'm not sure which is right. can you help me to judge which is right? thank you! here is my code

bamCoverage --bam sample147.final.bam -o sample147.TMM.bw --binSize 50 --scaleFactor 0.04000290 --numberOfProcessors 4

bamCoverage --bam sample147.final.bam -o sample147.TMM.bw --binSize 50 --normalizeUsing RPGC --effectiveGenomeSize 2864785220 --scaleFactor 0.04000290 --numberOfProcessors 4

0
Entering edit mode

I use the first approach.

0
Entering edit mode

you are right, thank you so much!

0
Entering edit mode

Hello, Thank you for your reply, it really helped me, I also have a question for the ChIP-seq data, I have two bam file (one is sample, the other is input), how can I normalize the data? how can I get the scale factor? May I get the bw file through bamCompare first ? and then get the rawcount file and use your TMM method ? or other correct way?

0
Entering edit mode

Input and IP samples are fundamentally different towards library composition and violate the assumptions of the normalization. I would therefore not try to process them in the described way. I have no hands-on experience with this as I never use inputs beyond the peak calling step. deeptools implements the SES normalization for your scenario afaik, maybe check their documentation.

0
Entering edit mode

OK，thank you very much

0
Entering edit mode

Impressive results, @ATpoint! Can we combine this approach with ERCC spike-in? I'd like to normalise RNA-seq bigwigs. Many thanks.

0
Entering edit mode
0
Entering edit mode

This is great! How would I make a standard count matrix? Using featureCounts?

1
Entering edit mode

I use featureCounts indeed.

0
Entering edit mode

Thank you @ATpoint for the very helpful tutorial. For the last two tracks of the IGV plot, did you plot scaled signal-to-noise ratios using TMM scaling factors or scaled counts normalized by scaling factors? If it's the latter, how do we account for background noise?

1
Entering edit mode

Hi, thank you, glad you find it helpful. The last two is the raw read counts divided by the scaling factor that comes out when running the above code on a count matrix based on the peaks. So that is not a ratio, it is normalized counts, it would be "the latter". I think this takes good care of noise because if you have higher noise in sampleA rather than sampleB then at the same total read count the peaks in A would be generally lower, and the TMM on the peak counts then kind of lift up the counts of all peaks to be comparable between samples. There is a great chapter in Aaron Lun's csaw book on normalization in the ChIP-seq (which applies to ATAC-seq as well) context that discusses the various aspects to consider, please see at http://bioconductor.org/books/3.13/csawBook/chap-norm.html

0
Entering edit mode
21 months ago
Malcolm.Cook ★ 1.3k

If you want to use quantile normalization with csaw, try this:

library(preprocessCore)       # which implements normalize.quantiles
normOffsetsNQ<- function (object, ..., assay.id = "counts") {
mat <- assay(object, i = assay.id, withDimnames = FALSE)
mat.cpm <- edgeR::cpm(mat, log = TRUE, ...)
NQ<-normalize.quantiles(mat.cpm)
assay(object, "offset") <- (NQ - mat.cpm)  / log2(exp(1))
return(object)
}


...and use it instead of normOffsets in your csaw recipe.

To visualize the effect of normOffsetsNQ, you are best off to visualize the logFC of each group (you do have replicates, right?) for each contrast of interest at each window. Do this after performing the glmQLFTest step in your recipe, but before mergeResults. A most effective visualization for IGV genome browser is using an .igv format file with rows for each filtered window and columns of logFC for each contrast displayed as a heatmap.

This IGV snapshot of ATAC-Seq results:

• displays .bw as line-plots which are overlayed by group using the brand new "Tracks > Overlay > Overlay by > group" allowing to eye-ball the within and between group variances quite nicely.
• shows heatmap of logFC in each window at each contrast of interest (in my case, the first track is contrasted with each of the remaining 5 tracks).
• shows a .bed file for each merged region at each contrast of interest.

I think it is a fool's errand to try and create a view of your TPM scaled bigwigs correspondingly quantile normalized. Alas perhaps some nasty reviewers will take you for a fool.

Good luck!

BTW - I intend soon to offer a pull request to the csaw project for this function.

0
Entering edit mode

Our of interest, is there a reference for normalizing per-million scaled data with QN rather than raw counts? From what I understand, e.g. here, one starts typically from integers, no? Or doesn't it make a difference? Not a statistician myself at all as you can see, so correct me if wrong.

0
Entering edit mode

normOffsetsNQ as written above actually normalizes the log of cpm (per the log=TRUE on the edgeR::cpm call). It then computes the offsets and converts them back into "natural base". In this I am following the advice of one of the author of csaw, Aaron Lun, private communication.

I am interested in references to argue for this approach. Here is an 2005 opinion from the other author of csaw, Gordon Smyth "There exist no clear results on whether it is best to carry out quantile normalization on the raw or log scale ..." discussed in the context of micro-array analysis.

0
Entering edit mode

I tired to use rawcount once but later resorted to HOMER for all the analysis, I want to know how do you get raw counts from ATAC seq data ? I tried deeptools but it takes a bit of time not sure why...apart from deeptools I tired to use feature count but not able to do it.It is bit more complicated than rna seq where I give bam file as input and get raw counts. In case or ATAC or chip seq I have to give a saf format etc.So anyother methods for taking out raw counts which can be easily used downstream with library like deseq2 for differential accessibility analysis .

DIffbind I tried it uses lots of resources and time

2
Entering edit mode

It is actually quite simple to use featurecounts once you figure out how, see my comment here on how to do that: It is blazingly fast due to multithreading.

C: ATAC-seq DE analysis

0
Entering edit mode

1
Entering edit mode

This discussion is primarily about the interplay of normalization and visualization in the context of the tool csaw which I am now using for ATAC-seq and find quite performant and flexible and all-in-one I recommend you read the paper & manual.