Normalize BigWigs for number of reads in peaks and sequencing depth
Zeel • 4.2 years ago

My first question here.

As mentioned by Corces et al. (Science. 2018 Oct 26;362(6413)), differences in the quality of ATAC-seq experiments result in varying percentages of reads in peaks. When comparing samples using only depth normalization (i.e. CPM), the "background" reads are weighted the same as reads falling within peaks. Therefore, samples with high background are artificially depressed.

I was wondering if we could use deepTools bamCoverage to normalize by reads in peaks and sequencing depth at the same time. Could we achieve that by blacklisting the background regions, setting the effective genome size to the total length of the analysed peaks, and multiplying by a scale factor? Something like this:

bamCoverage -p 4 --bam input.bam -o output.bw --binSize 50 --scaleFactor 30 --blackListFileName Background.bed --normalizeUsing CPM --ignoreDuplicates --minMappingQuality 30 --effectiveGenomeSize 108645043 --ignoreForNormalization chrX chrY chrM --extendReads
(where 108645043 is the total length of the analysed peaks)

Moreover, is --scaleFactor applied before or after the scaling from --normalizeUsing CPM? Are these two different scale factors?

Thanks

Tags: deepTools, ATAC-seq
ATpoint • 4.2 years ago

What this text snippet probably describes are differences in signal-to-noise ratio, which are quite common in NGS. Per-million normalization alone is often insufficient to correct for this. This is why naive normalization techniques such as RPKM (or any other simple per-million method) regularly fail in benchmarking studies when it comes to comparing different samples with each other. CPM in bamCoverage does exactly this. --scaleFactor lets you enter a custom scaling factor, while CPM calculates one based on total read counts; they cannot be used simultaneously.
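To make concrete what CPM corrects (and what it misses): the CPM factor is simply one million divided by the total mapped reads, so two equally deep libraries get the same factor even if their fractions of reads in peaks differ. A minimal sketch in R, with made-up numbers:

    # Hypothetical: two libraries of equal depth but different signal-to-noise
    total_reads   <- c(sampleA = 30e6, sampleB = 30e6)
    frac_in_peaks <- c(sampleA = 0.60, sampleB = 0.30)

    cpm_factor <- 1e6 / total_reads  # identical for both samples

    # Peak signal after CPM scaling: sampleA ends up with twice the
    # signal of sampleB despite equal sequencing depth, so the
    # high-background sample looks artificially depressed.
    total_reads * frac_in_peaks * cpm_factor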

Some people say that for visualization simple per-million correction might be acceptable. I disagree; see the link below for details and an example of why this can produce flawed results. Instead, use a more sophisticated method such as TMM from edgeR to account for both library size (= total read counts) and composition (differences in signal-to-noise ratio). For code examples see:

A: ATAC-seq sample normalization (quantile normalization)
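For orientation, here is a minimal sketch of that TMM route, assuming you already have a peak-by-sample count matrix (e.g. from featureCounts on a consensus peak set; the object name `counts` is a placeholder):

    library(edgeR)

    # counts: integer matrix, rows = consensus peaks, columns = samples
    dge <- DGEList(counts = counts)
    dge <- calcNormFactors(dge, method = "TMM")

    # The effective library size combines sequencing depth (lib.size)
    # with composition (norm.factors), i.e. signal-to-noise differences:
    eff_libsize <- dge$samples$lib.size * dge$samples$norm.factors
    eff_libsize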

By the way, when you use bamCoverage, use bin sizes < 10 bp, or better yet a bin size of 1. The default of 50 produces ugly-looking peaks as it averages the counts over too large a window (just my opinion).

Perfect!

I used TMM normalization with edgeR/limma for my analysis. What I was searching for is a way to visualize in a data track what was being compared. I will try some of the code examples that you linked. As for the bin size, I agree with you; I was just concerned about the size of the output files.

Thanks again

In that case you could directly use the normalization factors from the DGEList object. Be sure to use their reciprocal values when feeding them into the --scaleFactor option, as suggested in the linked answer.
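A sketch of what that could look like, continuing from the hypothetical `dge` object above (file names are placeholders, and --binSize 1 follows the suggestion earlier in this thread):

    # A norm.factor > 1 marks an effectively larger library whose coverage
    # must be scaled down, hence the reciprocal:
    sf <- 1 / dge$samples$norm.factors
    names(sf) <- colnames(dge)

    # Emit one bamCoverage call per sample:
    for (s in names(sf)) {
      cat(sprintf("bamCoverage -b %s.bam -o %s.bw --binSize 1 --scaleFactor %.4f\n",
                  s, s, sf[s]))
    }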
