Normalize BigWigs for number of reads in peaks and sequencing depth
Zeel • 4.2 years ago

My first question here.

As mentioned by Corces et al. (Science. 2018 Oct 26;362(6413)), differences in the quality of ATAC-seq experiments result in varying percentages of reads in peaks. When comparing samples using only depth normalization (i.e. CPM), the "background" reads are weighted the same as reads falling within peaks. Therefore, samples with high background are artificially depressed.

I was wondering if we could use deepTools bamCoverage to normalize by reads in peaks and sequencing depth at the same time. Could we achieve that by blacklisting the background regions, setting the effective genome size to the total length of the analysed peaks, and multiplying by a scale factor? Something like this:

bamCoverage -p 4 --bam input.bam -o output.bw --binSize 50 --scaleFactor 30 --blackListFileName Background.bed --normalizeUsing CPM --ignoreDuplicates --minMappingQuality 30 --effectiveGenomeSize 108645043 --ignoreForNormalization chrX chrY chrM --extendReads
(where 108645043 is the total length of the analysed peaks)

Moreover, is --scaleFactor applied before or after the scaling from --normalizeUsing CPM? Are these two different scale factors?

Thanks

Tags: deepTools, ATAC-seq
ATpoint • 4.2 years ago

What this text snippet probably describes are differences in signal-to-noise ratio, which are quite common in NGS. Per-million normalization alone is often insufficient to correct for this. This is why naive normalization techniques such as RPKM (or any other simple per-million method) regularly fail in benchmarking studies when it comes to comparing different samples with each other. CPM in bamCoverage does exactly this. --scaleFactor lets you enter a custom scaling factor, while CPM calculates one based on total read counts; they cannot be used simultaneously.
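To make concrete what CPM corrects (and what it misses): the CPM factor is simply one million divided by the total mapped reads, so two equally deep libraries get the same factor even if their fractions of reads in peaks differ. A minimal sketch in R, with made-up numbers:

    # Hypothetical: two libraries of equal depth but different signal-to-noise
    total_reads   <- c(sampleA = 30e6, sampleB = 30e6)
    frac_in_peaks <- c(sampleA = 0.60, sampleB = 0.30)

    cpm_factor <- 1e6 / total_reads  # identical for both samples

    # Peak signal after CPM scaling: sampleA ends up with twice the
    # signal of sampleB despite equal sequencing depth, so the
    # high-background sample looks artificially depressed.
    total_reads * frac_in_peaks * cpm_factor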

Some people say that for visualization simple per-million correction might be acceptable. I disagree; see the link below for details and an example of why this can produce flawed results. Instead, use a more sophisticated method such as TMM from edgeR to account for both library size (= total read counts) and composition (differences in signal-to-noise ratio). For code examples see:

A: ATAC-seq sample normalization (quantile normalization)
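For orientation, here is a minimal sketch of that TMM route, assuming you already have a peak-by-sample count matrix (e.g. from featureCounts on a consensus peak set; the object name `counts` is a placeholder):

    library(edgeR)

    # counts: integer matrix, rows = consensus peaks, columns = samples
    dge <- DGEList(counts = counts)
    dge <- calcNormFactors(dge, method = "TMM")

    # The effective library size combines sequencing depth (lib.size)
    # with composition (norm.factors), i.e. signal-to-noise differences:
    eff_libsize <- dge$samples$lib.size * dge$samples$norm.factors
    eff_libsize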

By the way, when you use bamCoverage, use bin sizes < 10 bp, or better yet a bin size of 1. The default of 50 produces ugly-looking peaks as it averages the counts over too large a window (just my opinion).

Perfect!

I used TMM normalization with edgeR/limma for my analysis. What I was searching for is a way to visualize in a data track what was being compared. I will try some of the code examples that you linked. As for the bin size, I agree with you; I was just concerned about the size of the output files.

Thanks again

In that case you could directly use the normalization factors from the DGEList object. Be sure to use their reciprocal values when feeding them into the --scaleFactor option, as suggested in the linked answer.
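A sketch of what that could look like, continuing from the hypothetical `dge` object above (file names are placeholders, and --binSize 1 follows the suggestion earlier in this thread):

    # A norm.factor > 1 marks an effectively larger library whose coverage
    # must be scaled down, hence the reciprocal:
    sf <- 1 / dge$samples$norm.factors
    names(sf) <- colnames(dge)

    # Emit one bamCoverage call per sample:
    for (s in names(sf)) {
      cat(sprintf("bamCoverage -b %s.bam -o %s.bw --binSize 1 --scaleFactor %.4f\n",
                  s, s, sf[s]))
    }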
