Bigwig files are useful both for visualizing NGS experiments such as ATAC-seq or ChIP-seq in a genome browser and for plotting profiles and heatmaps around genomic regions of interest.
For this one typically normalizes the bigwig files to correct for sequencing depth. Still, a simple correction for depth may not be sufficient to account for changes in library composition. If the terms library composition and normalization are new to you, check out these videos (TMM normalization, DESeq2 normalization).
Normalization techniques that account for both depth and composition are e.g. TMM from edgeR or RLE from DESeq2, both assuming that the majority of regions does not change between samples.
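To see why depth alone is not enough, consider a toy example with made-up counts: region r1 is truly unchanged between two samples, but sample B gains one very strong peak that doubles its total read count. Naive per-million scaling then makes r1 look depleted in B:

```shell
awk 'BEGIN {
  ## made-up counts: region r1 is truly unchanged between samples
  a_r1 = 100; a_total = 400   # sample A: four similar regions
  b_r1 = 100; b_total = 800   # sample B: one huge extra peak doubles the total
  ## naive depth scaling divides by the total read count only:
  printf "A: %.3f  B: %.3f\n", a_r1 / a_total, b_r1 / b_total
}'
## A: 0.250  B: 0.125
```

r1 appears halved in sample B although nothing changed there. TMM and RLE estimate their factors from the bulk of (assumed unchanged) regions and therefore correct for this kind of composition shift.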
This tutorial provides example code on how to obtain TMM size factors and use them to create normalized bigwigs starting from BAM files, essentially using only edgeR (within R) and bamCoverage from deeptools. We assume a standard ATAC-seq experiment.
last update: 21.08.2019 => Added a sentence about choosing between peaks and large bins for normalization, as suggested in the csaw manual.
1. Create a count matrix
TMM calculation in edgeR starts from a count matrix with columns being the samples and rows being the genes / regions.
For this we need a file with reference regions that we base our normalization on. This could, for example, be obtained by calling peaks on the BAM files we are interested in (e.g. using macs2 or genrich) followed by merging the peak regions into a consensus list (e.g. using bedtools merge). Alternatively, as suggested in the manual of the csaw package, one might use large bins of e.g. 10kb across the genome as reference regions if one expects the changes between samples to be large; see that manual for details. Here we assume we already have such a reference file, called reference_peaks.bed.
We use featureCounts from subread to count reads per peak for every sample. For this we have to convert the BED file into the SAF format that featureCounts expects.
## BED (0-based) to SAF (1-based):
awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "+"}' reference_peaks.bed > reference_peaks.saf
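A quick sanity check of the conversion on a made-up one-line BED file shows the 0-based start shifting to 1-based:

```shell
## toy.bed is a hypothetical single-region BED file for illustration
printf 'chr1\t999\t2000\n' > toy.bed
awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "+"}' toy.bed
## chr1_1000_2000  chr1  1000  2000  +
```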
## quantify bam files over peaks:
featureCounts -a reference_peaks.saf -F SAF -o reference_peaks.countmatrix *.bam
2. Load into R to calculate scaling factors
require(data.table)
require(edgeR)
## turn off exponential notation to avoid rounding errors
options(scipen=999)
## Load count matrix, removing all columns that we are not interested in:
raw.counts <- fread("reference_peaks.countmatrix", data.table=FALSE, header=TRUE, skip=1)
raw.counts <- raw.counts[, 7:ncol(raw.counts)]
## the effective library size is the product
## of the TMM factor and the raw library size for each sample:
norm.factor <- calcNormFactors(object = raw.counts, method = "TMM")
lib.size <- colSums(raw.counts)
final.factor <- norm.factor * lib.size
## as one typically scales reads "per million" we divide this factor by 10^6.
## We also take the reciprocal value for use with bamCoverage later,
## see *comment below:
perMillion.factor <- (final.factor / 1000000)^-1
## write to disk:
write.table(x = data.frame(Sample = names(perMillion.factor),
NormFactor = perMillion.factor),
file = "normFactors.txt", sep="\t", quote = FALSE,
col.names = TRUE, row.names = FALSE)
Content of the normFactors.txt
file could be:
Sample NormFactor
sample1.bam 0.272206963159581
sample2.bam 0.463378223152849
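To double-check the arithmetic behind such numbers with hypothetical values: for a TMM factor of 0.95 and a library size of 20 million reads, the scale factor is 10^6 / (0.95 * 2*10^7):

```shell
awk 'BEGIN {
  tmm = 0.95; libsize = 20000000            # hypothetical TMM factor and library size
  printf "%.6f\n", 1000000 / (tmm * libsize)  # reciprocal per-million factor
}'
## 0.052632
```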
3. Use bamCoverage to create the normalized bigwigs
for i in *.bam
do
  ## extract the norm factor for the current sample from the file we created in R
  ## (exact match on column 1, so that e.g. sample1.bam cannot also match sample10.bam):
  NormFactor=$(awk -v s="${i}" '$1 == s {print $2}' normFactors.txt)
## use bamCoverage:
bamCoverage -b $i -o ${i%.bam}_TMM.bigwig -bs 1 --scaleFactor ${NormFactor}
done
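A caveat on the lookup: a plain grep for sample1.bam would also match a hypothetical sample10.bam, silently picking the wrong factor. Matching the first column exactly avoids this; a quick toy check (made-up table name and values):

```shell
## made-up lookup table with two similarly named samples
printf 'Sample\tNormFactor\nsample1.bam\t0.2722\nsample10.bam\t0.4634\n' > toyFactors.txt
awk -v s="sample1.bam" '$1 == s {print $2}' toyFactors.txt
## 0.2722
```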
Basically all recent benchmarking papers comparing normalization techniques have shown that naive per-million methods are inferior to e.g. TMM and RLE, as they fail to account for composition changes. Given the relatively low effort to run the above code, I have personally made this the default method in my workflows (given that the assumption that most regions do not change between conditions holds). Feel free to wrap this code into a script or workflow to make execution more convenient.
*Comment on the R code above: The reason we calculated the reciprocal value in R is that --scaleFactor is used to multiply each value of the bigwig, while the TMM-derived factors are intended for division, that's all.
I am happy to receive feedback of any kind, feel free to comment.