Bigwig files are useful both for visualizing NGS experiments such as ATAC-seq or ChIP-seq in a genome browser and for plotting profiles and heatmaps around genomic regions of interest.
For this one typically normalizes the bigwig files to correct for sequencing depth. Still, a simple correction for depth may not be sufficient to account for changes in library composition. If the terms library composition and normalization are new to you, check out these videos (TMM normalization, DESeq2 normalization).
Normalization techniques that account for both depth and composition are e.g. TMM from edgeR or RLE from DESeq2, both assuming that the majority of regions does not change between samples.
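To see why depth alone is not enough, consider a toy example with made-up counts: region r1 is truly unchanged between two samples, but sample B gains one very strong peak that doubles its total read count. Naive per-million scaling then makes r1 look depleted in B:

```shell
awk 'BEGIN {
  ## made-up counts: region r1 is truly unchanged between samples
  a_r1 = 100; a_total = 400   # sample A: four similar regions
  b_r1 = 100; b_total = 800   # sample B: one huge extra peak doubles the total
  ## naive depth scaling divides by the total read count only:
  printf "A: %.3f  B: %.3f\n", a_r1 / a_total, b_r1 / b_total
}'
## A: 0.250  B: 0.125
```

r1 appears halved in sample B although nothing changed there. TMM and RLE estimate their factors from the bulk of (assumed unchanged) regions and therefore correct for this kind of composition shift.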
This tutorial provides example code on how to obtain TMM size factors and use them to create normalized bigwigs starting from BAM files, essentially using only edgeR (within R) and bamCoverage from deeptools. We assume a standard ATAC-seq experiment.
last update: 21.08.2019 => Added a sentence about choosing between peaks and large bins for normalization, as suggested in the csaw manual.
1. Create a count matrix
TMM calculation in edgeR starts from a count matrix with columns being the samples and rows being the genes / regions.
For this we need a file with reference regions that we base our normalization on. This could, for example, be obtained by calling peaks on the BAM files we are interested in (e.g. using macs2 or genrich) followed by merging the peak regions into a consensus list (e.g. using bedtools merge). Alternatively, as suggested in the manual of the csaw package, one might use large bins of e.g. 10kb across the genome as reference regions if one expects the changes between samples to be large; see that manual for details. Here we assume we already have such a reference file, called reference_peaks.bed.
We use featureCounts from subread to count reads per peak for every sample. For this we have to convert the BED file into the SAF format that featureCounts expects.
## BED (0-based) to SAF (1-based):
awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "+"}' reference_peaks.bed > reference_peaks.saf
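A quick sanity check of the conversion on a made-up one-line BED file shows the 0-based start shifting to 1-based:

```shell
## toy.bed is a hypothetical single-region BED file for illustration
printf 'chr1\t999\t2000\n' > toy.bed
awk 'OFS="\t" {print $1"_"$2+1"_"$3, $1, $2+1, $3, "+"}' toy.bed
## chr1_1000_2000  chr1  1000  2000  +
```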
## quantify bam files over peaks:
featureCounts -a reference_peaks.saf -F SAF -o reference_peaks.countmatrix *.bam
2. Load into R to calculate scaling factors
require(data.table)
require(edgeR)
## turn off exponential notation to avoid rounding errors
options(scipen=999)
## Load count matrix, removing all columns that we are not interested in:
raw.counts <- fread("reference_peaks.countmatrix", data.table=FALSE, header=TRUE, skip=1)
raw.counts <- raw.counts[, 7:ncol(raw.counts)]
## the effective library size is the product
## of the TMM factor and the raw library size for each sample:
norm.factor <- calcNormFactors(object = raw.counts, method = "TMM")
lib.size <- colSums(raw.counts)
final.factor <- norm.factor * lib.size
## as one typically scales reads "per million" we divide this factor by 10^6.
## We also take the reciprocal value for use with bamCoverage later,
## see *comment below:
perMillion.factor <- (final.factor / 1000000)^-1
## write to disk:
write.table(x = data.frame(Sample = names(perMillion.factor),
NormFactor = perMillion.factor),
file = "normFactors.txt", sep="\t", quote = FALSE,
col.names = TRUE, row.names = FALSE)
Content of the normFactors.txt
file could be:
Sample NormFactor
sample1.bam 0.272206963159581
sample2.bam 0.463378223152849
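To double-check the arithmetic behind such numbers with hypothetical values: for a TMM factor of 0.95 and a library size of 20 million reads, the scale factor is 10^6 / (0.95 * 2*10^7):

```shell
awk 'BEGIN {
  tmm = 0.95; libsize = 20000000            # hypothetical TMM factor and library size
  printf "%.6f\n", 1000000 / (tmm * libsize)  # reciprocal per-million factor
}'
## 0.052632
```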
3. Use bamCoverage to create the normalized bigwigs
for i in *.bam
do
  ## extract the norm factor for the current sample from the file we created in R
  ## (exact match on column 1, so that e.g. sample1.bam cannot also match sample10.bam):
  NormFactor=$(awk -v s="${i}" '$1 == s {print $2}' normFactors.txt)
## use bamCoverage:
bamCoverage -b $i -o ${i%.bam}_TMM.bigwig -bs 1 --scaleFactor ${NormFactor}
done
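A caveat on the lookup: a plain grep for sample1.bam would also match a hypothetical sample10.bam, silently picking the wrong factor. Matching the first column exactly avoids this; a quick toy check (made-up table name and values):

```shell
## made-up lookup table with two similarly named samples
printf 'Sample\tNormFactor\nsample1.bam\t0.2722\nsample10.bam\t0.4634\n' > toyFactors.txt
awk -v s="sample1.bam" '$1 == s {print $2}' toyFactors.txt
## 0.2722
```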
Basically all recent benchmarking papers comparing normalization techniques have shown that naive per-million methods are inferior to e.g. TMM and RLE, as they fail to account for composition changes. Given the relatively low effort to run the above code, I have personally made this the default method in my workflows (given that the assumption that most regions do not change between conditions holds). Feel free to wrap this code into a script or workflow to make execution more convenient.
*Comment on the R code above: The reason we calculated the reciprocal value in R is that --scaleFactor is used to multiply each value of the bigwig, while the TMM-derived factors are intended for division, that's all.
I am happy to receive feedback of any kind, feel free to comment.