Tutorial: Basic normalization, batch correction and visualization of RNA-seq data
Written 11 days ago by ATpoint

This extensive tutorial is intended to provide some basics towards initial quality control (QC), normalization, batch correction and visualization of NGS count data. The example data are RNA-seq, but the principles also apply to other NGS assays, such as ATAC-seq or ChIP-seq.

The data we use are from GSE124167. The authors performed RNA-seq on treated and untreated mice in biological triplicate, split the obtained RNA into three parts and prepared NGS libraries using three different kits. We can regard the three kits as three different "batches". The kits will be called "Pico", "Illumina" and "V4" in this tutorial; the treatments are simply "treated" and "untreated".

If you want to obtain the raw data, please download the three supplementary files (at the bottom of the page) from this GEO entry, which belong to the publication Sarantopoulou et al. After downloading the three files, switch (from inside R) into the directory that contains them (using setwd()) and run the code below to create the count matrix used throughout this tutorial:
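A minimal sketch of such a step could look like the following. Note that this is my own illustration, not the original code: the file pattern and table layout are assumptions (tab-delimited tables with gene IDs in the first column and identical gene order across files) and may need adjusting to the actual GEO files.

```r
## Hypothetical sketch -- file pattern and layout are assumptions, not taken
## from GEO: each supplementary file is assumed to be a tab-delimited table
## with gene IDs in the first column and identical gene order across files.
read_counts <- function(files) {
  tables <- lapply(files, function(f)
    read.delim(f, row.names = 1, check.names = FALSE))
  as.matrix(do.call(cbind, tables))
}

## e.g.:
## counts <- read_counts(list.files(pattern = "\\.txt\\.gz$"))
```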

Let's start:

1. Data normalization and visualization

NGS data always require normalization. See for example this StatQuest video explaining the edgeR normalization.


library(edgeR)

## construct a DGEList, which is edgeR's standard input format
y <- DGEList(counts = counts)

## add metadata based on colnames(counts): this is the information about
## 1) the kit (Pico, Illumina, V4) and 2) the treatment (treated, untreated):
y$samples$kit       <- c(rep("Pico", 6), rep("Illumina", 6), rep("V4", 6))
y$samples$treatment <- rep(c(rep("treated", 3), rep("untreated", 3)), 3)

## normalize using TMM
y <- calcNormFactors(y)

## obtain normalized data on the log2-scale
logCPMs <- cpm(y, log = TRUE, prior.count = 1)

It might make sense to check whether the normalization worked properly. A common assumption in differential analysis is that most genes do not change, hence have a fold change of nearly zero. We can visualize this using pairwise MA-plots. An MA-plot is a scatterplot with the average expression between two samples on the x-axis, and the fold change between the two samples on the y-axis, both on the log2-scale. The below code will produce an MA-plot, plotting the average of the Pico_treated versus the average of the Pico_untreated group as an example. For that we simply average the logCPMs and logFCs of the three samples per group. The function for the MA-plots in the below code (ggMAplot()) is a custom wrapper script available here, but basically it comes down to: plot(x = logCPM, y = logFC).

The function ggMAplot() simply gives a nicer image using ggplot2 and has some options to customize results which we are going to use later to visualize the differential analysis results.

## Pico treated (columns 1:3) vs untreated (columns 4:6) as an example
average.logCPM.Pico <- rowMeans(logCPMs[, 1:6])  # average over both groups
average.logFC.Pico  <- rowMeans(logCPMs[, 1:3]) - rowMeans(logCPMs[, 4:6])

## Now use the function ggMAplot to visualize results.


gg1 <- 
ggMAplot(Xval = average.logCPM.Pico,
         Yval = average.logFC.Pico,
         basic.Color = "black",
         title.Text = "MA-plot: Pico (treated) vs Pico (untreated)",
         subtitle.Text = "counts normalized with TMM from edgeR",
         Ylim = c(-6,6),
         use.PointDensity = TRUE)



As we can see, the bulk of the data cloud is centered along y = 0, indicating proper normalization. Feel free to run the above code on the un-normalized but log2(x+1)-transformed data to explore how MA-plots look in that case.

The user is now free to check the averaged groups against each other, or even to produce plots for all pairwise single-sample combinations. If doing the latter, I recommend the smoothScatter() function from base R, which produces plots similar to the example above but notably faster and smaller in file size, since smoothScatter() does not draw individual data points but density estimates for the entire dataset. It is also recommended to wrap such an approach into a function that automates the plotting rather than writing code for every single comparison between samples.
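As an illustration, such a wrapper could look like this (a sketch using only base R; the function name and arguments are made up for this tutorial, not part of any package):

```r
## Minimal MA-plot wrapper around smoothScatter(); names are illustrative.
maplot_smooth <- function(logCPMs, i, j, ...) {
  A <- 0.5 * (logCPMs[, i] + logCPMs[, j])   # average log2 expression
  M <- logCPMs[, i] - logCPMs[, j]           # log2 fold change
  smoothScatter(A, M, xlab = "average logCPM", ylab = "logFC",
                main = paste(colnames(logCPMs)[i], "vs", colnames(logCPMs)[j]),
                ...)
  abline(h = 0, lty = 2)
}

## e.g. all pairwise single-sample combinations:
## cmb <- combn(ncol(logCPMs), 2)
## invisible(apply(cmb, 2, function(p) maplot_smooth(logCPMs, p[1], p[2])))
```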

Note: In my experience with RNA-seq the normalization typically works pretty well with this standard approach. Other assays such as ATAC-seq and especially ChIP-seq might require different approaches. The reader is referred to the manual of the Bioconductor package csaw which extensively discusses alternative normalization approaches including bin-based and loess normalization.

2. Principal Component Analysis (PCA)

The next step after assessing successful normalization should be a PCA to check the data for potential batch effects. The reader is referred to the introduction of the PCAtools package for some background information on PCA. We will use this package below.

PCA is typically performed on the normalized and log2-transformed data. One also typically chooses a subset of genes for PCA. This subset should contain the most informative genes. A straightforward approach is to use, e.g., the 1000 genes with the highest variance between samples as a proxy for genes changing between conditions (and therefore suited to separate samples in the PCA).


## take the top 1000 most variable genes. What we obtain here is the row index of the most variable genes.
## For this we use rowVars() to get the row-wise variance of the logCPMs and then use head on the 
## decreasingly-ordered output.
top1000.order <- head(order(matrixStats::rowVars(logCPMs), decreasing = TRUE), 1000)

## Now run PCA:
## PCAtools:
PCA <- PCAtools::pca(mat = logCPMs[top1000.order,])

## add metadata from DGEList (the information which sample belongs to which treatment/kit):
PCA$metadata <- y$samples

## make a biplot:
bp1 <- 
PCAtools::biplot(pcaobj = PCA, 
                 colby = "kit",
                 shape = "treatment", 
                 legendPosition = "bottom",
                 title = "PCA on the top-1000 most variable logCPMs",
                 subtitle = "no batch correction was applied")



As we can see, most of the variation in the data actually comes from the library prep kit rather than the treatment. PC1 explains 53.31% of the variance and is driven by the different kits, while PC2 is driven by the treatment, explaining 22.81% of the total variance.

Note: This result nicely illustrates why it is bad practice to combine samples from different sources. Biostars users often ask whether they could combine e.g. cancer samples from study A and normal samples from study B. The answer is typically no since these results would be dominated by batch effects from e.g. different lab protocols.

In our case we can attempt to correct for the batch effect since we have replicates of both the treated and untreated samples in all three batches (= the kits). We could not do this if kit were nested within treatment, i.e. if all treated samples had been prepared with e.g. the Pico kit and all untreated samples with the Illumina kit. That situation would be identical to the aforementioned example with all cancer samples coming from study A and all normals from study B: cancer/normal would then be confounded with study.
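This confounding can be made tangible with model.matrix() alone: in a fully nested design the design matrix is not of full rank, so the batch coefficient cannot be estimated separately from the treatment. A small sketch with toy metadata (base R only; the sample layout is invented for illustration):

```r
## fully confounded: all treated samples from Pico, all untreated from Illumina
meta.confounded <- data.frame(treatment = rep(c("treated", "untreated"), each = 3),
                              kit       = rep(c("Pico", "Illumina"), each = 3))
X.confounded <- model.matrix(~ treatment + kit, data = meta.confounded)
qr(X.confounded)$rank < ncol(X.confounded)   # TRUE: batch not estimable

## balanced as in this tutorial: both treatments present in every kit
meta.balanced <- data.frame(treatment = rep(c("treated", "untreated"), 3),
                            kit       = rep(c("Pico", "Illumina", "V4"), each = 2))
X.balanced <- model.matrix(~ treatment + kit, data = meta.balanced)
qr(X.balanced)$rank == ncol(X.balanced)      # TRUE: full rank
```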

Let's use limma::removeBatchEffect() to attempt batch correction. Note that this is only a diagnostic: as explained later, batch removal during the differential analysis is handled differently, by including the batch information in the experimental design.


## see whether we can remove the kit-associated batch effect.
logCPM.batchRemoved <- limma::removeBatchEffect(x = logCPMs,
                                                batch = y$samples$kit)

## PCA on batch-corrected data using the exact same genes as before but on the batch-corrected logCPMs:
PCA.batchRemoved <- PCAtools::pca(mat = logCPM.batchRemoved[top1000.order,])

## transfer metadata as before:
PCA.batchRemoved$metadata <- y$samples

## repeat biplot:
bp2 <- 
PCAtools::biplot(pcaobj = PCA.batchRemoved, 
                 colby = "kit", 
                 shape = "treatment", 
                 legendPosition = "bottom",
                 title = "PCA on the top-1000 most variable logCPMs",
                 subtitle = "batch correction with removeBatchEffect()")



As we can see the batch correction seems to be beneficial here. The samples now cluster mostly by treatment rather than kit (=batch).

The batch-corrected logCPMs (or arithmetic CPMs after undoing the log2 via 2^logCPM.batchRemoved - 1) can be useful for downstream analyses such as clustering or plotting functions.

Since limma::removeBatchEffect() might introduce negative values during the batch correction, which can confuse e.g. plotting functions that expect 0 as the smallest value, I recommend checking out the ComBat_seq() function from the Bioconductor package sva. It operates directly on the raw counts and returns batch-corrected raw counts, preserving the integer nature of the data and keeping zero as the smallest value. These corrected counts can then be transformed to CPMs with the same code as above (the calcNormFactors and cpm functions).
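A sketch of how this could look (assuming the sva package is installed; passing the biological condition via the group argument helps protect the treatment signal during correction):

```r
## Sketch, assuming the Bioconductor package sva is installed:
library(sva)

counts.batchRemoved <- ComBat_seq(counts = as.matrix(counts),
                                  batch  = y$samples$kit,
                                  group  = y$samples$treatment)

## then proceed exactly as in section 1:
## y2 <- calcNormFactors(DGEList(counts = counts.batchRemoved))
## logCPMs.combat <- cpm(y2, log = TRUE, prior.count = 1)
```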

3. Differential testing

The below code will perform a simple DE analysis with edgeR, including batch in the design. For the sake of this tutorial we will pretend that the goal of the DE testing is the comparison of the treated vs. untreated condition, with the kits as the batch effect. Of course the authors of the study used different kits on purpose, since the paper was about differences in library prep methods, but we simply pretend this "batch" was an unwanted accident.

First we do DE testing without batch in the design, then repeat with batch included:

## make a design with only the treatment condition:
Design.noBatch <- model.matrix(~0 + treatment, data = y$samples)
colnames(Design.noBatch) <- gsub("treatment", "", colnames(Design.noBatch))

## make a contrast:
Contrast.noBatch <- makeContrasts(Treated_vs_Untreated = (treated - untreated), 
                                  levels = Design.noBatch)

## default edgeR workflow (filterByExpr, dispersion, model fit, QLF test, topTags):
y.noBatch     <- y[filterByExpr(y, design = Design.noBatch),]
y.noBatch     <- estimateDisp(y.noBatch, design = Design.noBatch)
y.noBatch$fit <- glmQLFit(y.noBatch, design = Design.noBatch)
y.noBatch$qlf <- glmQLFTest(glmfit = y.noBatch$fit, contrast = Contrast.noBatch[,1]) 
y.noBatch$tt  <- topTags(y.noBatch$qlf, n=Inf)$table

## Illustrate with an MA-plot that highlights significant genes:
gg2 <- 
ggMAplot(Xval = y.noBatch$tt$logCPM, 
         Yval = y.noBatch$tt$logFC, 
         Pval = y.noBatch$tt$FDR, 
         Pval.Threshold = 0.05,
         title.Text = "Treated vs Untreated", 
         subtitle.Text = "no batch correction")



Number of genes below the significance threshold:

summary(y.noBatch$tt$FDR < 0.05)
Mode   FALSE    TRUE 
logical   10001    2029

We see that 2029 genes are called significant here at FDR = 0.05. Let's repeat this with batch included in the design.

## make a design (kit is batch in this case):
Design.BatchInDesign <- model.matrix(~0 + treatment + kit, data = y$samples)
colnames(Design.BatchInDesign) <- gsub("treatment", "", colnames(Design.BatchInDesign))

## make a contrast:
Contrast.BatchInDesign <- makeContrasts(Treated_vs_Untreated = (treated - untreated), 
                                        levels = Design.BatchInDesign)

## default edgeR workflow (filterByExpr, dispersion, model fit, QLF test, topTags):
y.BatchInDesign     <- y[filterByExpr(y, design = Design.BatchInDesign),]
y.BatchInDesign     <- estimateDisp(y.BatchInDesign, design = Design.BatchInDesign)
y.BatchInDesign$fit <- glmQLFit(y.BatchInDesign, design = Design.BatchInDesign)
y.BatchInDesign$qlf <- glmQLFTest(glmfit = y.BatchInDesign$fit, contrast = Contrast.BatchInDesign[,1]) 
y.BatchInDesign$tt  <- topTags(y.BatchInDesign$qlf, n=Inf)$table

## Illustrate with an MA-plot that highlights significant genes:
gg3 <- 
ggMAplot(Xval = y.BatchInDesign$tt$logCPM, 
         Yval = y.BatchInDesign$tt$logFC, 
         Pval = y.BatchInDesign$tt$FDR, 
         Pval.Threshold = 0.05,
         title.Text = "Treated vs Untreated", 
         subtitle.Text = "batch information in the design")



Genes below the significance threshold of 0.05:

summary(y.BatchInDesign$tt$FDR < 0.05)
Mode   FALSE    TRUE 
logical    6679    6283

As we can observe, we now have > 6000 genes below the FDR threshold of 0.05, compared to ~2000 genes when batch was not accounted for. This makes sense, since we already saw in the initial PCA plots that batch correction removed much of the kit-associated differences, which had introduced unwanted technical variation into the data.

Beyond MA-plots, one might be interested in visualizing the relationship between logFC and significance. Volcano plots are suitable for this, e.g. via the EnhancedVolcano package.
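For example (a sketch, assuming the EnhancedVolcano package is installed; the column names follow the edgeR topTags output used above):

```r
## Sketch, assuming the Bioconductor package EnhancedVolcano is installed:
library(EnhancedVolcano)

EnhancedVolcano(toptable = y.BatchInDesign$tt,
                lab      = rownames(y.BatchInDesign$tt),
                x        = "logFC",
                y        = "FDR",
                pCutoff  = 0.05,
                title    = "Treated vs Untreated",
                subtitle = "batch information in the design")
```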

This tutorial is neither comprehensive, nor does it claim to be a gold standard; rather, it illustrates my personal best practices.

Comments, suggestions and constructive feedback very welcome.
