Question

Tutorial:Basic normalization, batch correction and visualization of RNA-seq data

50

Entering edit mode

3.6 years ago

ATpoint 82k

This tutorial walks the reader through a basic differential analysis workflow with edgeR, including basic QC via PCA and and correcting for batch effects. The example data are RNA-seq, but the principles also apply to other NGS assays, such as ATAC-seq or ChIP-seq.

The data we use are from GSE124167. The authors performed RNA-seq on treated and untreated mice in a biological triplicate, then splitted the obtained RNA into three parts and prepared NGS libraries using three different kits. We can see the three kits as three different "batches". The kits will be called "Pico", "Illumina" and "V4" in this tutorial, the treatments are just "treated" and "untreated".

Use the code chunk below to read the data directly from GEO and to bring it into a format ready for analysis in edgeR:

Load data

Setup edgeR

Now that we have the counts object we can start with edgeR:


# BiocManager::install("edgeR")
# install.packages(c("dplyr", "ggplot2"))

library(edgeR)
library(ggplot2)
library(dplyr)

# Make a DGEList and add metadata
y <- DGEList(counts = counts)

y$samples$kit       <- c(rep("Pico", 6), rep("Illumina", 6), rep("V4", 6))
y$samples$treatment <- rep(c(rep("treated", 3), rep("untreated", 3)), 3)

# Normalize and obtain logcounts for QC
y <- calcNormFactors(y)
logCPMs <- cpm(y, log = TRUE)

PCA

PCA is a useful technique for initial QC that allows to diagnose obvious batch effects. While wrapper packages for PCA exist, we use the simple prcomp() function in R for this which is fast and efficient, yet simple to code up. For PCA one commonly uses the most variable genes in the dataset, here we use the top-1000 most variable genes selected based on rowwise variance:


# Calculate rowwise variance
rv <- apply(logCPMs, 1, var)

# Sort decreasingly and take top 1000
o <- order(rv, decreasing=TRUE)
top1000 <- head(o, 1000)

# From the logCPMs subset for the top-1000
logCPM_top1000 <- logCPMs[top1000,]

# Run PCA
pca <- prcomp(t(logCPM_top1000))

# Combine PCA coordinates with the metadata from the DGEList
to_plot <- data.frame(pca$x, y$samples)

# Calculate how many % of total variance is explained by each principal component
percentVar <- pca$sdev^2 / sum( pca$sdev^2 )*100

# We focus here on PC1 and PC2
use.pcs <- c(1,2)
labs <- paste0(paste0("PC", use.pcs, " - "), paste0("Var.expl = ", round(percentVar[use.pcs], 2), "%"))

ggplot(to_plot, aes(x=PC1, y=PC2, color=kit, shape=treatment)) + 
  geom_point(size=3) +
  xlab(labs[1]) + ylab(labs[2])

enter image description here

As we can see the most variation (that is PC1) in the data actually comes from the library prep kit rather than the treatment. PC1 explains more than 50% of the variance while PC2 explains about 20% of the total variance and is driven by the treatment. This variantion in PC1 which is not treaatment but choice of kit is a so-called "batch effect". Since each condition (treated and untreated) has samples of all three kits we can correct for this batch effect.

For the differential analysis in edgeR this is typically done by including the batch effect into the experimental design (see below). For visual inspection we can first use removeBatchEffect() from limma and then repeat the PCA.


# correct for the batch which here is the "kit"
batch <- factor(y$samples$kit)

logCPMs_corrected <- limma::removeBatchEffect(logCPMs, batch = batch)

# repeat PCA as before, using the same genes
logCPM_corrected_top1000 <- logCPMs_corrected[top1000,]

# Run PCA
pca <- prcomp(t(logCPM_corrected_top1000))

# Combine PCA coordinates with the metadata from the DGEList
to_plot <- data.frame(pca$x, y$samples)

# Calculate how many % of total variance is explained by each principal component
percentVar <- pca$sdev^2 / sum( pca$sdev^2 )*100

# We focus here on PC1 and PC2
use.pcs <- c(1,2)
labs <- paste0(paste0("PC", use.pcs, " - "), paste0("Var.expl = ", round(percentVar[use.pcs], 2), "%"))

ggplot(to_plot, aes(x=PC1, y=PC2, color=kit, shape=treatment)) + 
  geom_point(size=3) +
  xlab(labs[1]) + ylab(labs[2])

enter image description here

The batch (=kit) effect is removed and the treatment is now the major source of variation.

Differential expression

Now run a typical edgeR analysis, accounting for the batch effect, see also the edgeR user guide. We visualize the results with two common types of plots. First the MA-plot which is average expression versus fold change and second the Volcano-plot which is fold change versus pvalue.


# Design accounting for kit (=batch)
design <- model.matrix(~kit+treatment, y$samples)

# QLF workflow from edgeR
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# see head(design) => the fourth column is treatment which is what we want to test
fit  <- glmQLFTest(fit, coef = 4)

# get stats as a data.frame
tt <- data.frame(topTags(fit, n=Inf))

# Classify genes into significantly up and down
tt_modified <- tt %>% 
               mutate(status=factor(case_when(logFC>0 & FDR<0.05 ~ "up",
                                              logFC<0 & FDR<0.05 ~ "down",
                                              TRUE ~ "not.signif"),
                                    levels=c("up", "not.signif", "down")))

# MA-plot
ggplot(tt_modified, aes(x=logCPM, y=logFC, color=status)) +
  geom_point(size=1) +
  scale_color_manual(values=c("firebrick", "grey", "dodgerblue")) +
  ggtitle("MA plot")

# Volcano (logFC vs -log10(pvalue -- I prefer FDR))
ggplot(tt_modified, aes(x=logFC, y=-log10(FDR), color=status)) +
  geom_point(size=1) +
  scale_color_manual(values=c("firebrick", "grey", "dodgerblue")) +
  ggtitle("Volcano-plot")

enter image description here

edger RNA-seq • 8.7k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 3.6 years ago by ATpoint 82k

0

Entering edit mode

Here is the code as a single Rmarkdown document.

ADD REPLY • link 16 months ago by ATpoint 82k