Question: How to graphically tell if data has been normalized?
19 months ago by Arindam Ghosh (India) wrote:

Is there a method to tell if my RNA-seq data has been normalized? I used DESeq2 to normalize my feature counts and then plotted the log-transformed normalized counts as box plots. Is this a good way to check? I had referred to some articles that used this method: for median normalization the median lines should be at the same level, and for upper-quartile normalization the third quartiles should be at the same level. DESeq2, however, uses the RLE method, so how can I explain this graphically? One thing that is clear from this plot is that the counts across all samples now fall within a comparable range.

rna-seq deseq2 normalization • 2.2k views
modified 19 months ago by Kevin Blighe (55k) • written 19 months ago by Arindam Ghosh (200)
19 months ago by Kevin Blighe (55k) wrote:

There are different ways to gauge, graphically, how effective a normalisation has been. Looking at your second plot, it would appear that normalisation has been successful in this case.

Apart from box-and-whisker plots, one can also do:

# Violin plot

Using regularised log or variance stabilised counts:

```
require(reshape2)
violinMatrix <- reshape2::melt(loggedCounts)
colnames(violinMatrix) <- c("Gene", "Sample", "Expression")

library(ggplot2)
ggplot(violinMatrix, aes(x = Sample, y = Expression)) +
  geom_violin() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

# Pairwise sample scatter plots

Using regularised log or variance stabilised counts:

```
require(car)
# One scatter plot per sample pair; box plots on the diagonal
scatterplotMatrix(loggedCounts, diagonal = "boxplot", pch = ".")
```

# Dispersion plot

Just looking at the unlogged, normalised counts, a dispersion plot gives a good idea of how good the modelling of dispersion dependent on the mean normalised counts has been.

```
options(scipen = 999)  # avoid scientific notation on the axes
plotDispEsts(dds,
  genecol = "black", fitcol = "red", finalcol = "dodgerblue",
  legend = TRUE, log = "xy",
  cex.axis = 0.8, cex = 0.3, cex.main = 0.8,
  xlab = "Mean of normalised counts", ylab = "Dispersion")
options(scipen = 0)
```

# Bootstrapped hierarchical clustering (unsupervised - i.e. entire dataset)

Using regularised log or variance stabilised counts:

```
require(pvclust)
# Bootstrapped clustering of samples (columns), with AU / BP support values
pv <- pvclust(loggedCounts, method.dist = "euclidean",
  method.hclust = "ward.D2", nboot = 100)
plot(pv)
```

# Symmetrical sample heatmap

Using regularised log or variance stabilised counts:

```
require(gplots)
require(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)  # colour palette

distsRL <- dist(t(loggedCounts))  # Euclidean distances between samples
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
heatmap.2(mat, Rowv = as.dendrogram(hc), symm = TRUE, trace = "none",
  col = rev(hmcol), cexRow = 1.0, cexCol = 1.0,
  margin = c(13, 13), key = FALSE)
```
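Since you asked about the RLE method specifically: it can help to see the median-of-ratios idea on toy data. The sketch below is a base-R illustration of the principle behind DESeq2's `estimateSizeFactors()`, not the DESeq2 implementation itself, and the count matrix is simulated:

```r
# Toy sketch of the RLE / median-of-ratios idea behind DESeq2's
# estimateSizeFactors() -- NOT the DESeq2 code itself.
set.seed(1)
counts <- matrix(rpois(4000, lambda = 50), ncol = 4,
                 dimnames = list(NULL, paste0("S", 1:4)))
counts[, 2] <- counts[, 2] * 3  # mimic a deeper-sequenced library

# Per-gene geometric mean across samples acts as a pseudo-reference
logGeoMeans <- rowMeans(log(counts))
keep <- is.finite(logGeoMeans)  # drop genes with any zero count

# Size factor = median ratio of each sample to the reference
sizeFactors <- apply(counts, 2, function(x)
  exp(median(log(x[keep]) - logGeoMeans[keep])))

normCounts <- sweep(counts, 2, sizeFactors, "/")
# After normalisation the sample medians line up in a box plot:
apply(log2(normCounts + 1), 2, median)
```

This is also why the medians (and, for well-behaved data, the quartiles) align in your box plots after DESeq2 normalisation: each sample is divided by a single scaling factor chosen to equalise the typical gene's ratio to the reference.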

Kevin

Thank you very much. This is really going to help me.

One more thing I need to clarify: for box/violin plots, should the first/third quartiles and the median be at the same level across all samples? For the same data set I tried a different low-count filter, and afterwards the third quartile and the median were level across samples, but while the first quartiles of group A were the same, those of group B were much lower.


It is normal to exclude low counts, but which cut-off did you use?

I was experimenting with several cut-offs, like:

```
1) rowSums(count) > 0
2) rowMeans(count) > 1
3) rowSums(count > 0) > 10
4) apply(count, 1, function(x) {all(x > 0)})
```

The plots I provided above are from #4, which is from the DESeq2 manual (if I remember correctly). Initially it was good to see the number of genes being reduced, but after DE analysis I realised I had lost some important genes only because, in 1 out of 45 samples, they had a 0 count.

The results of #1, #2 and #3 were similar.
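The behaviour of the four filters can be compared on simulated data. This is a base-R sketch with a hypothetical 45-sample count matrix, not the actual data set; gene 1 is given a zero in a single sample to reproduce the situation described above:

```r
# Toy comparison of the four low-count filters; `count` is a
# hypothetical matrix with 100 genes and 45 samples.
set.seed(42)
count <- matrix(rpois(100 * 45, lambda = 5), nrow = 100)
count[1, 1] <- 0  # gene 1: zero in just 1 of 45 samples

f1 <- rowSums(count) > 0                        # any signal at all
f2 <- rowMeans(count) > 1                       # mean count above 1
f3 <- rowSums(count > 0) > 10                   # non-zero in >10 samples
f4 <- apply(count, 1, function(x) all(x > 0))   # no zeros anywhere

# Gene 1 survives the first three filters but is dropped by #4
c(f1[1], f2[1], f3[1], f4[1])
```

Filter #4 discards any gene with a single zero anywhere, which is why it removes genes the other three keep.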

Hey, well, you definitely should exclude anything with just 0 counts across all samples (#1). I can see why #1, #2 and #3 give similar results. #4 may be too stringent: it requires that every sample has a count > 0, and I think it should be expected that some samples will return a 0 count for some genes.

So what's your opinion on using this? Is there something else I should try? Apart from this, if I read correctly, DESeq2 performs further filtering of its own.


I think that any of your first three are valid - I have seen them used in various studies. #4 is too stringent - you would need a good reason for using such a threshold, for example, if you needed a statistical test where zero values were not permitted.