Question: How to graphically tell if data has been normalized?
Arindam Ghosh wrote, 9 months ago:

Is there a method to tell whether my RNA-seq data has been normalized? I used DESeq2 to normalize my feature-count data and then plotted the log-transformed normalized counts with box plots. Is this a good way to assess it? I referred to some articles that used this method: for median normalization the median lines should be at the same level, and for upper-quartile normalization the third quartiles should be at the same level. DESeq2 uses the RLE (relative log expression) method, so how can I explain this graphically? One thing that is clear from the plot is that most of the counts across all samples now fall within a comparable range.

[boxplot of unnormalized counts]

[boxplot of normalized counts]
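For reference, boxplots like the ones above can be produced directly from a DESeq2 object (a minimal sketch; `dds` is assumed to be a DESeqDataSet on which size factors have already been estimated):

```r
library(DESeq2)

# Raw versus normalized counts, log-transformed with a pseudocount of 1
raw  <- counts(dds, normalized = FALSE)
norm <- counts(dds, normalized = TRUE)

par(mfrow = c(1, 2))
boxplot(log2(raw + 1),  las = 2, main = "Unnormalized", ylab = "log2(count + 1)")
boxplot(log2(norm + 1), las = 2, main = "Normalized",   ylab = "log2(count + 1)")
```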

rna-seq deseq2 normalization
modified 9 months ago by Kevin Blighe • written 9 months ago by Arindam Ghosh
Kevin Blighe wrote, 9 months ago:

There are different ways to gauge [graphically] how effective a normalisation has been. Looking at your second plot, it would appear in this case that normalisation has been successful.

Apart from box-and-whisker plots, one can also do:

Violin plot

Using regularised log or variance stabilised counts:

# 'loggedCounts' is a genes-by-samples matrix, e.g. assay(rlog(dds)) or assay(vst(dds))
library(reshape2)
violinMatrix <- reshape2::melt(loggedCounts)
colnames(violinMatrix) <- c("Gene", "Sample", "Expression")

library(ggplot2)
ggplot(violinMatrix, aes(x = Sample, y = Expression)) +
  geom_violin() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

[violin plots of logged counts, one per sample]


Pairwise sample scatter plots

Using regularised log or variance stabilised counts:

library(car)
# Note: this diagonal= syntax is for older versions of car;
# newer versions expect diagonal = list(method = "boxplot")
scatterplotMatrix(loggedCounts, diagonal = "boxplot", pch = ".")

[pairwise scatterplot matrix with per-sample boxplots on the diagonal]


Dispersion plot

Just looking at the unlogged, normalised counts, a dispersion plot gives a good idea of how well dispersion has been modelled as a function of the mean of the normalised counts.

# dds: DESeqDataSet after running DESeq() (or at least estimateDispersions())
options(scipen=999)  # suppress scientific notation on the axes
plotDispEsts(dds, genecol="black", fitcol="red", finalcol="dodgerblue", legend=TRUE, log="xy", cex.axis=0.8, cex=0.3, cex.main=0.8, xlab="Mean of normalised counts", ylab="Dispersion")
options(scipen=0)

[dispersion-estimate plot: gene-wise, fitted, and final estimates versus mean of normalised counts]

------------------------------

More for outlier detection:

Bootstrapped hierarchical clustering (unsupervised - i.e. entire dataset)

Using regularised log or variance stabilised counts:

library(pvclust)
# pvclust clusters the columns (here, samples) and attaches bootstrap p-values to each node
pv <- pvclust(loggedCounts, method.dist="euclidean", method.hclust="ward.D2", nboot=100)
plot(pv)

[bootstrapped sample dendrogram with approximately-unbiased and bootstrap-probability values]

Principal components analysis
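Using regularised log or variance stabilised counts — DESeq2 ships a convenience function for this (a sketch; `vsd` is assumed to be the output of `vst(dds)` or `rlog(dds)`, and `condition` an assumed column of the sample metadata):

```r
library(DESeq2)

# vsd <- vst(dds)   # or rlog(dds)
# Colour samples by the assumed metadata column 'condition';
# samples grouping by condition, with no lone outliers,
# suggests normalisation has behaved well.
plotPCA(vsd, intgroup = "condition")
```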

Symmetrical sample heatmap

Using regularised log or variance stabilised counts:

library(gplots)
library(RColorBrewer)

distsRL <- dist(t(loggedCounts))
mat <- as.matrix(distsRL)
# 'IDlist' and 'condition' are assumed columns of colData(dds)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(IDlist, condition, sep=", "))
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=rev(hmcol), cexRow=1.0, cexCol=1.0, margin=c(13, 13), key=FALSE)

[symmetrical sample-to-sample distance heatmap with clustered dendrogram]

Kevin


Thank you very much. This is really going to help me.

One more thing I need to clarify: for box/violin plots, should the first/third quartiles and the median be at the same level across all samples? For the same dataset I tried a different low-count filter, and afterwards the third quartile and the median were level across samples, but while the first quartile was consistent within group A, in group B it was much lower.

written 9 months ago by Arindam Ghosh

It is normal to exclude low counts, but which cut-off did you use?

written 9 months ago by Kevin Blighe

I was experimenting with several like:

1) rowSums(count) > 0

2) rowMeans(count) > 1

3) rowSums(count>0)>10

4) apply(count, 1, function(x){all(x>0)})

The plots I provided above are from #4, which (if I remember correctly) is from the DESeq2 manual. Initially it was good to see the number of genes being reduced, but after differential expression analysis I realised I had lost some important genes only because they had a 0 count in 1 out of 45 samples.

The results of #1, #2 & #3 were similar.
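(A quick way to compare the four filters is to count how many genes each one retains — a self-contained sketch on simulated counts; on the real 45-sample matrix, substitute the actual counts and restore the original thresholds:)

```r
set.seed(1)
# Toy genes-by-samples count matrix standing in for the real data
count <- matrix(rnbinom(200 * 6, mu = 5, size = 1), nrow = 200,
                dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6)))

keep1 <- rowSums(count) > 0                       # any signal at all
keep2 <- rowMeans(count) > 1                      # mean count above 1
keep3 <- rowSums(count > 0) > 3                   # detected in >3 samples (>10 in the real data)
keep4 <- apply(count, 1, function(x) all(x > 0))  # no zeros anywhere (most stringent)

sapply(list("#1" = keep1, "#2" = keep2, "#3" = keep3, "#4" = keep4), sum)
```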

written 9 months ago by Arindam Ghosh

Hey, well, you definitely should exclude anything with 0 counts across all samples (#1). I can see why #1, #2, and #3 give similar results. #4 may be too stringent, as it should be expected that some samples will return a 0 count for a given gene (#4 requires that all samples have counts > 0).

written 9 months ago by Kevin Blighe

So what's your opinion on using this? Is there something else I should try? Apart from this, if I read correctly, DESeq2 performs further filtering of its own.

written 9 months ago by Arindam Ghosh

I think that any of your first three is valid - I have seen them used in various studies. #4 is too stringent - you would need a good reason for such a threshold, for example, a statistical test in which zero values are not permitted.

written 8 months ago by Kevin Blighe

I guess I need a bit more help with this. Going through the graphs generated by the first three methods, something seems off, especially in the boxplot of normalized counts and the density plot. Please see the attached files. The plots are similar for all three filters, so I am showing only the ones generated by #3. In the normalized-count boxplot the median and upper quartile are in range, but the lower quartile is not. And in the density plot, I would have expected the curves to overlap.

[attached: normalized-count boxplot, density plot, dispersion plot, ECDF, MA plot, PCA, sample heatmap]
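(For reference, per-sample density curves like the one attached can be drawn from the logged, normalised counts — a sketch, with `loggedCounts` as in the answer above; after successful normalisation the curves should largely coincide:)

```r
# Overlay one density curve per sample (columns of loggedCounts);
# axis limits may need adjusting for your data
plot(density(loggedCounts[, 1]), main = "Per-sample expression density",
     xlab = "log expression")
for (j in 2:ncol(loggedCounts)) {
  lines(density(loggedCounts[, j]), col = j)
}
```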

written 8 months ago by Arindam Ghosh
Powered by Biostar version 2.3.0