I have an RNA-seq dataset that have been batch-corrected for the purpose of making a heatmap to look at in a paper. For the differential expression analysis, I added the batch effect as an element in the design matrix in EdgeR. Here I am just interested in visualization.
The R script that I have been trying to use with looks like this (edit: should clarify that the input file only contains data for differentially expressed genes):
# Loading required R libraries. library(gplots) library(limma) library(edgeR) library(RColorBrewer) # Loading dataset. data <- read.table("genes.counts.matrix", header=T, sep="\t") # Row names are now gene names, and gene name column is gone. rownames(data) <- data[,1] data <- data[,2:length(data[1,])] # Define experimental factors and design matrix. batch <- factor(substring(colnames(data), 13, 13)) factorA <- factor(substring(colnames(data), 1, 6)) factorB <- factor(substring(colnames(data), 7, 11)) # Making the design matrix design <- model.matrix(~factorA*factorB) # Put counts into DGEList. y <- DGEList(counts=data) # TMM-normalization. y <- calcNormFactors(y) # Convert to CPM and log2 transformation. data <- cpm(y, log=TRUE, prior.count=4) # Remove batch effect. data <- removeBatchEffect(data, batch=batch, design=design) # Scale the data. data <- t(scale(t(data), scale=F)) # Calculate correlations between samples. cr = cor(data, method='pearson') # Calculate correlations between genes. gene_cor = cor(t(data), method='pearson') # Find gene distances: gene_dist = as.dist(1-gene_cor) # hclust the data hc_genes = hclust(gene_dist, method='complete') hc_samples = hclust(as.dist(1-cr), method='complete') # heat map colors and breaks. myheatcol = colorRampPalette(c('red', 'yellow'))(n = 75) quantBrks = quantile(data, c(0.03, 0.97)) # Make heatmap heatmap.2(data, key.xlab="log2 centered TMM-norm CPM", margins=c(10,5), lmat=rbind(c(4,3),c(2,1)), lhei=c(1.3,5), lwid=c(2.5,5), labRow=NA, dendrogram='both', Rowv=as.dendrogram(hc_genes), Colv=as.dendrogram(hc_samples), col=myheatcol, scale="none", density.info="none", trace="none", key=TRUE, keysize=1.2, cexCol=1, breaks=seq(quantBrks, quantBrks, length=76))
The batch removal worked very nicely and everything up to and including the application of removeBatchEffect() works without any problems or ambiguities. My issue is with what happens after the batch correction.
Although I probably won't be allowed to show the heat map since it is close to a publication grade image, the above code gives a heatmap that splits the samples first on factor A so that all A+ samples cluster together and all A- samples cluster together (with no sign of a batch effect).
Within each of these sub-clusters, B+ and B- samples cluster moderately, but far from perfectly. I am happy with this result from a scientific standpoint and it looks what I expect because a DE analysis did not find any signal from factor B.
This code appears to scale the mean only and not scale by SD due to:
data <- t(scale(t(data), scale=F))
This, as I have understood it, is not the most usual method. Instead, scaling is often used both for means and SDs. If I instead use
data <- t(scale(t(data), scale=T))
Which seems to scale both means and SDs, but this makes the sub-clustering due to B perfect i.e. all A+/B+ samples cluster together, all A+/B- samples cluster together etc. which would be expected in an experiment where A had the biggest effect and B had a smaller, but discernible effect.
Clearly, this result looks way too good to be true, which makes me highly suspicious that it is just a result of switching F to T and not anything biological, which might give a misleading impression.
If I instead comment out that line and use scale by row in the heatmap.2 function:
#data <- t(scale(t(data), scale=F))
heatmap.2(data, key.xlab="log2 centered TMM-norm CPM", margins=c(10,5), lmat=rbind(c(4,3),c(2,1)), lhei=c(1.3,5), lwid=c(2.5,5), labRow=NA, dendrogram='both', Rowv=as.dendrogram(hc_genes), Colv=as.dendrogram(hc_samples), col=myheatcol, scale="row", density.info="none", trace="none", key=TRUE, keysize=1.2, cexCol=1, breaks=seq(quantBrks, quantBrks, length=76))
It complains that I should not use custom breaks together with scale="row". Removing custom breaks provides a similar but not identical clustering to using
data <- t(scale(t(data), scale=F))
together with scale="none" in the heatmap.2() function (but the lack of custom breaks make the color differences very weak).
I know there is no single, perfect way to do this, but there are surely methods that are way off and deceptive and I want to avoid the latter. I definitely do not want to arbitrarily pick a method that produces results that looks especially good or otherwise conforms to researcher bias. Then there would be no point with scientific research if you get to "pick your own results".
My questions are:
(1) Is it legitimate to use the first script in this question with data <- t(scale(t(data), scale=F))? What is the issue with scaling means but not SDs? How bad is this?
(2) Should I instead use data <- t(scale(t(data), scale=T))? Will this scale both means and SDs (this is the impression I get from reading the manual)? Why does it produce such considerable change in clustering (from complete clustering on A and partial sub-clustering of B to complete clustering of A and then complete sub-clustering on B)?
(3) If I skip custom clustering with data <- t(scale(t(data)) and just let heatmap.2() do it, then I cannot use custom breaks. How to make the differences more visually distinct if it does not allow me to use breaks?
(4) Are any of these options inadvisable in the "this is misleading" sense? If so, which?