Question

Trouble with Cyclin D3 in the Golub dataset

1

Entering edit mode

5.3 years ago

fuellen ▴ 20

Dear ALL,

Many of you will be aware of the Golub et al data, one of the first high-throughput gene expression datasets ever, used for teaching in many places. I recently noticed that for one of the genes, the expression pattern in their famous figure does not seem to match the underlying data (as they are available via the R package, and described in Wim Krijnen's wonderful book "Applied Statistics for Bioinformatics using R".) But maybe I'm just missing or misunderstanding something.

From the Figure of the original publication, it is clear that expression values of Cyclin D3 for AML are mostly below 0, cf. the forth row.

However, inspecting the data, and concordant with e.g. Figure 2.4 of the book, "Boxplot of ALL and AML expression values of gene CCND3 Cyclin D3", it is clear that for AML, expression values of Cyclin D3 are above 0.

I checked that there is only one Cyclin D3 in the dataset; I'm not sure about the possibility that the data were normalized differently; I checked some other genes, and I found no such problem for any other gene, so I'm not sure that it's a normalization issue.

Can anyone help and shed some light on this issue?

Thanks!! georg

R gene expression • 1.1k views

ADD COMMENT • link updated 2.5 years ago by Victor ▴ 10 • written 5.3 years ago by fuellen ▴ 20

score 1 · Answer 1 · 2019-09-11

It seems there might be some misunderstanding instead of definite answer whether raw expression data from this paper is available. Unless i am wrong, only various normalized data can be found along with some rescaling factors.

From the Figure of the original publication, it is clear that expression values of Cyclin D3 for AML are mostly below 0, cf. the forth row.

that figure's caption (3B from Golub et al) specifically says the color scale indicates "SDs above or below the mean.", so definitely not the raw, or even normalized expression level.

In my answer i will try to reproduce the plot based on those assumptions using a copy of the data from cancer program legacy publication resources. Original figure 3B (see below) is a 50(genes) x 38(samples) heatmap, which hints that the authors used only the train dataset, and not the complete set of 72 samples to plot it. The train dataset i am referring to may be found here and is a 7129(genes) x 38(samples) normalized expression value matrix. We can import it and select only the 50 genes appearing in the plot 3B. Afterwards, we can calculate the means, SDs for every row (gene) across 38 samples. I hope that the rest of the code along with the comments is self-explanatory.

# get the data
data <- read.table("https://pubs.broadinstitute.org/mpr/projects/Leukemia/data_set_ALL_AML_train.txt", sep="\t", header=TRUE, row.names=NULL, quote="", fill=FALSE, comment.char="")

# remove unnecessary columns
data = data[ , seq(1, ncol(data) - 1, 2) ]

# rename columns a bit
colnames(data) <- c("Gene", sprintf("sample%s", seq(1, ncol(data)-1)))

# select the 50 genes, in order as they appear in figure 3B
data = data[rev(c(5772,4328,2642,2354,6281,4535,5593,2441,6855,1630,3056,5501,4177,4389,2348,7119,5191,2909,5254,6515,1928,4973,3096,1909,1704,2020,4847,3320,1745,3847,1834,2288,5039,1882,4196,6201,2402,3258,2242,1249,2111,6200,2121,2043,2186,6373,6539,6803,6376,4052)),]

# calculate the per-gene standard deviations
sds = outer(1:nrow(data), 2:ncol(data) , FUN=function(row_nr,col_nr) apply(data[,-1],1,sd)[row_nr] )

# calculate the per-gene means
means = outer(1:nrow(data), 2:ncol(data) , FUN=function(row_nr,col_nr) apply(data[,-1],1,mean)[row_nr] )

# calculate the per-sample differences to a per-gene standard deviation
diffs = ( as.matrix(data[,-1]) - as.matrix(means) ) / as.matrix(sds)

# plot
heatmap(diffs, Rowv=NA, Colv=NA, labRow=data[,1],  margins = c(4,4), col = colorRampPalette(c("blue", "lightblue", "firebrick1", "red"))(n = 100) )

This results in a following plot: enter image description here

Original plot for comparison:

figure 3B from Golub et al.