Normalization Method In Comparing Differential Gene Expression
2
0
Entering edit mode
11.8 years ago
camelbbs ▴ 710

HI,

I see cuffdiff compare the RPKM from two samples but DESeq and edgeR compare the raw read counts. Now what is the normalization method in DESeq, as I know it use median of scaled counts, what is that? How to get it from the raw read counts.

Thanks, Che

rna-seq • 23k views
0
Entering edit mode

Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?

0
Entering edit mode

10
Entering edit mode
11.8 years ago
bdemarest ▴ 460

The R code that computes size factors in DESeq can be viewed by:

library(DESeq)
estimateSizeFactorsForMatrix

function (counts, locfunc = median)
{
loggeomeans <- rowMeans(log(counts))
apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>


You can see how it works in the following example, using crudely simulated RNA-Seq data:

library(DESeq)
set.seed(31415926)

# Create some simulated count data.
# All samples have the same underlying count, but with varying numbers
# total reads per sample. Poisson distribution used for simplicity,
# but note that real RNA-Seq data has higher variance than poisson.

true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)

e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e$Var1 * e$Var2

raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")

raw_data
#        sample_1 sample_2 sample_3 sample_4
# gene_1       15        9       11       18
# gene_2       19       21       21       40
# gene_3      106      114      153      207
# gene_4      569      565      756      992
# gene_5     1029     1260     1559     1968
# gene_6     5049     5897     7537    10029

sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
#  sample_1  sample_2  sample_3  sample_4
# 0.7605013 0.8659266 1.0771231 1.4414108

raw_data / do.call(rbind, rep(list(sizes), 6))
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022


Compare this to the more automated method described in the DESeq vignette:

cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)

cts
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022


I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.

# Use the variance stabilized data for clustering, heatmaps, etc.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
#         sample_1  sample_2  sample_3  sample_4
# gene_1  7.414359  7.435870  7.422947  7.372632
# gene_2  7.544171  7.469999  7.607719  7.489916
# gene_3  8.379632  8.334064  8.315754  8.338468
# gene_4  9.814620  9.784286  9.816016  9.850681
# gene_5 10.494539 10.634479 10.620615 10.677170
# gene_6 12.715437 12.842110 12.778146 12.808379

0
Entering edit mode

Thanks, very nicely explained

0
Entering edit mode

Hi, The code and explanation is definitely helpful! However, I found that I get different numbers for sizes when I paste the code in R. The values that I get match with what I get in Excel. Was wondering what could be the reason for the different sizes you get. P.S. I also want to note that the final numbers match with DESEQ normalization, so it is more of a curiosity and not a major concern. Thanks.

> set.seed(31415926)
>
> # Create some simulated count data.
> # All samples have the same underlying count, but with varying numbers
> # total reads per sample. Poisson distribution used for simplicity,
> # but note that real RNA-Seq data has higher variance than poisson.
>
> true_size_factors = c(1, 1.2, 1.5, 2.0)
> true_mean_counts = c(10, 20, 100, 500, 1000, 5000)
>
> e = expand.grid(true_size_factors, true_mean_counts)
> lambdas = e$Var1 * e$Var2
>
> raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
> rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
> colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")
>
> raw_data
sample_1 sample_2 sample_3 sample_4
gene_1       15        9       11       18
gene_2       19       21       21       40
gene_3      106      114      153      207
gene_4      569      565      756      992
gene_5     1029     1260     1559     1968
gene_6     5049     5897     7537    10029
> estimateSizeFactorsForMatrix = function (counts, locfunc = median)
+ {
+     loggeomeans <- rowMeans(log(counts))
+     apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
+         loggeomeans)[is.finite(loggeomeans)])))
+ }
> sizes = estimateSizeFactorsForMatrix(raw_data)
> sizes
sample_1  sample_2  sample_3  sample_4
0.7735959 0.8353574 1.0863588 1.4357334

> raw_data / do.call(rbind, rep(list(sizes), 6))
sample_1   sample_2   sample_3   sample_4
gene_1   19.38997   10.77383   10.12557   12.53715
gene_2   24.56063   25.13894   19.33063   27.86033
gene_3  137.02245  136.46853  140.83745  144.17719
gene_4  735.52615  676.35717  695.90269  690.93608
gene_5 1330.15186 1508.33635 1435.06918 1370.72804
gene_6 6526.66350 7059.25354 6937.85528 6985.28022

5
Entering edit mode
11.8 years ago

read the deseq paper it's well explained in it: http://genomebiology.com/2010/11/10/R106

and check also the R package with a very easy-reading vignette : http://bioconductor.org/packages/release/bioc/html/DESeq.html