Question: Normalization Method In Comparing Differential Gene Expression
0
8.2 years ago by
camelbbs680
China
camelbbs680 wrote:

HI,

I see cuffdiff compare the RPKM from two samples but DESeq and edgeR compare the raw read counts. Now what is the normalization method in DESeq, as I know it use median of scaled counts, what is that? How to get it from the raw read counts.

Thanks, Che

rna-seq • 21k views
modified 2.4 years ago by ra296720 • written 8.2 years ago by camelbbs680

Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?

Please do not add an answer unless you're answering the top level question. I'm moving this to a comment, but please be more careful in the future.

10
8.2 years ago by
bdemarest460
Salt Lake City, UT, USA
bdemarest460 wrote:

The R code that computes size factors in DESeq can be viewed by:

``````library(DESeq)
estimateSizeFactorsForMatrix

function (counts, locfunc = median)
{
loggeomeans <- rowMeans(log(counts))
apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>
``````

You can see how it works in the following example, using crudely simulated RNA-Seq data:

``````library(DESeq)
set.seed(31415926)

# Create some simulated count data.
# All samples have the same underlying count, but with varying numbers
# total reads per sample. Poisson distribution used for simplicity,
# but note that real RNA-Seq data has higher variance than poisson.

true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)

e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e\$Var1 * e\$Var2

raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")

raw_data
#        sample_1 sample_2 sample_3 sample_4
# gene_1       15        9       11       18
# gene_2       19       21       21       40
# gene_3      106      114      153      207
# gene_4      569      565      756      992
# gene_5     1029     1260     1559     1968
# gene_6     5049     5897     7537    10029

sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
#  sample_1  sample_2  sample_3  sample_4
# 0.7605013 0.8659266 1.0771231 1.4414108

raw_data / do.call(rbind, rep(list(sizes), 6))
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022
``````

Compare this to the more automated method described in the DESeq vignette:

``````cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)

cts
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022
``````

I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.

``````# Use the variance stabilized data for clustering, heatmaps, etc.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
#         sample_1  sample_2  sample_3  sample_4
# gene_1  7.414359  7.435870  7.422947  7.372632
# gene_2  7.544171  7.469999  7.607719  7.489916
# gene_3  8.379632  8.334064  8.315754  8.338468
# gene_4  9.814620  9.784286  9.816016  9.850681
# gene_5 10.494539 10.634479 10.620615 10.677170
# gene_6 12.715437 12.842110 12.778146 12.808379
``````

Thanks, very nicely explained

Hi, The code and explanation is definitely helpful! However, I found that I get different numbers for sizes when I paste the code in R. The values that I get match with what I get in Excel. Was wondering what could be the reason for the different sizes you get. P.S. I also want to note that the final numbers match with DESEQ normalization, so it is more of a curiosity and not a major concern. Thanks.

``````> set.seed(31415926)
>
> # Create some simulated count data.
> # All samples have the same underlying count, but with varying numbers
> # total reads per sample. Poisson distribution used for simplicity,
> # but note that real RNA-Seq data has higher variance than poisson.
>
> true_size_factors = c(1, 1.2, 1.5, 2.0)
> true_mean_counts = c(10, 20, 100, 500, 1000, 5000)
>
> e = expand.grid(true_size_factors, true_mean_counts)
> lambdas = e\$Var1 * e\$Var2
>
> raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
> rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
> colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")
>
> raw_data
sample_1 sample_2 sample_3 sample_4
gene_1       15        9       11       18
gene_2       19       21       21       40
gene_3      106      114      153      207
gene_4      569      565      756      992
gene_5     1029     1260     1559     1968
gene_6     5049     5897     7537    10029
> estimateSizeFactorsForMatrix = function (counts, locfunc = median)
+ {
+     loggeomeans <- rowMeans(log(counts))
+     apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
+         loggeomeans)[is.finite(loggeomeans)])))
+ }
> sizes = estimateSizeFactorsForMatrix(raw_data)
> sizes
sample_1  sample_2  sample_3  sample_4
0.7735959 0.8353574 1.0863588 1.4357334

> raw_data / do.call(rbind, rep(list(sizes), 6))
sample_1   sample_2   sample_3   sample_4
gene_1   19.38997   10.77383   10.12557   12.53715
gene_2   24.56063   25.13894   19.33063   27.86033
gene_3  137.02245  136.46853  140.83745  144.17719
gene_4  735.52615  676.35717  695.90269  690.93608
gene_5 1330.15186 1508.33635 1435.06918 1370.72804
gene_6 6526.66350 7059.25354 6937.85528 6985.28022
``````
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by richieangel0
5
8.2 years ago by
Belgium, Brussels
Nicolas Rosewick9.3k wrote:

read the deseq paper it's well explained in it: http://genomebiology.com/2010/11/10/R106

and check also the R package with a very easy-reading vignette : http://bioconductor.org/packages/release/bioc/html/DESeq.html

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Nicolas Rosewick9.3k