Question: Normalization Method In Comparing Differential Gene Expression
0
gravatar for camelbbs
6.4 years ago by
camelbbs650
China
camelbbs650 wrote:

HI,

I see cuffdiff compare the RPKM from two samples but DESeq and edgeR compare the raw read counts. Now what is the normalization method in DESeq, as I know it use median of scaled counts, what is that? How to get it from the raw read counts.

Thanks, Che

rna-seq • 19k views
ADD COMMENTlink modified 8 months ago by ra296710 • written 6.4 years ago by camelbbs650

Hello! Thanks for the comments. Must you eliminate those probes that are always zero in all your samples before doing the normalization or it is not necessary?

ADD REPLYlink written 8 months ago by ra296710

Please do not add an answer unless you're answering the top level question. I'm moving this to a comment, but please be more careful in the future.

ADD REPLYlink written 8 months ago by RamRS21k
10
gravatar for bdemarest
6.4 years ago by
bdemarest460
Salt Lake City, UT, USA
bdemarest460 wrote:

The R code that computes size factors in DESeq can be viewed by:

library(DESeq)
estimateSizeFactorsForMatrix

function (counts, locfunc = median) 
{
    loggeomeans <- rowMeans(log(counts))
    apply(counts, 2, function(cnts) exp(locfunc((log(cnts) - 
        loggeomeans)[is.finite(loggeomeans)])))
}
<environment: namespace:DESeq>

You can see how it works in the following example, using crudely simulated RNA-Seq data:

library(DESeq)
set.seed(31415926)

# Create some simulated count data.
# All samples have the same underlying count, but with varying numbers
# total reads per sample. Poisson distribution used for simplicity,
# but note that real RNA-Seq data has higher variance than poisson.

true_size_factors = c(1, 1.2, 1.5, 2.0)
true_mean_counts = c(10, 20, 100, 500, 1000, 5000)

e = expand.grid(true_size_factors, true_mean_counts)
lambdas = e$Var1 * e$Var2

raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")

raw_data
#        sample_1 sample_2 sample_3 sample_4
# gene_1       15        9       11       18
# gene_2       19       21       21       40
# gene_3      106      114      153      207
# gene_4      569      565      756      992
# gene_5     1029     1260     1559     1968
# gene_6     5049     5897     7537    10029

sizes = estimateSizeFactorsForMatrix(raw_data)
sizes
#  sample_1  sample_2  sample_3  sample_4 
# 0.7605013 0.8659266 1.0771231 1.4414108 

raw_data / do.call(rbind, rep(list(sizes), 6))
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

Compare this to the more automated method described in the DESeq vignette:

cds = newCountDataSet(raw_data, conditions=c("trt", "trt", "cont", "cont"))
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE)

cts
#          sample_1   sample_2   sample_3   sample_4
# gene_1   19.38997   10.77383   10.12557   12.53715
# gene_2   24.56063   25.13894   19.33063   27.86033
# gene_3  137.02245  136.46853  140.83745  144.17719
# gene_4  735.52615  676.35717  695.90269  690.93608
# gene_5 1330.15186 1508.33635 1435.06918 1370.72804
# gene_6 6526.66350 7059.25354 6937.85528 6985.28022

I would also add that the variance stabilizing transformation provided by DESeq is very useful. The Authors recommend using it for clustering, heatmaps, or other visualization.

# Use the variance stabilized data for clustering, heatmaps, etc.
cds = estimateDispersions(cds, method="blind")
vsd = getVarianceStabilizedData(cds)
vsd
#         sample_1  sample_2  sample_3  sample_4
# gene_1  7.414359  7.435870  7.422947  7.372632
# gene_2  7.544171  7.469999  7.607719  7.489916
# gene_3  8.379632  8.334064  8.315754  8.338468
# gene_4  9.814620  9.784286  9.816016  9.850681
# gene_5 10.494539 10.634479 10.620615 10.677170
# gene_6 12.715437 12.842110 12.778146 12.808379
ADD COMMENTlink written 6.4 years ago by bdemarest460

Thanks, very nicely explained

ADD REPLYlink written 5.9 years ago by Suren90

Hi, The code and explanation is definitely helpful! However, I found that I get different numbers for sizes when I paste the code in R. The values that I get match with what I get in Excel. Was wondering what could be the reason for the different sizes you get. P.S. I also want to note that the final numbers match with DESEQ normalization, so it is more of a curiosity and not a major concern. Thanks.

> set.seed(31415926)
>
> # Create some simulated count data.
> # All samples have the same underlying count, but with varying numbers
> # total reads per sample. Poisson distribution used for simplicity,
> # but note that real RNA-Seq data has higher variance than poisson.
>
> true_size_factors = c(1, 1.2, 1.5, 2.0)
> true_mean_counts = c(10, 20, 100, 500, 1000, 5000)
>
> e = expand.grid(true_size_factors, true_mean_counts)
> lambdas = e$Var1 * e$Var2
>
> raw_data = matrix(rpois(n=nrow(e), lambda=lambdas), ncol=4, byrow=TRUE)
> rownames(raw_data) = paste("gene", 1:nrow(raw_data), sep="_")
> colnames(raw_data) = paste("sample", 1:ncol(raw_data), sep="_")
>
> raw_data
       sample_1 sample_2 sample_3 sample_4
gene_1       15        9       11       18
gene_2       19       21       21       40
gene_3      106      114      153      207
gene_4      569      565      756      992
gene_5     1029     1260     1559     1968
gene_6     5049     5897     7537    10029
> estimateSizeFactorsForMatrix = function (counts, locfunc = median)
+ {
+     loggeomeans <- rowMeans(log(counts))
+     apply(counts, 2, function(cnts) exp(locfunc((log(cnts) -
+         loggeomeans)[is.finite(loggeomeans)])))
+ }
> sizes = estimateSizeFactorsForMatrix(raw_data)
> sizes
 sample_1  sample_2  sample_3  sample_4
0.7735959 0.8353574 1.0863588 1.4357334

> raw_data / do.call(rbind, rep(list(sizes), 6))
         sample_1   sample_2   sample_3   sample_4
gene_1   19.38997   10.77383   10.12557   12.53715
gene_2   24.56063   25.13894   19.33063   27.86033
gene_3  137.02245  136.46853  140.83745  144.17719
gene_4  735.52615  676.35717  695.90269  690.93608
gene_5 1330.15186 1508.33635 1435.06918 1370.72804
gene_6 6526.66350 7059.25354 6937.85528 6985.28022
ADD REPLYlink modified 19 months ago • written 19 months ago by richieangel0
5
gravatar for Nicolas Rosewick
6.4 years ago by
Belgium, Brussels
Nicolas Rosewick7.5k wrote:

read the deseq paper it's well explained in it: http://genomebiology.com/2010/11/10/R106

and check also the R package with a very easy-reading vignette : http://bioconductor.org/packages/release/bioc/html/DESeq.html

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by Nicolas Rosewick7.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 909 users visited in the last hour