Question: Why are DESeq2 size factors not null in case of sparse count matrix?
0
2.9 years ago by
holgerbrandl30
holgerbrandl30 wrote:

Hi,

I've followed the vignette to just perform size factorn normalization using

``````DESeqDataSetFromMatrix(countData = countMatrix, colData = colData, design = ~ 1)
dds = estimateSizeFactors(dds)
sizeFactors(dds)
``````

I've read the docs of `DESeq2::estimateSizeFactors` but I still struggle how size factors are calculated by DESeq2. The default type seems to be "ratio" but if so, I would expect many size-factors to become 0 if my count matrix is sparse. However, size factors calculated by DESeq2 seem to be never 0 even on very sparse data.

Is it internally using an altered implementation of the geometric-mean to exclude zeros when running with "ratio", or is it applying some offset addition and/or filtering before calculating the median of the ratios?

To understand it better I've created a clean-room reimplemntaiton of the normalization:

``````require(tidyverse)

data <- tribble(
~ ensembl_gene_id, ~ sample_A, ~ sample_B, ~ sample_C,
"gene_1", 50, 500, 0,
"gene_2", 30, 0, 10,
"gene_3", 0, 0, 0,
"gene_4", 0, 1000, 0,
"gene_5", 40, 0, 0,
"gene_6", 100, 0, 100,
)

dataLong = gather(data, sample, num_algn, -ensembl_gene_id)

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }
# gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x+0.5), na.rm=na.rm) / length(x)) }

size_factors = dataLong %>%
group_by(ensembl_gene_id) %>%
mutate(gm_norm_algn=num_algn/gm_mean(num_algn)) %>%
group_by(sample) %>%
summarize(size_fac=median(gm_norm_algn))

norm_data = dataLong %>% left_join(size_factors)  %>%
mutate(size_norm=num_algn/size_fac) %>%
select(ensembl_gene_id, sample, size_norm)
``````

But using it I get entirely different results caused by diminishing size factors.

Thanks & best regards.

Holger

rna-seq deseq deseq2 • 1.7k views
modified 2.9 years ago by Kevin Blighe69k • written 2.9 years ago by holgerbrandl30
1

I don't really understand what you mean by saying that many size factors should be 0. Size factors are computed per sample, not per gene. To answer your question, I think DESeq2 only uses genes that have > 0 counts in each sample to compute the size factors, If I remember correctly.

Yes, the size factors can never be 0 because they are used to divide the read counts by genes. By the way, size factors are distributed around 1 and the mean of all size factors in your study is equal to 1.

Indeed, when excluding all 0 counts from the data, I get factors that come very close to what DESeq2 is providing. Somehow this tiny but important details seems to be left out very often when explaining size-factor normalization.

Some examples:

``````# A tibble: 36 x 3
sample        size_fac size_fac_deseq
<chr>            <dbl>          <dbl>
1 P525L-0_1        1.00           1.03
2 P525L-0_2        0.939          0.953
3 P525L-1+3h_1     0.949          0.927
4 P525L-1+3h_2     1.27           1.28
5 P525L-1h_1       0.962          0.970
6 P525L-1h_2       1.09           1.11
7 P525L-2h30m_1    0.841          0.825
8 P525L-2h30m_2    1.14           1.14
9 P525L-30m_1      0.899          0.906
10 P525L-30m_2      0.882          0.879
# ... with 26 more rows
``````

As you can see, there are still small differences, but these could be rounding errors.

The "ratio" about which you speak, and the size factor, is calculated as follows:

1. Per sample, divide each gene's value against the geometric mean of its values across all samples.
2. We then have a ratio for each gene in each sample
3. For each gene, choose the geometric median of the ratios, which is then the sizing factor

From this, I cannot see how we would end up with zeros for sizing factors.

Note, however, that if you use this method and each of your genes has at least 1 zero across all samples, then an error is thrown.

Thanks for detailing out the definition again. I've tried to implement exactly this protocol in my code from above. To illustrate the problem better, I've updated the example data to become a sparse count matrix, which results in following size factors

``````  sample   size_fac
<chr>       <dbl>
1 sample_A     3.10
2 sample_B     0
3 sample_C     0
``````

It could well be that my implementation is incorrect, but I can't find what could be wrong.

See @Martombo's comment, which was imho the best pointer so far.

1
2.9 years ago by
Martombo2.7k
Seville, ES
Martombo2.7k wrote:

I'm pretty sure that genes with no counts in even only one sample are excluded from the computation. I can't find any reference for this at the moment, but if you look at the code (around line 500), you'll see that only genes with a finite loggeomean are kept (all non-zero counts). Notice also that if "every gene contains at least one zero, cannot compute log geometric means".

Thanks Martombo. If I get a spare couple of hours I will take a look

Indeed, genes with no counts in even only one sample are excluded. I obtain the exactly same factors as provided by DESeq2 when adopting my implementation from above to include such a filtering step:

``````size_factors = dataLong %>%
group_by(ensembl_gene_id) %>%
filter(!any(num_algn==0)) %>%
mutate(gm_norm_algn = num_algn / gm_mean(num_algn)) %>%
group_by(sample) %>%
summarize(size_fac = median(gm_norm_algn))
``````

Maybe you could change your comment into answer so that I could accept it?

Personally I find this filtering rather harsh, since in large experimental setups there will often be a sample with a 0 count for any given gene. So the more samples are part of the experiment, the more genes are likely to be excluded when calculating size factors.