Question

Is it correct to use DESeq2's baseMean for interpretation of transcript expression across all samples?

3.7 years ago

kehoe ▴ 40

Example:

Gene      baseMean
Gene A    20000.00
Gene B    80.00
Gene C    0

Refresher

baseMean: 'The values above are the average of the normalized count values, dividing by size factors, taken over all samples, normalizing for sequencing depth. It does not take into account gene length. The base mean is used in DESeq2 only for estimating the dispersion of a gene (it is used to estimate the fitted dispersion). For this task, the range of counts for a gene is relevant but not the gene's length (or other technical factors influencing the count, like sequence content).'
Gene length: 'Accounting for gene length is necessary for comparing expression between different genes within the same sample.'
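To make the refresher concrete, here is a minimal NumPy sketch of how a baseMean-style value comes about: median-of-ratios size factors, then the per-gene mean of the size-factor-normalized counts. The function names and the toy count matrix are made up for illustration; this is not DESeq2's own code, only a sketch of the calculation described above.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Genes with a zero in any sample have a zero geometric mean and are skipped.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples acts as the reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor of a sample = median ratio of its counts to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def base_mean(counts):
    """Mean over samples of counts divided by size factors (no length correction)."""
    counts = np.asarray(counts, dtype=float)
    return (counts / size_factors(counts)).mean(axis=1)

# Hypothetical toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = np.array([
    [100, 200],
    [10, 20],
    [50, 100],
    [0, 0],   # never detected, like 'Gene C'
])
sf = size_factors(counts)
bm = base_mean(counts)
print(sf)
print(bm)
```

Gene length never appears in this calculation, which is why the resulting values are comparable for one gene across samples but not directly between genes.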

My questions:

Is the baseMean value the final dispersion estimate before fitting the GLM and testing?
Observation: 'Gene A' has the highest transcript count value, 'Gene B' the lowest among the detected genes, and 'Gene C' was not detected in any sample. Is this reading correct even without comparing a single gene between samples? For comparing 'Gene A' between samples, the log2FoldChange value is used and padj indicates significance.
Would a combination of baseMean and log2FoldChange be useful to determine if a gene is highly present (expressed?) in all samples and differentially expressed between samples? Essentially, does baseMean = level of transcript (expression?) overall?

Thank you in advance!

RNA-Seq DESeq2 baseMean

updated 3.7 years ago by swbarnes2 14k • written 3.7 years ago by kehoe ▴ 40

score 4 · Answer 1 · 2020-08-14

Hi. In your observation, Gene A does not necessarily have the highest transcript count, but it does have the highest normalized read count. The number of reads a gene receives in RNA-seq depends on its expression and also on its length, as you mentioned. A gene X could therefore have the same normalized read count (the normalization does not account for gene length) as another gene Y that is half as long as gene X but expressed twice as highly. To address your last question: baseMean does not represent the level of transcript expression, since it does not consider gene length, but it can give you a quick approximation. In your example, it is probably safe to assume that gene B is not 250 times shorter than gene A (20000/80) while being equally expressed, which is what it would take for length alone to explain the difference in normalized read counts. It's more likely that gene A really is more highly expressed. This approximation gets tricky when the baseMean values are closer together, such as 1000 and 2000.
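The length argument can be written out as a small arithmetic sketch. It assumes a deliberately simplified model in which expected reads scale with abundance times transcript length (a full-length library prep); the function name and all abundance and length values are invented for illustration.

```python
def expected_reads(abundance, length_kb):
    """Simplified model: reads are proportional to abundance x transcript length."""
    return abundance * length_kb

# Gene X: longer, lower abundance. Gene Y: half as long, twice as abundant.
gene_x_reads = expected_reads(abundance=10, length_kb=4.0)
gene_y_reads = expected_reads(abundance=20, length_kb=2.0)
print(gene_x_reads == gene_y_reads)  # same read count despite a 2x abundance difference

# For length alone to explain baseMean 20000 vs 80 at equal abundance,
# gene A would have to be this many times longer than gene B:
length_ratio_needed = 20000 / 80
print(length_ratio_needed)
```

A 250-fold length difference between two genes is implausible in most cases, whereas a 2-fold difference (baseMean 1000 vs 2000) is entirely ordinary, which is why close baseMean values say little about relative expression.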

If you quantify with Salmon, for example, its output already includes another normalized measure, TPM (transcripts per million), which takes gene length into account and is better suited to finding highly expressed genes in your data.
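For reference, the textbook TPM calculation for a single sample looks roughly like the sketch below. It uses plain transcript lengths and made-up numbers; Salmon itself works with effective lengths and its own abundance model, so treat this as an illustration of the idea rather than a reproduction of Salmon's output. It also assumes reads are sampled along the whole transcript.

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Textbook TPM for one sample: length-normalize, then scale to one million."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_kb, dtype=float)
    rate = counts / lengths_kb          # reads per kilobase
    return rate / rate.sum() * 1e6      # rescale so the values sum to 1e6

# Hypothetical genes: the first two have equal counts but different lengths.
counts = [400, 400, 200]
lengths_kb = [4.0, 2.0, 1.0]
values = tpm(counts, lengths_kb)
print(values)
```

In this toy example the first two genes have identical counts, yet the shorter one ends up with twice the TPM, and the third gene, with half the counts, ties with the second because it is half as long again.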

score 2 · Answer 2 · 2020-08-14

I would note that some library prep methods are biased to one end of the transcript, in which case correcting for transcript length is not appropriate; long genes and short genes both have exactly one 3' end.

So you should find out what library prep was used before you consider adjusting for transcript length.