Is it correct to use DESeq2's baseMean for interpretation of transcript expression across all samples?
3.7 years ago
kehoe ▴ 40

Example:

baseMean

Gene A 20000.00

Gene B 80.00

Gene C 0

Refresher

  • baseMean: 'The values above are the average of the normalized count values, dividing by size factors, taken over all samples, normalizing for sequencing depth. It does not take into account gene length. The base mean is used in DESeq2 only for estimating the dispersion of a gene (it is used to estimate the fitted dispersion). For this task, the range of counts for a gene is relevant but not the gene's length (or other technical factors influencing the count, like sequence content).'
  • Gene length: 'Accounting for gene length is necessary for comparing expression between different genes within the same sample.'
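
To make the quoted definition concrete, here is a minimal sketch of how a baseMean-like value can be computed: estimate one size factor per sample with the median-of-ratios idea, divide the raw counts by it, and average over samples. It is written in plain Python for illustration and is not DESeq2's own code; the function names `size_factors` and `base_mean` and the toy counts are assumptions made up for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors. counts maps gene -> list of raw counts per sample."""
    n_samples = len(next(iter(counts.values())))
    # Only genes with a nonzero count in every sample have a usable geometric mean.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def base_mean(counts):
    """Mean over samples of counts divided by the per-sample size factor."""
    sf = size_factors(counts)
    return {gene: sum(c / s for c, s in zip(row, sf)) / len(row)
            for gene, row in counts.items()}

# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {
    "geneA": [100, 200],
    "geneB": [10, 20],
    "geneC": [0, 0],
    "geneD": [50, 100],
}
sf = size_factors(toy_counts)
bm = base_mean(toy_counts)
```

Note that gene length never enters the calculation: the value corrects for sequencing depth only, which is why it says little about how two different genes compare.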

My questions:

  • Is the baseMean value the final dispersion estimate before fitting the GLM and testing?
  • Observation: 'Gene A' has the highest transcript count value, 'Gene B' the lowest, and 'Gene C' was not detected in any sample. Is this reading correct without comparing a single gene between samples? To compare 'Gene A' between samples, the log2FoldChange value is used and padj estimates significance.
  • Would a combination of baseMean and log2FoldChange be useful to determine if a gene is highly present (expressed?) in all samples and differentially expressed between samples? Essentially, does baseMean = level of transcript (expression?) overall?

Thank you in advance!

RNA-Seq DESeq2 baseMean • 8.3k views
3.7 years ago
caggtaagtat ★ 1.9k

Hi, in your observation Gene A would have not the highest transcript count but the highest normalized read count. The number of reads a gene gets in RNA-seq depends on its expression but also on its length, as you mentioned. So a gene X could in theory have the same normalized read count (where the normalization does not take gene length into account) as another gene Y that is half as long as gene X but expressed twice as much. To address your last question: I would say that baseMean does not represent the level of transcript expression (since it does not consider gene length); however, it can give you a quick approximation. Using your example, it is probably safe to assume that the difference in normalized read counts is not explained by gene B being 250 times shorter (20000/80) than gene A while equally expressed. It's more likely that gene A is indeed more highly expressed. This approximation gets tricky, though, when the baseMean values are more similar, like 1000 and 2000.
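
The arithmetic behind this argument can be sketched in a few lines, under the simplifying assumption that the expected read count of a gene is proportional to its abundance times its length. The helper names `expected_reads` and `length_fold_needed` and the gene lengths are hypothetical, chosen only for illustration.

```python
def expected_reads(abundance, length_bp):
    """Simplified model (an assumption): reads scale with abundance times length."""
    return abundance * length_bp

def length_fold_needed(base_mean_high, base_mean_low):
    """Length fold difference that would explain a baseMean gap if expression were equal."""
    return base_mean_high / base_mean_low

# Gene Y is half as long as gene X but expressed twice as much: same expected reads.
gene_x_reads = expected_reads(abundance=1.0, length_bp=2000)
gene_y_reads = expected_reads(abundance=2.0, length_bp=1000)

# Values from the question: Gene A (20000) vs Gene B (80), and the closer pair 2000 vs 1000.
fold_a_vs_b = length_fold_needed(20000, 80)    # 250.0
fold_close = length_fold_needed(2000, 1000)    # 2.0
```

A 250-fold length difference between two genes is implausible, so length alone is unlikely to explain the first gap; a 2-fold difference is common, so the second comparison cannot be settled from baseMean alone.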

If you use salmon count matrices, for example, they already come with another form of normalization that accounts for gene length and is better suited to finding highly expressed genes in your data: TPM (transcripts per million).
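
As a rough illustration of what TPM does, the sketch below divides each count by the gene's length and rescales the resulting rates so they sum to one million. It is a simplification: salmon works from estimated read counts and effective lengths, while this example assumes plain counts and annotated lengths. The function name `tpm` and the input values are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Length-normalize counts, then scale so all values sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

# Two genes with identical read counts but different lengths (made-up numbers).
toy_counts = {"geneX": 2000, "geneY": 2000}
toy_lengths = {"geneX": 2000, "geneY": 1000}
toy_tpm = tpm(toy_counts, toy_lengths)
```

With equal counts, the shorter gene Y ends up with twice the TPM of gene X, which is the correction that baseMean does not make.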


Very interesting. I also used salmon count matrices for this dataset, but did not consider gene length so thoroughly before posting. Thank you for this input!


Perfect, then you can also get the TPM information after using the tximport function :) Glad if that helped.

3.7 years ago

I would note that some library prep methods are biased to one end of the transcript, in which case correcting for transcript length is not appropriate; long genes and short genes both have exactly one 3' end.

So you should find out what library prep was used before you consider adjusting for transcript length.
