Question: Is it correct to use DESeq2's baseMean for interpretation of transcript expression across all samples?
kehoe (Leibniz Institute for Zoo and Wildlife Research) wrote 5 weeks ago:

Example:

Gene      baseMean
Gene A    20000.00
Gene B    80.00
Gene C    0

Refresher

  • baseMean: 'The values above are the average of the normalized count values, dividing by size factors, taken over all samples, normalizing for sequencing depth. It does not take into account gene length. The base mean is used in DESeq2 only for estimating the dispersion of a gene (it is used to estimate the fitted dispersion). For this task, the range of counts for a gene is relevant but not the gene's length (or other technical factors influencing the count, like sequence content).'
  • Gene length: 'Accounting for gene length is necessary for comparing expression between different genes within the same sample.'
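
To make the quoted definition concrete, here is a minimal sketch of how a baseMean-like value comes about: raw counts are divided by per-sample size factors and then averaged over samples. The counts are invented so that the result lands near the values in the example table, and the size factor function is a simplified median-of-ratios calculation written for illustration, not DESeq2's own code. Note that the result is a mean of normalized counts, not a dispersion estimate.

```python
import math
from statistics import median

# Hypothetical raw counts (rows: genes, columns: three samples).
# Sample 3 is sequenced about 4x deeper than sample 1.
counts = {
    "Gene A": [10000, 20000, 40000],
    "Gene B": [40, 80, 160],
    "Gene C": [0, 0, 0],
}

def size_factors(counts):
    """Simplified median-of-ratios size factors, one per sample."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with any zero count are skipped.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    # For each sample, take the median ratio of its counts to the gene-wise
    # geometric means.
    return [
        median(
            math.exp(math.log(counts[gene][j]) - lgm)
            for gene, lgm in log_geo_means.items()
        )
        for j in range(n_samples)
    ]

def base_mean(counts):
    """Mean over samples of counts divided by size factors."""
    sf = size_factors(counts)
    return {
        gene: sum(c / s for c, s in zip(row, sf)) / len(row)
        for gene, row in counts.items()
    }

for gene, value in base_mean(counts).items():
    print(gene, round(value, 2))
```

Gene length never enters this calculation, which is why the resulting values compare one gene across samples more naturally than different genes with each other.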

My questions:

  • Is the baseMean value the final dispersion estimate before fitting the GLM and testing?
  • Observation: 'Gene A' has the highest transcript count value, 'Gene B' the lowest, 'Gene C' was not identified in the data across all samples. Is this interpretation correct even without comparing a single gene between samples? For comparing 'Gene A' between samples, the log2FoldChange value is used, and padj estimates significance.
  • Would a combination of baseMean and log2FoldChange be useful to determine if a gene is highly present (expressed?) in all samples and differentially expressed between samples? Essentially, does baseMean = level of transcript (expression?) overall?

Thank you in advance!

caggtaagtat wrote 5 weeks ago:

Hi, in your observation Gene A would show not the highest transcript count but the highest normalized read count. The number of reads a gene gets in RNA-seq depends on its expression but also on its length, as you mentioned. So a gene X could in theory have the same normalized read count (where the normalization does not take gene length into account) as another gene Y that is half as long as gene X but expressed twice as highly.

To address your last question, I would say that baseMean does not represent the level of transcript expression (since it does not consider gene length); however, it can give you a quick approximation. Using your example, it is probably safe to assume that gene B is not equally expressed as gene A and simply 250 times shorter (20000/80), which is what it would take for length alone to explain the difference in normalized read counts. It's more likely that gene A is indeed more highly expressed. This approximation gets tricky, though, when the baseMean values are more similar, like 1000 and 2000.
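
The arithmetic behind this argument can be sketched with a toy model in which the expected read count is proportional to transcript length times abundance. The lengths, abundances, and the proportionality constant below are made up for illustration; real counts are also affected by other technical factors.

```python
def expected_reads(length_bp, abundance, reads_per_bp_per_unit=0.01):
    """Toy model: reads scale with transcript length times abundance."""
    return length_bp * abundance * reads_per_bp_per_unit

# Gene Y is half as long as gene X but twice as abundant: same read count.
gene_x = expected_reads(length_bp=4000, abundance=10)
gene_y = expected_reads(length_bp=2000, abundance=20)
print(gene_x == gene_y)

# Length ratio that would be needed to explain a count difference
# between two equally expressed genes.
ratio_a_vs_b = 20000 / 80      # 250-fold: implausible as a pure length effect
ratio_similar = 2000 / 1000    # 2-fold: easily explained by length alone
print(ratio_a_vs_b, ratio_similar)
```

A 250-fold length difference between two genes is unlikely, so expression is the more plausible explanation there, whereas a 2-fold difference in baseMean says little without knowing the lengths.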

If you use Salmon for quantification, for example, its output already includes another normalized measure, TPM (transcripts per million), which takes transcript length into account and is better suited to finding highly expressed genes in your data.
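
As a rough sketch of what TPM does, the function below divides each count by the transcript length in kilobases and then rescales so that the values sum to one million. The counts and lengths are invented, and plain annotated lengths are an assumption made here for simplicity; Salmon reports TPM based on effective lengths, so its numbers would differ.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    # Reads per kilobase: removes the length effect within a sample.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Rescale so that all values in the sample sum to one million.
    scale = sum(rpk.values()) / 1e6
    return {gene: value / scale for gene, value in rpk.items()}

# Hypothetical single-sample data: X and Y have equal counts,
# but Y is half as long.
counts = {"Gene X": 400, "Gene Y": 400, "Gene Z": 200}
lengths_bp = {"Gene X": 4000, "Gene Y": 2000, "Gene Z": 1000}

values = tpm(counts, lengths_bp)
for gene, value in values.items():
    print(gene, round(value))
```

With these numbers, gene Y ends up with twice the TPM of gene X despite the identical read count, which is the between-gene comparison that baseMean cannot provide.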


Very interesting. I also used Salmon count matrices for this dataset; however, I did not consider gene length so thoroughly before posting. Thank you for this input!

Reply written 5 weeks ago by kehoe

Perfect, then you can also get the TPM values after using the tximport function :) Glad if that helped.

Reply written 5 weeks ago by caggtaagtat
swbarnes2 (United States) wrote 5 weeks ago:

I would note that some library prep methods are biased to one end of the transcript, in which case correcting for transcript length is not appropriate; long genes and short genes both have exactly one 3' end.

So you should find out what library prep was used before you consider adjusting for transcript length.
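
A small illustration of why this matters, assuming an idealized 3'-biased protocol in which each transcript molecule yields reads from its 3' end only, so the count depends on abundance and not on length. The numbers are hypothetical.

```python
def reads_per_kb(count, length_bp):
    """Length-normalized count (reads per kilobase)."""
    return count / (length_bp / 1000)

# Two equally abundant genes under an idealized 3'-end protocol:
# equal abundance gives equal counts, whatever the gene length.
long_gene = {"length_bp": 4000, "count": 500}
short_gene = {"length_bp": 1000, "count": 500}

long_norm = reads_per_kb(long_gene["count"], long_gene["length_bp"])
short_norm = reads_per_kb(short_gene["count"], short_gene["length_bp"])

# Length correction now introduces a difference that is not real.
print(short_norm / long_norm)
```

Under this assumption the raw counts already reflect relative abundance, and dividing by length makes the short gene look four times more expressed than the equally abundant long gene.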
