Is linearity maintained in linear regression of RNA-Seq?
1
0
5.6 years ago
CY ▴ 750

Linear regression is used in cell type decomposition (e.g. TIMER or CIBERSORT). I am wondering whether the linearity assumption is actually met during modeling.

1. I imagine raw counts are not suitable, because differences in library size cause non-linearity. Log-transformed values probably cause non-linearity as well, right? So is there any kind of normalization that fits the linearity assumption?

2. Besides, unlike a microarray, an RNA-Seq library is a zero-sum game, which is said to be non-linear (although I don't see why this causes non-linearity. It may cause some sort of dependency between genes, but the overall expression is still the weighted sum of the expression of its components, right?)
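
To make point 1 concrete, here is a minimal Python sketch with made-up numbers (the two cell-type profiles and the mixing fractions are hypothetical and not taken from TIMER or CIBERSORT). It only illustrates that a mixture is a weighted sum of its components on the linear scale, and that this additivity no longer holds once the values are log-transformed.

```python
import math

# Hypothetical linear-scale expression (e.g. CPM) of three genes in two pure cell types.
cell_a = [100.0, 10.0, 1000.0]
cell_b = [10.0, 500.0, 1000.0]
frac_a, frac_b = 0.3, 0.7  # assumed mixing fractions, summing to 1


def log2p1(values):
    """log2(x + 1) transform of a list of values."""
    return [math.log2(v + 1.0) for v in values]


def weighted_sum(profile_a, profile_b):
    """Mix two profiles gene by gene using the assumed fractions."""
    return [frac_a * a + frac_b * b for a, b in zip(profile_a, profile_b)]


# Linear scale: the mixture is exactly the weighted sum of the component profiles.
mixture = weighted_sum(cell_a, cell_b)

# Log scale: the weighted sum of the log profiles is not the log of the mixture.
log_of_mixture = log2p1(mixture)
mix_of_logs = weighted_sum(log2p1(cell_a), log2p1(cell_b))

# Per-gene gap; it is zero only where both cell types express the gene equally.
gap = [lm - ml for lm, ml in zip(log_of_mixture, mix_of_logs)]
print(gap)
```

The gap is largest for the genes that differ most between the two cell types, which are exactly the marker genes a decomposition relies on.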

RNA-Seq • 2.5k views
0

RNA-seq data (raw counts) can be transformed for linear modeling. Try the voom method (from limma) on your RNA-seq data.

0

Even if logCPM (voom) transformed expression values maintain linearity, we still face the zero-sum game issue, which causes dependence between variables and non-linearity, right? This issue is inherent in the raw data and I can't see any way to fix it.

4
5.6 years ago

RNA-seq count data are not normally distributed; they more closely follow a negative binomial (overdispersed, Poisson-like) distribution. Running ordinary linear regression directly on RNA-seq counts, normalised or otherwise, is therefore not a great idea. DESeq2, for example, fits a negative binomial generalised linear model to the counts of each gene and, by default, derives its p-values via a Wald test applied to the model terms.

If you are looking to use RNA-seq data for cell deconvolution, I would go about obtaining the normalised, transformed counts, such as logCPM (edgeR), variance-stabilised (DESeq2), or regularised log (DESeq2) expression levels. In edgeR, you may play around with the prior count that is added to the counts before the log transformation, which keeps 0-count genes from producing log(0). DESeq2's transformations deal with these low-count genes in their own way.
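
As a rough illustration of what the prior count does, here is a simplified Python sketch. The function `log_cpm` and the example counts are hypothetical. With a prior of 0.5 the formula matches the one used by voom, log2((count + 0.5) / (library size + 1) × 10^6); it is not edgeR's exact `cpm(log=TRUE)` calculation, which additionally scales the prior by library size.

```python
import math


def log_cpm(counts, prior_count=0.5):
    """Simplified log2 counts-per-million for a single sample.

    The prior is added to every count, and twice the prior to the library
    size, so that a count of 0 still has a finite log value.
    """
    library_size = sum(counts)
    return [
        math.log2((c + prior_count) / (library_size + 2 * prior_count) * 1e6)
        for c in counts
    ]


# Hypothetical raw counts for four genes in one sample; the first gene has 0 reads.
sample = [0, 15, 485, 9500]

small_prior = log_cpm(sample, prior_count=0.5)
large_prior = log_cpm(sample, prior_count=2.0)

# A larger prior pulls the 0-count gene further up and barely moves the abundant gene.
print(small_prior[0], large_prior[0])
print(small_prior[3], large_prior[3])
```

A larger prior shrinks the spread among low-count genes, which damps noise at the cost of compressing real differences at the low end.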

Personally, I would then obtain Z-scores from the transformed data and use those for deconvolution - this is more readily interpreted. For example, you could regard genes with Z>3 as being highly expressed / representative of a tissue / cell-type, et cetera.
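
A minimal sketch of that Z-score step, assuming the Z-scores are computed per gene across samples (one possible reading; the helper `gene_z_scores` and the values are made up for illustration):

```python
import statistics


def gene_z_scores(values):
    """Z-scores of one gene across samples: (x - mean) / sample standard deviation.

    A gene with identical values in every sample has a standard deviation of 0
    and would need to be filtered out before calling this.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


# Hypothetical logCPM values of one gene in 20 samples; the last sample stands out.
log_expr = [2.0] * 19 + [10.0]

z = gene_z_scores(log_expr)

# Indices of samples in which the gene passes the Z > 3 cut-off.
high_samples = [i for i, score in enumerate(z) if score > 3]
print(high_samples)
```

Note that with the sample standard deviation the largest possible Z-score among n values is (n - 1) / sqrt(n), so a Z > 3 cut-off can only be reached with at least 11 samples.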

0

Thanks Kevin. If I understand you correctly, the methods you suggest normalize for library size and make the counts more normally distributed.

However, the zero-sum nature of an RNA-Seq library still causes dependency between variables and non-linear relationships. Would it compromise a linear model a lot?
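
For readers unsure what the zero-sum effect looks like in numbers, here is a small Python sketch with hypothetical abundances. It only illustrates the dependency itself and does not say how much it affects any particular decomposition method.

```python
def cpm(abundances):
    """Scale values so that they sum to one million (counts per million)."""
    total = sum(abundances)
    return [a / total * 1e6 for a in abundances]


# Hypothetical absolute transcript abundances of three genes in two conditions.
# Only gene 0 changes; genes 1 and 2 are truly constant.
baseline = [100.0, 450.0, 450.0]
induced = [1000.0, 450.0, 450.0]

cpm_base = cpm(baseline)
cpm_induced = cpm(induced)

# Genes 1 and 2 appear to drop although their absolute abundance is unchanged,
# because every sample is forced to sum to the same total.
print(cpm_base[1], cpm_induced[1])
```

Because each sample sums to a fixed total, an increase in one gene is necessarily offset by an apparent decrease in the others.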

0

For a more in-depth look at the statistics of this, try Cross Validated (Stack Exchange) or the Bioconductor support forum.